    Trace norm regularization for multi-task learning with scarce data. (arXiv:2202.06742v2 [stat.ML] UPDATED)
    Multi-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.
    One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones. (arXiv:2202.07028v3 [cs.AI] UPDATED)
    We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended sequences of actions, an agent can easily ignore some instructions or get stuck in the middle of the long instructions and eventually fail the task. To address this challenge, we propose a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, and a milestone checker that systematically checks the agent's progress in its current milestone and determines when to proceed to the next. On the challenging ALFRED dataset, our M-TRACK leads to a notable 33% and 52% relative improvement in unseen success rate over two competitive base models.
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v2 [cs.LG] UPDATED)
    Parameter estimation in empirical fields is usually undertaken using parametric models, and such models readily facilitate statistical inference. Unfortunately, they are unlikely to be sufficiently flexible to be able to adequately model real-world phenomena, and may yield biased estimates. Conversely, non-parametric approaches are flexible but do not readily facilitate statistical inference and may still exhibit residual bias. We explore the potential for Influence Functions (IFs) to (a) improve initial estimators without needing more data, (b) increase model robustness, and (c) facilitate statistical inference. We begin with a broad introduction to IFs, and propose a neural network method, 'MultiNet', which seeks the diversity of an ensemble using a single architecture. We also introduce variants on the IF update step which we call 'MultiStep', and provide a comprehensive evaluation of different approaches. The improvements are found to be dataset dependent, indicating an interaction between the methods used and the nature of the data generating process. Our experiments highlight the need for practitioners to check the consistency of their findings, potentially by undertaking multiple analyses with different combinations of estimators. We also show that it is possible to improve existing neural networks for 'free', without needing more data, and without needing to retrain them.
    Predicting the Thermal Sunyaev-Zel'dovich Field using Modular and Equivariant Set-Based Neural Networks. (arXiv:2203.00026v2 [astro-ph.CO] UPDATED)
    Theoretical uncertainty limits our ability to extract cosmological information from baryonic fields such as the thermal Sunyaev-Zel'dovich (tSZ) effect. Being sourced by the electron pressure field, the tSZ effect depends on baryonic physics that is usually modeled by expensive hydrodynamic simulations. We train neural networks on the IllustrisTNG-300 cosmological simulation to predict the continuous electron pressure field in galaxy clusters from gravity-only simulations. Modeling clusters is challenging for neural networks as most of the gas pressure is concentrated in a handful of voxels and even the largest hydrodynamical simulations contain only a few hundred clusters that can be used for training. Instead of conventional convolutional neural net (CNN) architectures, we choose to employ a rotationally equivariant DeepSets architecture to operate directly on the set of dark matter particles. We argue that set-based architectures provide distinct advantages over CNNs. For example, we can enforce exact rotational and permutation equivariance, incorporate existing knowledge on the tSZ field, and work with sparse fields as are standard in cosmology. We compose our architecture with separate, physically meaningful modules, making it amenable to interpretation. For example, we can separately study the influence of local and cluster-scale environment, determine that cluster triaxiality has negligible impact, and train a module that corrects for mis-centering. Our model improves by 70% on analytic profiles fit to the same simulation data. We argue that the electron pressure field, viewed as a function of a gravity-only simulation, has inherent stochasticity, and model this property through a conditional-VAE extension to the network. This modification yields a further improvement of 7%, although the gain is limited by our small training set. (abridged)
    Tackling covariate shift with node-based Bayesian neural networks. (arXiv:2206.02435v2 [stat.ML] UPDATED)
    Bayesian neural networks (BNNs) promise improved generalization under covariate shift by providing principled probabilistic representations of epistemic uncertainty. However, weight-based BNNs often struggle with high computational complexity of large-scale architectures and datasets. Node-based BNNs have recently been introduced as scalable alternatives, which induce epistemic uncertainty by multiplying each hidden node with latent random variables, while learning a point-estimate of the weights. In this paper, we interpret these latent noise variables as implicit representations of simple and domain-agnostic data perturbations during training, producing BNNs that perform well under covariate shift due to input corruptions. We observe that the diversity of the implicit corruptions depends on the entropy of the latent variables, and propose a straightforward approach to increase the entropy of these variables during training. We evaluate the method on out-of-distribution image classification benchmarks, and show improved uncertainty estimation of node-based BNNs under covariate shift due to input perturbations. As a side effect, the method also provides robustness against noisy training labels.
    CoCon: A Self-Supervised Approach for Controlled Text Generation. (arXiv:2006.03535v3 [cs.CL] UPDATED)
    Pretrained Transformer-based language models (LMs) display remarkable natural language generation capabilities. With their immense potential, controlling text generation of such LMs is getting attention. While there are studies that seek to control high-level attributes (such as sentiment and topic) of generated text, there is still a lack of more precise control over its content at the word- and phrase-level. Here, we propose Content-Conditioner (CoCon) to control an LM's output text with a content input, at a fine-grained level. In our self-supervised approach, the CoCon block learns to help the LM complete a partially-observed text sequence by conditioning with content inputs that are withheld from the LM. Through experiments, we show that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner.
    Refined Convergence and Topology Learning for Decentralized Optimization with Heterogeneous Data. (arXiv:2204.04452v2 [cs.LG] UPDATED)
    One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the Decentralized Stochastic Gradient Descent (D-SGD) algorithm under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts in decentralized learning. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
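    As a rough illustration of the D-SGD update described above, the following is a minimal sketch under stated assumptions, not the paper's method. Each agent i holds a hypothetical scalar objective f_i(x) = 0.5 * (x - b_i)^2, takes a local gradient step, then averages with its neighbors through a doubly stochastic mixing matrix W. The function names, the ring and complete topologies, and the step size are all illustrative choices.

```python
import random


def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    each agent averages itself and its two neighbours with weight 1/3."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def complete_mixing_matrix(n):
    """Mixing matrix for a complete graph: uniform averaging over all agents."""
    return [[1.0 / n] * n for _ in range(n)]


def dsgd(targets, W, lr=0.1, steps=200, noise=0.0, seed=0):
    """Run D-SGD on the toy objectives f_i(x) = 0.5 * (x - targets[i])**2.

    `noise` is the standard deviation of Gaussian gradient noise
    (0.0 gives deterministic gradients). Returns the final iterate per agent.
    """
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n
    for _ in range(steps):
        # Local (stochastic) gradient step at every agent.
        half = [
            x[i] - lr * ((x[i] - targets[i]) + rng.gauss(0.0, noise))
            for i in range(n)
        ]
        # Gossip step: weighted average over neighbours.
        x = [sum(W[i][j] * half[j] for j in range(n)) for i in range(n)]
    return x


# Heterogeneous local optima, loosely mimicking skewed data across agents.
targets = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
x_ring = dsgd(targets, ring_mixing_matrix(6))
x_full = dsgd(targets, complete_mixing_matrix(6))
print("ring spread:", max(x_ring) - min(x_ring))
print("complete spread:", max(x_full) - min(x_full))
```

    In this toy setting the average iterate approaches the global optimum under either topology, but with a constant step size the ring leaves the agents in disagreement while uniform averaging does not, which gives a small-scale picture of how topology and heterogeneity interact.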
    Trainability of Dissipative Perceptron-Based Quantum Neural Networks. (arXiv:2005.12458v2 [quant-ph] UPDATED)
    Several architectures have been proposed for quantum neural networks (QNNs), with the goal of efficiently performing machine learning tasks on quantum data. Rigorous scaling results are urgently needed for specific QNN constructions to understand which, if any, will be trainable at a large scale. Here, we analyze the gradient scaling (and hence the trainability) for a recently proposed architecture that we called dissipative QNNs (DQNNs), where the input qubits of each layer are discarded at the layer's output. We find that DQNNs can exhibit barren plateaus, i.e., gradients that vanish exponentially in the number of qubits. Moreover, we provide quantitative bounds on the scaling of the gradient for DQNNs under different conditions, such as different cost functions and circuit depths, and show that trainability is not always guaranteed.
    Looper: An end-to-end ML platform for product decisions. (arXiv:2110.07554v7 [cs.LG] UPDATED)
    Modern software systems and products increasingly rely on machine learning models to make data-driven decisions based on interactions with users, infrastructure and other systems. For broader adoption, this practice must (i) accommodate product engineers without ML backgrounds, (ii) support fine-grained product-metric evaluation and (iii) optimize for product goals. To address shortcomings of prior platforms, we introduce general principles for and the architecture of an ML platform, Looper, with simple APIs for decision-making and feedback collection. Looper covers the end-to-end ML lifecycle from collecting training data and model training to deployment and inference, and extends support to personalization, causal evaluation with heterogeneous treatment effects, and Bayesian tuning for product goals. During the 2021 production deployment Looper simultaneously hosted 440-1,000 ML models that made 4-6 million real-time decisions per second. We sum up experiences of platform adopters and describe their learning curve.
    Learning Classifiers under Delayed Feedback with a Time Window Assumption. (arXiv:2009.13092v2 [cs.LG] UPDATED)
    We consider training a binary classifier under delayed feedback (DF learning). For example, in conversion prediction for online ads, we initially receive negative samples from users who clicked an ad but did not buy an item; subsequently, some of these users buy an item and their samples change to positive. In the DF learning setting, we observe samples over time and then learn a classifier at some point. This problem arises in various real-world applications such as online advertising, where the user action takes place long after the first click. Owing to the delayed feedback, naive classification of the positive and negative samples returns a biased classifier. One solution is to use only samples that have been observed for longer than a certain time window, assuming these samples are correctly labeled. However, existing studies reported that simply using a subset of all samples based on the time window assumption does not perform well, and that using all samples along with the time window assumption improves empirical performance. We extend these existing studies and propose a method with an unbiased and convex empirical risk that is constructed from all samples under the time window assumption. To demonstrate the soundness of the proposed method, we provide experimental results on a synthetic dataset and an open dataset of real traffic logs from online advertising.
    Linear Bandit Algorithms with Sublinear Time Complexity. (arXiv:2103.02729v2 [cs.LG] UPDATED)
    We propose two linear bandit algorithms with per-step complexity sublinear in the number of arms $K$. The algorithms are designed for applications where the arm set is extremely large and slowly changing. Our key realization is that choosing an arm reduces to a maximum inner product search (MIPS) problem, which can be solved approximately without breaking regret guarantees. Existing approximate MIPS solvers run in sublinear time. We extend those solvers and present theoretical guarantees for online learning problems, where adaptivity (i.e., a later step depends on the feedback in previous steps) becomes a unique challenge. We then explicitly characterize the tradeoff between the per-step complexity and regret. For sufficiently large $K$, our algorithms have sublinear per-step complexity and $\tilde O(\sqrt{T})$ regret. Empirically, we evaluate our proposed algorithms in a synthetic environment and a real-world online movie recommendation problem. Our proposed algorithms can deliver a more than 72 times speedup compared to the linear time baselines while retaining similar regret.
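    To make the reduction to MIPS concrete, here is a minimal sketch under stated assumptions: given a query vector (for instance a sampled or estimated parameter vector), picking an arm means finding the arm feature vector with the largest inner product. `mips_exact` is the linear-time scan; `mips_subsample` is a deliberately crude hypothetical stand-in that only shows where a sublinear approximate solver would plug in. It carries no guarantees and is not the solver used in the paper.

```python
import random


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mips_exact(query, arms):
    """Linear scan: index of the arm with the largest inner product."""
    return max(range(len(arms)), key=lambda i: inner(query, arms[i]))


def mips_subsample(query, arms, m, rng):
    """Crude approximate MIPS (illustrative only): scan m random arms."""
    candidates = rng.sample(range(len(arms)), m)
    return max(candidates, key=lambda i: inner(query, arms[i]))


arms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
theta = [1.0, 1.0]  # hypothetical parameter estimate used as the query
print(mips_exact(theta, arms))
print(mips_subsample(theta, arms, m=2, rng=random.Random(0)))
```

    The exact scan costs time linear in the number of arms per step, which is the cost an approximate solver aims to avoid when the arm set is very large.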
    Self-Correcting Neural Networks For Safe Classification. (arXiv:2107.11445v2 [cs.LG] UPDATED)
    Classifiers learnt from data are increasingly being used as components in systems where safety is a critical concern. In this work, we present a formal notion of safety for classifiers via constraints called safe-ordering constraints. These constraints relate requirements on the order of the classes output by a classifier to conditions on its input, and are expressive enough to encode various interesting examples of classifier safety specifications from the literature. For classifiers implemented using neural networks, we also present a run-time mechanism for the enforcement of safe-ordering constraints. Our approach is based on a self-correcting layer, which provably yields safe outputs regardless of the characteristics of the classifier input. We compose this layer with an existing neural network classifier to construct a self-correcting network (SC-Net), and show that in addition to providing safe outputs, the SC-Net is guaranteed to preserve the classification accuracy of the original network whenever possible. Our approach is independent of the size and architecture of the neural network used for classification, depending only on the specified property and the dimension of the network's output; thus it is scalable to large state-of-the-art networks. We show that our approach can be optimized for a GPU, introducing run-time overhead of less than 1ms on current hardware -- even on large, widely-used networks containing hundreds of thousands of neurons and millions of parameters.
    Popularity Adjusted Block Models are Generalized Random Dot Product Graphs. (arXiv:2109.04010v2 [stat.ML] UPDATED)
    We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.
    Low-Rank Tensor Recovery with Euclidean-Norm-Induced Schatten-p Quasi-Norm Regularization. (arXiv:2012.03436v3 [cs.LG] UPDATED)
    The nuclear norm and Schatten-$p$ quasi-norm are popular rank proxies in low-rank matrix recovery. Unfortunately, computing the nuclear norm or Schatten-$p$ quasi-norm of a tensor is NP-hard, which hinders low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA). In this paper, we propose a new class of tensor rank regularizers based on the Euclidean norms of the CP component vectors of a tensor and show that these regularizers are monotonic transformations of the tensor Schatten-$p$ quasi-norm. This connection enables us to minimize the Schatten-$p$ quasi-norm in LRTC and TRPCA implicitly. The methods do not use the singular value decomposition and hence scale to big tensors. Moreover, the methods are not sensitive to the choice of initial rank and provide an arbitrarily sharper rank proxy for low-rank tensor recovery compared to the nuclear norm. We also study the generalization abilities of LRTC with Schatten-$p$ quasi-norm regularization and LRTC with our regularizers. The theorems show that a relatively sharper regularizer leads to a tighter error bound, which is consistent with our numerical results. Numerical results on synthetic data and real data demonstrate the effectiveness and superiority of our methods compared to baseline methods.
    Topologically penalized regression on manifolds. (arXiv:2110.13749v2 [cs.LG] UPDATED)
    We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, which are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both its prediction error and its smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is "topologically smooth".
    Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. (arXiv:1910.06151v3 [cs.DS] UPDATED)
    We present an algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices, generalizing the series of results started by Tang's breakthrough quantum-inspired algorithm for recommendation systems [STOC'19]. Motivated by quantum linear algebra algorithms and the quantum singular value transformation (SVT) framework of Gilyén, Su, Low, and Wiebe [STOC'19], we develop classical algorithms for SVT that run in time independent of input dimension, under suitable quantum-inspired sampling assumptions. Our results give compelling evidence that in the corresponding QRAM data structure input model, quantum SVT does not yield exponential quantum speedups. Since the quantum SVT framework generalizes essentially all known techniques for quantum linear algebra, our results, combined with sampling lemmas from previous work, suffice to generalize all recent results about dequantizing quantum machine learning algorithms. In particular, our classical SVT framework recovers and often improves the dequantization results on recommendation systems, principal component analysis, supervised clustering, support vector machines, low-rank regression, and semidefinite program solving. We also give additional dequantization results on low-rank Hamiltonian simulation and discriminant analysis. Our improvements come from identifying the key feature of the quantum-inspired input model that is at the core of all prior quantum-inspired results: $\ell^2$-norm sampling can approximate matrix products in time independent of their dimension. We reduce all our main results to this fact, making our exposition concise, self-contained, and intuitive.
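    The closing observation about $\ell^2$-norm sampling can be illustrated with the classical randomized estimator for a matrix product: sample an inner index k with probability proportional to the squared norm of column k of A, and average the rescaled outer products of column k of A with row k of B. The sketch below is a plain-Python illustration of that standard estimator under stated assumptions, not the paper's framework. The function name and the dense list-of-lists representation are illustrative choices, and the setup step here reads all of A, whereas the quantum-inspired input model assumes sampling access is already available.

```python
import random


def sampled_product(A, B, s, rng):
    """Unbiased estimate of the product A @ B from s sampled outer products.

    Inner index k is drawn with probability p_k proportional to the squared
    Euclidean norm of column k of A; each draw contributes
    A[:, k] * B[k, :] / (s * p_k).
    """
    rows, inner_dim, cols = len(A), len(B), len(B[0])
    col_sq = [sum(A[r][k] ** 2 for r in range(rows)) for k in range(inner_dim)]
    total = sum(col_sq)
    probs = [c / total for c in col_sq]
    est = [[0.0] * cols for _ in range(rows)]
    for k in rng.choices(range(inner_dim), weights=probs, k=s):
        scale = 1.0 / (s * probs[k])
        for r in range(rows):
            a = A[r][k] * scale
            for c in range(cols):
                est[r][c] += a * B[k][c]
    return est


def frobenius_error(X, Y):
    """Frobenius norm of the difference of two same-shaped matrices."""
    return sum(
        (x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)
    ) ** 0.5


A = [[1.0, 2.0], [3.0, 4.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
approx = sampled_product(A, I2, s=20000, rng=random.Random(0))
print(frobenius_error(approx, A))
```

    Once sampling probabilities are available, the per-sample work depends on the output size rather than on the inner dimension, and the expected squared error decays like 1/s.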
    Encoding protein dynamic information in graph representation for functional residue identification. (arXiv:2112.12033v2 [q-bio.BM] UPDATED)
    Recent advances in protein function prediction exploit graph-based deep learning approaches to correlate the structural and topological features of proteins with their molecular functions. However, proteins in vivo are not static but dynamic molecules that alter conformation for functional purposes. Here we apply normal mode analysis to native protein conformations and augment protein graphs by connecting edges between dynamically correlated residue pairs. In the multilabel function classification task, our method demonstrates a remarkable performance gain based on this dynamics-informed representation. The proposed graph neural network, ProDAR, increases the interpretability and generalizability of residue-level annotations and robustly reflects structural nuance in proteins. We elucidate the importance of dynamic information in graph representation by comparing class activation maps for hMTH1, nitrophorin, and SARS-CoV-2 receptor binding domain. Our model successfully learns the dynamic fingerprints of proteins and pinpoints the residues of functional impacts, with vast untapped potential for broad biotechnology and pharmaceutical applications.
    Rethinking Spatial Invariance of Convolutional Networks for Object Counting. (arXiv:2206.05253v1 [cs.CV])
    Previous work generally assumes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found that overly strict pixel-level spatial invariance causes overfitting to noise in the density map generation. In this paper, we try to use locally connected Gaussian kernels to replace the original convolution filter to estimate the spatial position in the density map. The purpose of this is to allow the feature extraction process to potentially stimulate the density map generation process to overcome the annotation noise. Inspired by previous work, we propose a low-rank approximation accompanied with translation invariance to favorably implement the approximation of massive Gaussian convolution. Our work points to a new direction for follow-up research, which should investigate how to properly relax the overly strict pixel-level spatial invariance for object counting. We evaluate our methods on 4 mainstream object counting networks (i.e., MCNN, CSRNet, SANet, and ResNet-50). Extensive experiments were conducted on 7 popular benchmarks for 3 applications (i.e., crowd, vehicle, and plant counting). Experimental results show that our methods significantly outperform other state-of-the-art methods and achieve promising learning of the spatial position of objects.
    Interactively Learning Preference Constraints in Linear Bandits. (arXiv:2206.05255v1 [cs.LG])
    We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst case. In the average case, ACOL's sample complexity bound is still significantly tighter than bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.
    Meta Optimal Transport. (arXiv:2206.05262v1 [cs.LG])
    We study the use of amortized optimization to predict optimal transport (OT) maps from the input measures, which we call Meta OT. This helps repeatedly solve similar OT problems between different measures by leveraging the knowledge and information present from past problems to rapidly predict and solve new problems. Otherwise, standard methods ignore the knowledge of the past solutions and suboptimally re-solve each problem from scratch. Meta OT models surpass the standard convergence rates of log-Sinkhorn solvers in the discrete setting and convex potentials in the continuous setting. We improve the computational time of standard OT solvers by multiple orders of magnitude in discrete and continuous transport settings between images, spherical data, and color palettes. Our source code is available at this http URL
    AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models. (arXiv:2010.03688v2 [cs.CL] UPDATED)
    Transformers have greatly advanced the state-of-the-art in Natural Language Processing (NLP) in recent years, but present very large computation and storage requirements. We observe that the design process of Transformers (pre-train a foundation model on a large dataset in a self-supervised manner, and subsequently fine-tune it for different downstream tasks) leads to task-specific models that are highly over-parameterized, adversely impacting both accuracy and inference efficiency. We propose AxFormer, a systematic framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task. AxFormer combines two key optimizations -- accuracy-driven pruning and selective hard attention. Accuracy-driven pruning identifies and removes parts of the fine-tuned transformer that hinder performance on the given downstream task. Sparse hard-attention optimizes attention blocks in selected layers by eliminating irrelevant word aggregations, thereby helping the model focus only on the relevant parts of the input. In effect, AxFormer leads to models that are more accurate, while also being faster and smaller. Our experiments on GLUE and SQuAD tasks show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models. In addition, we demonstrate that AxFormer can be combined with previous efforts such as distillation or quantization to achieve further efficiency gains.
    Projected State-action Balancing Weights for Offline Reinforcement Learning. (arXiv:2109.04640v2 [cs.LG] UPDATED)
    Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
    LassoBench: A High-Dimensional Hyperparameter Optimization Benchmark Suite for Lasso. (arXiv:2111.02790v3 [cs.LG] UPDATED)
    While Weighted Lasso sparse regression has appealing statistical guarantees that would entail a major real-world impact in finance, genomics, and brain imaging applications, it is rarely adopted due to its complex high-dimensional space composed of thousands of hyperparameters. On the other hand, the latest progress with high-dimensional hyperparameter optimization (HD-HPO) methods for black-box functions demonstrates that high-dimensional applications can indeed be efficiently optimized. Despite this initial success, HD-HPO approaches are mostly applied to synthetic problems with a moderate number of dimensions, which limits their impact in scientific and engineering applications. We propose LassoBench, the first benchmark suite tailored for Weighted Lasso regression. LassoBench consists of benchmarks for both well-controlled synthetic setups (number of samples, noise level, ambient and effective dimensionalities, and multiple fidelities) and real-world datasets, which enables many flavors of HPO algorithms to be studied and extended to the high-dimensional Lasso setting. We evaluate 6 state-of-the-art HPO methods and 3 Lasso baselines, and demonstrate that Bayesian optimization and evolutionary strategies can improve over the methods commonly used for sparse regression while highlighting limitations of these frameworks in very high-dimensional and noisy settings.
    Accelerated Algorithms for Monotone Inclusions and Constrained Nonconvex-Nonconcave Min-Max Optimization. (arXiv:2206.05248v1 [math.OC])
    We study monotone inclusions and monotone variational inequalities, as well as their generalizations to non-monotone settings. We first show that the Extra Anchored Gradient (EAG) algorithm, originally proposed by Yoon and Ryu [2021] for unconstrained convex-concave min-max optimization, can be applied to solve the more general problem of Lipschitz monotone inclusion. More specifically, we prove that the EAG solves Lipschitz monotone inclusion problems with an accelerated convergence rate of $O(\frac{1}{T})$, which is optimal among all first-order methods [Diakonikolas, 2020, Yoon and Ryu, 2021]. Our second result is a new algorithm, called Extra Anchored Gradient Plus (EAG+), which not only achieves the accelerated $O(\frac{1}{T})$ convergence rate for all monotone inclusion problems, but also exhibits the same accelerated rate for a family of general (non-monotone) inclusion problems that concern negative comonotone operators. As a special case of our second result, EAG+ enjoys the $O(\frac{1}{T})$ convergence rate for solving a non-trivial class of nonconvex-nonconcave min-max optimization problems. Our analyses are based on simple potential function arguments, which might be useful for analysing other accelerated algorithms.
    Is Self-Supervised Learning More Robust Than Supervised Learning?. (arXiv:2206.05259v1 [cs.CV])
    Self-supervised contrastive learning is a powerful tool to learn visual representation without labels. Prior work has primarily focused on evaluating the recognition accuracy of various pre-training algorithms, but has overlooked other behavioral aspects. In addition to accuracy, distributional robustness plays a critical role in the reliability of machine learning models. We design and conduct a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning to downstream or pre-training data distribution changes. These tests leverage data corruptions at multiple levels, ranging from pixel-level gamma distortion to patch-level shuffling and to dataset-level distribution shift. Our tests unveil intriguing robustness behaviors of contrastive and supervised learning. On the one hand, under downstream corruptions, we generally observe that contrastive learning is surprisingly more robust than supervised learning. On the other hand, under pre-training corruptions, we find contrastive learning vulnerable to patch shuffling and pixel intensity change, yet less sensitive to dataset-level distribution change. We attempt to explain these results through the role of data augmentation and feature space properties. Our insight has implications in improving the downstream robustness of supervised learning.
    Unifying mirror descent and dual averaging. (arXiv:1910.13742v4 [math.OC] UPDATED)
    We introduce and analyze a new family of first-order optimization algorithms which generalizes and unifies both mirror descent and dual averaging. Within the framework of this family, we define new algorithms for constrained optimization that combine the advantages of mirror descent and dual averaging. Our preliminary simulation study shows that these new algorithms significantly outperform available methods in some situations.
    Learning the Space of Deep Models. (arXiv:2206.05194v1 [cs.CV])
    Embedding of large but redundant data, such as images or text, in a hierarchy of lower-dimensional spaces is one of the key features of representation learning approaches, which nowadays provide state-of-the-art solutions to problems once believed hard or impossible to solve. In this work, in a plot twist with a strong meta aftertaste, we show how trained deep models are as redundant as the data they are optimized to process, and how it is therefore possible to use deep learning models to embed deep learning models. In particular, we show that it is possible to use representation learning to learn a fixed-size, low-dimensional embedding space of trained deep models and that such space can be explored by interpolation or optimization to attain ready-to-use models. We find that it is possible to learn an embedding space of multiple instances of the same architecture and of multiple architectures. We address image classification and neural representation of signals, showing how our embedding space can be learnt so as to capture the notions of performance and 3D shape, respectively. In the Multi-Architecture setting we also show how an embedding trained only on a subset of architectures can learn to generate already-trained instances of architectures it never sees instantiated at training time.
    ROI Constrained Bidding via Curriculum-Guided Bayesian Reinforcement Learning. (arXiv:2206.05240v1 [cs.LG])
    Real-Time Bidding (RTB) is an important mechanism in modern online advertising systems. Advertisers employ bidding strategies in RTB to optimize their advertising effects subject to various financial requirements, among which a widely adopted one is the return-on-investment (ROI) constraint. ROIs change non-monotonically during the sequential bidding process, usually presenting a see-saw effect between constraint satisfaction and objective optimization. Existing solutions to the constraint-objective trade-off are typically established in static or mildly changing markets. However, these methods fail significantly in non-stationary advertising markets due to their inability to adapt to varying dynamics and partial observability. In this work, we specialize in ROI-Constrained Bidding in non-stationary markets. Based on a Partially Observable Constrained Markov Decision Process, we propose the first hard barrier solution to accommodate non-monotonic constraints. Our method exploits a parameter-free indicator-augmented reward function and develops a Curriculum-Guided Bayesian Reinforcement Learning (CBRL) framework to adaptively control the constraint-objective trade-off in non-stationary advertising markets. Extensive experiments on a large-scale industrial dataset with two problem settings reveal that CBRL generalizes well in both in-distribution and out-of-distribution data regimes, and enjoys outstanding stability.
    Causal Balancing for Domain Generalization. (arXiv:2206.05263v1 [cs.LG])
    While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. While current domain generalization methods usually focus on enforcing certain invariance properties across different domains by new loss function designs, we propose a balanced mini-batch sampling strategy to reduce the domain-specific spurious correlations in the observed training distributions. More specifically, we propose a two-phased method that 1) identifies the source of spurious correlations, and 2) builds balanced mini-batches free from spurious correlations by matching on the identified source. We provide an identifiability guarantee of the source of spuriousness and show that our proposed approach provably samples from a balanced, spurious-free distribution over all training environments. Experiments are conducted on three computer vision datasets with documented spurious correlations, demonstrating empirically that our balanced mini-batch sampling strategy improves the performance of four different established domain generalization model baselines compared to the random mini-batch sampling strategy.
    Measuring the Carbon Intensity of AI in Cloud Instances. (arXiv:2206.05229v1 [cs.LG])
    By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, the computational demands of which incur a high energy cost and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access to measurements of this information, precluding development of actionable tactics. Cloud providers presenting information about software carbon intensity to users is a fundamental stepping stone towards minimizing emissions. In this paper, we provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions by using location-based and time-specific marginal emissions data per energy unit. We provide measurements of operational software carbon intensity for a set of modern models for natural language processing and computer vision, and a wide range of model sizes, including pretraining of a 6.1 billion parameter language model. We then evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform: using cloud instances in different geographic regions, using cloud instances at different times of day, and dynamically pausing cloud instances when the marginal carbon intensity is above a certain threshold. We confirm previous results that the geographic region of the data center plays a significant role in the carbon intensity for a given cloud instance, and find that choosing an appropriate region can have the largest operational emissions reduction impact. We also show that the time of day has notable impact on operational software carbon intensity. Finally, we conclude with recommendations for how machine learning practitioners can use software carbon intensity information to reduce environmental impact.
    StructCoder: Structure-Aware Transformer for Code Generation. (arXiv:2206.05239v1 [cs.LG])
    There has been a recent surge of interest in automating software engineering tasks using deep learning. This work addresses the problem of code generation where the goal is to generate target code given source code in a different language or a natural language description. Most of the state-of-the-art deep learning models for code generation use training strategies that are primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also ensure that our decoder preserves the syntax and data flow of the target code by introducing two auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder to enhance the quality of generated code by modeling target syntax and data flow. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark.
    Tight Bounds for State Tomography with Incoherent Measurements. (arXiv:2206.05265v1 [quant-ph])
    We consider the classic question of state tomography: given copies of an unknown quantum state $\rho\in\mathbb{C}^{d\times d}$, output $\widehat{\rho}$ for which $\|\rho - \widehat{\rho}\|_{\mathsf{tr}} \le \varepsilon$. When one is allowed to make coherent measurements entangled across all copies, $\Theta(d^2/\varepsilon^2)$ copies are necessary and sufficient [Haah et al. '17, O'Donnell-Wright '16]. Unfortunately, the protocols achieving this rate incur large quantum memory overheads that preclude implementation on current or near-term devices. On the other hand, the best known protocol using incoherent (single-copy) measurements uses $O(d^3/\varepsilon^2)$ copies [Kueng-Rauhut-Terstiege '17], and multiple papers have posed it as an open question to understand whether or not this rate is tight. In this work, we fully resolve this question, by showing that any protocol using incoherent measurements, even if they are chosen adaptively, requires $\Omega(d^3/\varepsilon^2)$ copies, matching the upper bound of [Kueng-Rauhut-Terstiege '17]. We do so by a new proof technique which directly bounds the "tilt" of the posterior distribution after measurements, which yields a surprisingly short proof of our lower bound, and which we believe may be of independent interest.
    Street Crossing Aid Using Light-weight CNNs for the Visually Impaired. (arXiv:1909.09598v2 [cs.CV] UPDATED)
    In this paper, we address an issue that the visually impaired commonly face while crossing intersections and propose a solution that takes the form of a mobile application. The application utilizes a deep learning convolutional neural network model, LytNetV2, to output necessary information that the visually impaired may lack when without human companions or guide dogs. A prototype of the application runs on iOS devices of versions 11 or above. It is designed for comprehensiveness, concision, accuracy, and computational efficiency by delivering, in real time, the two most important pieces of information required to cross the road: pedestrian traffic light color and direction. Furthermore, it is specifically aimed at supporting those facing financial burden, as the solution takes the form of a free mobile application. Through the modification and utilization of key principles in MobileNetV3 such as depthwise separable convolutions and squeeze-excite layers, the deep neural network model achieves a classification accuracy of 96% and average angle error of 6.15 degrees, while running at a frame rate of 16.34 frames per second. Additionally, the model is trained as an image classifier, allowing for a faster and more accurate model. The network is able to outperform other methods such as object detection and non-deep learning algorithms in both accuracy and thoroughness. The information is delivered through both auditory signals and vibrations, and the application has been tested with seven visually impaired users, receiving above-satisfactory responses.
    Multifidelity Reinforcement Learning with Control Variates. (arXiv:2206.05165v1 [cs.LG])
    In many computational science and engineering applications, the output of a system of interest corresponding to a given input can be queried at different levels of fidelity with different costs. Typically, low-fidelity data is cheap and abundant, while high-fidelity data is expensive and scarce. In this work we study the reinforcement learning (RL) problem in the presence of multiple environments with different levels of fidelity for a given control task. We focus on improving the RL agent's performance with multifidelity data. Specifically, a multifidelity estimator that exploits the cross-correlations between the low- and high-fidelity returns is proposed to reduce the variance in the estimation of the state-action value function. The proposed estimator, which is based on the method of control variates, is used to design a multifidelity Monte Carlo RL (MFMCRL) algorithm that improves the learning of the agent in the high-fidelity environment. The impacts of variance reduction on policy evaluation and policy improvement are theoretically analyzed by using probability bounds. Our theoretical analysis and numerical experiments demonstrate that for a finite budget of high-fidelity data samples, our proposed MFMCRL agent attains superior performance compared with that of a standard RL agent that uses only the high-fidelity environment data for learning the optimal policy.
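    The control-variate idea behind the estimator described above can be sketched in a few lines. This is the textbook control-variate mean estimator, not the paper's MFMCRL algorithm: the function name, its arguments, and the assumption that a precise low-fidelity mean is available from many cheap samples are all illustrative.

```python
import statistics


def control_variate_mean(high, low_paired, low_mean):
    """Estimate the mean high-fidelity return using a low-fidelity control variate.

    high       -- a few expensive high-fidelity returns
    low_paired -- low-fidelity returns for the same inputs as `high`
    low_mean   -- low-fidelity mean estimated from many cheap samples (assumed given)
    """
    n = len(high)
    mean_h = statistics.fmean(high)
    mean_l = statistics.fmean(low_paired)
    # Sample covariance between paired high- and low-fidelity returns.
    cov = sum((h - mean_h) * (l - mean_l) for h, l in zip(high, low_paired)) / (n - 1)
    var_l = statistics.variance(low_paired)
    # Variance-minimizing coefficient; fall back to the plain mean if var_l is zero.
    c = cov / var_l if var_l > 0 else 0.0
    # Shift the high-fidelity mean by the observed error of the paired low-fidelity mean.
    return mean_h + c * (low_mean - mean_l)
```

    The stronger the correlation between the two fidelities, the more the correction term cancels the sampling noise in the small high-fidelity batch.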
    GD-VAEs: Geometric Dynamic Variational Autoencoders for Learning Nonlinear Dynamics and Dimension Reductions. (arXiv:2206.05183v1 [cs.LG])
    We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. We develop approaches for learning nonlinear state space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transpose CNNs (T-CNNs). Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning low dimensional representations of the nonlinear Burgers equations, constrained mechanical systems, and spatial fields of reaction-diffusion systems. GD-VAEs provide methods for obtaining representations for use in learning tasks involving dynamics.
    Hierarchical Federated Learning with Privacy. (arXiv:2206.05209v1 [cs.LG])
    Federated learning (FL), where data remains at the federated clients, and where only gradient updates are shared with a central aggregator, was assumed to be private. Recent work demonstrates that adversaries with gradient-level access can mount successful inference and reconstruction attacks. In such settings, differentially private (DP) learning is known to provide resilience. However, approaches used in the status quo (i.e., central and local DP) introduce disparate utility vs. privacy trade-offs. In this work, we take the first step towards mitigating such trade-offs through hierarchical FL (HFL). We demonstrate that by the introduction of a new intermediary level where calibrated DP noise can be added, better privacy vs. utility trade-offs can be obtained; we term this hierarchical DP (HDP). Our experiments with 3 different datasets (commonly used as benchmarks for FL) suggest that HDP produces models as accurate as those obtained using central DP, where noise is added at a central aggregator. Such an approach also provides comparable benefit against inference adversaries as in the local DP case, where noise is added at the federated clients.
    A Resilient Distributed Boosting Algorithm. (arXiv:2206.04713v1 [cs.LG])
    Given a learning task where the data is distributed among several parties, communication is one of the fundamental resources which the parties would like to minimize. We present a distributed boosting algorithm which is resilient to a limited amount of noise. Our algorithm is similar to classical boosting algorithms, although it is equipped with a new component, inspired by Impagliazzo's hard-core lemma [Impagliazzo, 1995], adding a robustness quality to the algorithm. We also complement this result by showing that resilience to any asymptotically larger noise is not achievable by a communication-efficient algorithm.
    Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations. (arXiv:2206.04779v1 [cs.LG])
    Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning from visual observations has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.
    Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning. (arXiv:2206.04741v1 [quant-ph])
    We present a full implementation and simulation of a novel quantum reinforcement learning (RL) method and mathematically prove a quantum advantage. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate. The results confirm that QPE provides a quantum advantage in RL problems.
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v1 [cs.LG])
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses a provably-exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.
    Joint Entropy Search For Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v1 [cs.LG])
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    Lightweight Conditional Model Extrapolation for Streaming Data under Class-Prior Shift. (arXiv:2206.05181v1 [cs.LG])
    We introduce LIMES, a new method for learning with non-stationary streaming data, inspired by the recent success of meta-learning. The main idea is not to attempt to learn a single classifier that would have to work well across all occurring data distributions, nor many separate classifiers, but to exploit a hybrid strategy: we learn a single set of model parameters from which a specific classifier for any specific data distribution is derived via classifier adaptation. Assuming a multi-class classification setting with class-prior shift, the adaptation step can be performed analytically with only the classifier's bias terms being affected. Another contribution of our work is an extrapolation step that predicts suitable adaptation parameters for future time steps based on the previous data. In combination, we obtain a lightweight procedure for learning from streaming data with varying class distribution that adds no trainable parameters and almost no memory or computational overhead compared to training a single model. Experiments on a set of exemplary tasks using Twitter data show that LIMES achieves higher accuracy than alternative approaches, especially with respect to the relevant real-world metric of lowest within-day accuracy.
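    Why class-prior shift touches only the bias terms can be seen from the standard Bayes-rule correction, sketched below. This is the generic prior-shift adjustment, assumed here to correspond to the analytic adaptation step the abstract mentions; it is not taken from the LIMES code, and the function names are illustrative.

```python
import math


def adapt_logits(logits, train_prior, target_prior):
    """Shift each class logit by log(target prior) - log(training prior).

    Adding a per-class constant to the logits is equivalent to changing only
    the bias terms of the final linear layer.
    """
    return [z + math.log(q) - math.log(p)
            for z, p, q in zip(logits, train_prior, target_prior)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# An input the classifier finds uninformative: equal logits under a balanced prior.
probs = softmax(adapt_logits([0.0, 0.0], [0.5, 0.5], [0.9, 0.1]))
```

    For an uninformative input, the adapted prediction simply reproduces the new class prior, which is the behaviour one would expect from a correction that changes priors and nothing else.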
    An Image Processing Pipeline for Camera Trap Time-Lapse Recordings. (arXiv:2206.05159v1 [cs.CV])
    A new open-source image processing pipeline for analyzing camera trap time-lapse recordings is described. This pipeline includes machine learning models to assist human-in-the-loop video segmentation and animal re-identification. We present some performance results and observations on the utility of this pipeline after using it in a year-long project studying the spatial ecology and social behavior of the gopher tortoise.
    Empirical Bayes approach to Truth Discovery problems. (arXiv:2206.04816v1 [cs.LG])
    When aggregating information from conflicting sources, one's goal is to find the truth. Most real-value \emph{truth discovery} (TD) algorithms try to achieve this goal by estimating the competence of each source and then aggregating the conflicting information by weighing each source's answer proportionally to her competence. However, each of those algorithms requires more than a single source for such estimation and usually does not consider different estimation methods other than a weighted mean. Therefore, in this work we formulate, prove, and empirically test the conditions for an Empirical Bayes Estimator (EBE) to dominate the weighted mean aggregation. Our main result demonstrates that EBE, under mild conditions, can be used as a second step of any TD algorithm in order to reduce the expected error.
    Mixed integer linear optimization formulations for learning optimal binary classification trees. (arXiv:2206.04857v1 [cs.LG])
    Decision trees are powerful tools for classification and regression that attract many researchers working in the burgeoning area of machine learning. One advantage of decision trees over other methods is their interpretability, which is often preferred over other higher accuracy methods that are relatively uninterpretable. A binary classification tree has two types of vertices: (i) branching vertices which have exactly two children and where datapoints are assessed on a set of discrete features; and (ii) leaf vertices at which datapoints are given a discrete prediction. An optimal binary classification tree can be obtained by solving a biobjective optimization problem that seeks to (i) maximize the number of correctly classified datapoints and (ii) minimize the number of branching vertices. In this paper, we propose four mixed integer linear optimization (MILO) formulations for designing optimal binary classification trees: two flow-based formulations and two cut-based formulations. We provide theoretical comparisons between our proposed formulations and the strongest flow-based MILO formulation of Aghaei et al. (2021). We conduct experiments on 13 publicly available datasets to show the models' ability to scale and the strength of a biobjective approach using Pareto frontiers. Our code and data are available on GitHub.
    Stochastic Zeroth order Descent with Structured Directions. (arXiv:2206.05124v1 [math.OC])
    We introduce and analyze Structured Stochastic Zeroth order Descent (S-SZD), a finite difference approach which approximates a stochastic gradient on a set of $l\leq d$ orthogonal directions, where $d$ is the dimension of the ambient space. These directions are randomly chosen, and may change at each step. For smooth convex functions we prove almost sure convergence of the iterates and a convergence rate on the function values of the form $O(d/l k^{-c})$ for every $c<1/2$, which is arbitrarily close to the one of Stochastic Gradient Descent (SGD) in terms of number of iterations. Our bound also shows the benefits of using $l$ multiple directions instead of one. For non-convex functions satisfying the Polyak-{\L}ojasiewicz condition, we establish the first convergence rates for stochastic zeroth order algorithms under such an assumption. We corroborate our theoretical findings in numerical simulations where assumptions are satisfied and on the real-world problem of hyper-parameter optimization, observing that S-SZD has very good practical performances.
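    A finite-difference gradient surrogate on a subset of orthogonal directions can be sketched as follows. This is a simplified illustration, not the S-SZD method itself: it assumes the directions are random coordinate axes (one particular orthogonal choice), uses a forward difference, and applies a d/l rescaling that is a choice made for this sketch rather than something stated in the abstract.

```python
import random


def structured_fd_gradient(f, x, l, h=1e-6, rng=random):
    """Approximate the gradient of f at x along l random coordinate directions.

    f   -- function mapping a list of floats to a float
    x   -- current point (list of floats), dimension d = len(x)
    l   -- number of orthogonal directions to probe, with l <= d
    h   -- finite-difference step size
    rng -- source of randomness exposing sample(), e.g. the random module
    """
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    # Distinct coordinate axes are mutually orthogonal by construction.
    for i in rng.sample(range(d), l):
        shifted = list(x)
        shifted[i] += h
        # Forward difference along axis i, rescaled by d/l (sketch assumption).
        grad[i] = (d / l) * (f(shifted) - fx) / h
    return grad


def sphere(v):
    """Simple smooth convex test function: sum of squares."""
    return sum(t * t for t in v)


full = structured_fd_gradient(sphere, [1.0, 2.0, 3.0], l=3)
```

    Each call costs l + 1 function evaluations, so choosing l trades evaluation cost per step against how much of the gradient is observed.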
    Strong Memory Lower Bounds for Learning Natural Models. (arXiv:2206.04743v1 [cs.LG])
    We give lower bounds on the amount of memory required by one-pass streaming algorithms for solving several natural learning problems. In a setting where examples lie in $\{0,1\}^d$ and the optimal classifier can be encoded using $\kappa$ bits, we show that algorithms which learn using a near-minimal number of examples, $\tilde O(\kappa)$, must use $\tilde \Omega( d\kappa)$ bits of space. Our space bounds match the dimension of the ambient space of the problem's natural parametrization, even when it is quadratic in the size of examples and the final classifier. For instance, in the setting of $d$-sparse linear classifiers over degree-2 polynomial features, for which $\kappa=\Theta(d\log d)$, our space lower bound is $\tilde\Omega(d^2)$. Our bounds degrade gracefully with the stream length $N$, generally having the form $\tilde\Omega\left(d\kappa \cdot \frac{\kappa}{N}\right)$. Bounds of the form $\Omega(d\kappa)$ were known for learning parity and other problems defined over finite fields. Bounds that apply in a narrow range of sample sizes are also known for linear regression. Ours are the first such bounds for problems of the type commonly seen in recent learning applications that apply for a large range of input sizes.
    Training Neural Networks using SAT solvers. (arXiv:2206.04833v1 [cs.LG])
    We propose an algorithm to explore the global optimization method, using SAT solvers, for training a neural net. Deep Neural Networks have achieved great feats in tasks like image recognition, speech recognition, etc. Much of their success can be attributed to gradient-based optimisation methods, which scale well to huge datasets while still giving better solutions than any other existing method. However, there exist learning problems, like the parity function and the Fast Fourier Transform, where a neural network trained with a gradient-based optimisation algorithm cannot properly capture the underlying structure of the learning task. Thus, exploring global optimisation methods is of utmost interest, as gradient-based methods get stuck in local optima. In the experiments, we demonstrate the effectiveness of our algorithm against the ADAM optimiser in certain tasks like parity learning. However, in the case of image classification on the MNIST Dataset, the performance of our algorithm was less than satisfactory. We further discuss the role of the size of the training dataset and the hyper-parameter settings in keeping things scalable for a SAT solver.
    A new distance measurement and its application in K-Means Algorithm. (arXiv:2206.05215v1 [cs.LG])
    The K-Means clustering algorithm is one of the most commonly used clustering algorithms because of its simplicity and efficiency. K-Means based on Euclidean distance only pays attention to the linear distance between samples, but ignores the overall distribution structure of the dataset (i.e. the fluid structure of the dataset). Since it is difficult to describe the internal structure of two data points by Euclidean distance in high-dimensional data space, we propose a new distance measurement, namely, view-distance, and apply it to the K-Means algorithm. On the classical manifold learning datasets, the S-curve and Swiss roll datasets, this new distance not only clusters the data according to the structure of the data itself, but also yields neat dividing lines as boundaries between categories. Moreover, we also tested the classification accuracy and clustering effect of the K-Means algorithm based on view-distance on some real-world datasets. The experimental results show that, on most datasets, the K-Means algorithm based on view-distance has a certain degree of improvement in classification accuracy and clustering effect.
    Mildly Conservative Q-Learning for Offline Reinforcement Learning. (arXiv:2206.04745v1 [cs.LG])
    Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, which penalize the unseen actions or regularize with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders performance improvement. This paper explores mild but sufficient conservatism for offline learning that does not harm generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and that no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines.
    $\mathsf{G^2Retro}$: Two-Step Graph Generative Models for Retrosynthesis Prediction. (arXiv:2206.04882v1 [cs.LG])
    Retrosynthesis is a procedure where a molecule is transformed into potential reactants and thus the synthesis routes are identified. We propose a novel generative framework, denoted as $\mathsf{G^2Retro}$, for one-step retrosynthesis prediction. $\mathsf{G^2Retro}$ imitates the reversed logic of synthetic reactions, that is, first predicting the reaction centers to convert the target molecule into fragments named synthons, and then transforming synthons into reactants, following previous semi-template-based methods. In predicting reaction centers, $\mathsf{G^2Retro}$ defines a comprehensive set of reaction center types, and enables diversity in the predicted reactions by considering multiple reaction center candidates. In completing synthons, $\mathsf{G^2Retro}$ deploys a sequence of substructure attachments to transform synthons into reactants, which utilize a holistic view of the most updated structures of the synthons to be completed, as well as all the involved synthon and product structures. Here we show that $\mathsf{G^2Retro}$ is able to better prioritize the most possible reactants in the benchmark dataset than the state-of-the-art methods, and discover novel and highly likely reactions that are not included in the benchmark dataset.
    Bayesian Estimation of Differential Privacy. (arXiv:2206.05199v1 [cs.LG])
    Algorithms such as Differentially Private SGD enable training machine learning models with formal privacy guarantees. However, there is a discrepancy between the protection that such algorithms guarantee in theory and the protection they afford in practice. An emerging strand of work empirically estimates the protection afforded by differentially private training as a confidence interval for the privacy budget $\varepsilon$ spent on training a model. Existing approaches derive confidence intervals for $\varepsilon$ from confidence intervals for the false positive and false negative rates of membership inference attacks. Unfortunately, obtaining narrow high-confidence intervals for $\varepsilon$ using this method requires an impractically large sample size and training as many models as samples. We propose a novel Bayesian method that greatly reduces sample size, and adapt and validate a heuristic to draw more than one sample per trained model. Our Bayesian method exploits the hypothesis testing interpretation of differential privacy to obtain a posterior for $\varepsilon$ (not just a confidence interval) from the joint posterior of the false positive and false negative rates of membership inference attacks. For the same sample size and confidence, we derive confidence intervals for $\varepsilon$ around 40% narrower than prior work. The heuristic, which we adapt from label-only DP, can be used to further reduce the number of trained models needed to get enough samples by up to 2 orders of magnitude.
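    The hypothesis testing interpretation mentioned in the abstract above can be made concrete with a small sketch. For an $(\varepsilon, \delta)$-DP mechanism, any membership inference attack with false positive rate FPR and false negative rate FNR satisfies FPR + e^ε · FNR ≥ 1 − δ and FNR + e^ε · FPR ≥ 1 − δ, so observed error rates imply a lower bound on $\varepsilon$. The function below computes that point bound only; it is not the paper's Bayesian procedure, and the function name and defaults are assumptions made for illustration.

```python
import math


def epsilon_lower_bound(fpr: float, fnr: float, delta: float = 0.0) -> float:
    """Point lower bound on epsilon implied by an attack's error rates.

    Illustrative helper (hypothetical name). It inverts the two
    hypothesis-testing constraints of (epsilon, delta)-DP:
        fpr + exp(eps) * fnr >= 1 - delta
        fnr + exp(eps) * fpr >= 1 - delta
    It ignores sampling uncertainty in fpr and fnr entirely.
    """
    candidates = [0.0]  # epsilon is never negative
    for a, b in ((fpr, fnr), (fnr, fpr)):
        numerator = 1.0 - delta - a
        if numerator > 0.0:
            if b == 0.0:
                # A perfect attack on this side is incompatible with any finite epsilon.
                return math.inf
            candidates.append(math.log(numerator / b))
    return max(candidates)


# An attack with 5% false positives and 5% false negatives implies
# epsilon >= log(0.95 / 0.05) = log(19), roughly 2.94, when delta = 0.
print(epsilon_lower_bound(0.05, 0.05))
```

    In practice FPR and FNR are themselves estimated from a finite number of attack trials, which is why the approaches described above work with intervals or posteriors over these rates rather than point values.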
    Learning Attention-based Representations from Multiple Patterns for Relation Prediction in Knowledge Graphs. (arXiv:2206.04801v1 [cs.AI])
    Knowledge bases, and their representations in the form of knowledge graphs (KGs), are naturally incomplete. Since scientific and industrial applications have extensively adopted them, there is a high demand for solutions that complete their information. Several recent works tackle this challenge by learning embeddings for entities and relations, then employing them to predict new relations among the entities. Despite their aggrandizement, most of those methods focus only on the local neighbors of a relation to learn the embeddings. As a result, they may fail to capture the KGs' context information by neglecting long-term dependencies and the propagation of entities' semantics. In this manuscript, we propose ÆMP (Attention-based Embeddings from Multiple Patterns), a novel model for learning contextualized representations by: (i) acquiring entities' context information through an attention-enhanced message-passing scheme, which captures the entities' local semantics while focusing on different aspects of their neighborhood; and (ii) capturing the semantic context, by leveraging the paths and their relationships between entities. Our empirical findings draw insights into how attention mechanisms can improve entities' context representation and how combining entities and semantic path contexts improves the general representation of entities and the relation predictions. Experimental results on several large and small knowledge graph benchmarks show that ÆMP either outperforms or competes with state-of-the-art relation prediction methods.
    Neural Laplace: Learning diverse classes of differential equations in the Laplace domain. (arXiv:2206.04843v1 [cs.LG])
    Neural Ordinary Differential Equations model dynamical systems with ODEs learned by neural networks. However, ODEs are fundamentally inadequate to model systems with long-range dependencies or discontinuities, which are common in engineering and biological systems. Broader classes of differential equations (DE) have been proposed as remedies, including delay differential equations and integro-differential equations. Furthermore, Neural ODE suffers from numerical instability when modelling stiff ODEs and ODEs with piecewise forcing functions. In this work, we propose Neural Laplace, a unified framework for learning diverse classes of DEs including all the aforementioned ones. Instead of modelling the dynamics in the time domain, we model it in the Laplace domain, where the history-dependencies and discontinuities in time can be represented as summations of complex exponentials. To make learning more efficient, we use the geometrical stereographic map of a Riemann sphere to induce more smoothness in the Laplace domain. In the experiments, Neural Laplace shows superior performance in modelling and extrapolating the trajectories of diverse classes of DEs, including the ones with complex history dependency and abrupt changes.
    Detecting Anomalous Cryptocurrency Transactions: an AML/CFT Application of Machine Learning-based Forensics. (arXiv:2206.04803v1 [cs.CR])
    The rise of blockchain and distributed ledger technologies (DLTs) in the financial sector has generated a socio-economic shift that triggered legal concerns and regulatory initiatives. While the anonymity of DLTs may safeguard the right to privacy, data protection and other civil liberties, lack of identification hinders accountability, investigation and enforcement. The resulting challenges extend to the rules to combat money laundering and the financing of terrorism and proliferation (AML/CFT). As law enforcement agencies and analytics companies have begun to successfully apply forensics to track currency across blockchain ecosystems, in this paper we focus on the increasing relevance of these techniques. In particular, we offer insights into the application to the Internet of Money (IoM) of machine learning, network and transaction graph analysis. After providing some background on the notion of anonymity in the IoM and on the interplay between AML/CFT and blockchain forensics, we focus on anomaly detection approaches leading to our experiments. Namely, we analyzed a real-world dataset of Bitcoin transactions represented as a directed graph network through various machine learning techniques. Our claim is that the AML/CFT domain could benefit from novel graph analysis methods in machine learning. Indeed, our findings show that the Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) neural network types represent a promising solution for AML/CFT compliance.
    AI-MIA: COVID-19 Detection & Severity Analysis through Medical Imaging. (arXiv:2206.04732v1 [eess.IV])
    This paper presents the baseline approach for the organized 2nd Covid-19 Competition, occurring in the framework of the AIMIA Workshop in the European Conference on Computer Vision (ECCV 2022). It presents the COV19-CT-DB database, which is annotated for COVID-19 detection and consists of about 7,700 3-D CT scans. The part of the database consisting of Covid-19 cases is further annotated in terms of four Covid-19 severity conditions. We have split the database, and the latter part of it, into training, validation and test datasets. The former two datasets are used for training and validation of machine learning models, while the latter will be used for evaluation of the developed models. The baseline approach is a deep learning approach based on a CNN-RNN network, and we report its performance on the COV19-CT-DB database.
    Distributionally Robust End-to-End Portfolio Construction. (arXiv:2206.05134v1 [q-fin.CP])
    We propose an end-to-end distributionally robust system for portfolio construction that integrates the asset return prediction model with a distributionally robust portfolio optimization model. We also show how to learn the risk-tolerance parameter and the degree of robustness directly from data. End-to-end systems have an advantage in that information can be communicated between the prediction and decision layers during training, allowing the parameters to be trained for the final task rather than solely for predictive performance. However, existing end-to-end systems are not able to quantify and correct for the impact of model risk on the decision layer. Our proposed distributionally robust end-to-end portfolio selection system explicitly accounts for the impact of model risk. The decision layer chooses portfolios by solving a minimax problem where the distribution of the asset returns is assumed to belong to an ambiguity set centered around a nominal distribution. Using convex duality, we recast the minimax problem in a form that allows for efficient training of the end-to-end system.
    MEAT: Maneuver Extraction from Agent Trajectories. (arXiv:2206.05158v1 [cs.CV])
    Advances in learning-based trajectory prediction are enabled by large-scale datasets. However, in-depth analysis of such datasets is limited. Moreover, the evaluation of prediction models is limited to metrics averaged over all samples in the dataset. We propose an automated methodology for extracting maneuvers (e.g., left turn, lane change) from agent trajectories in such datasets. The methodology considers information about the agent dynamics and information about the lane segments the agent traveled along. Although it is possible to use the resulting maneuvers for training classification networks, we use them, as an example, for extensive trajectory dataset analysis and maneuver-specific evaluation of multiple state-of-the-art trajectory prediction models. Additionally, an analysis of the datasets and an evaluation of the prediction models based on the agent dynamics is provided.
    Deep Multi-Agent Reinforcement Learning with Hybrid Action Spaces based on Maximum Entropy. (arXiv:2206.05108v1 [cs.LG])
    Multi-agent deep reinforcement learning has been applied to address a variety of complex problems with either discrete or continuous action spaces and achieved great success. However, most real-world environments cannot be described by only discrete action spaces or only continuous action spaces, and few works have applied deep reinforcement learning (DRL) to multi-agent problems with hybrid action spaces. Therefore, we propose a novel algorithm, Deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC), to fill this gap. This algorithm follows the centralized training but decentralized execution (CTDE) paradigm, and extends the Soft Actor-Critic algorithm (SAC) to handle hybrid action space problems in multi-agent environments based on maximum entropy. Our experiments run on a simple multi-agent particle world with a continuous observation and discrete action space, along with some basic simulated physics. The experimental results show that MAHSAC has good performance in training speed, stability, and anti-interference ability. At the same time, it outperforms existing independent deep hybrid learning methods in cooperative scenarios and competitive scenarios.
    Dynamic mean field programming. (arXiv:2206.05200v1 [stat.ML])
    A dynamic mean field theory is developed for model based Bayesian reinforcement learning in the large state space limit. In an analogy with the statistical physics of disordered systems, the transition probabilities are interpreted as couplings, and value functions as deterministic spins, and thus the sampled transition probabilities are considered to be quenched random variables. The results reveal that, under standard assumptions, the posterior over Q-values is asymptotically independent and Gaussian across state-action pairs, for infinite horizon problems. The finite horizon case exhibits the same behaviour for all state-action pairs at each time but has an additional correlation across time, for each state-action pair. The results also hold for policy evaluation. The Gaussian statistics can be computed from a set of coupled mean field equations derived from the Bellman equation, which we call dynamic mean field programming (DMFP). For Q-value iteration, approximate equations are obtained by appealing to extreme value theory, and closed form expressions are found in the independent and identically distributed case. The Lyapunov stability of these closed form equations is studied.
    How Much is Enough? A Study on Diffusion Times in Score-based Generative Models. (arXiv:2206.05173v1 [stat.ML])
    Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off, and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive w.r.t. the state-of-the-art, according to standard sample quality metrics and log-likelihood.
    In Defense of Core-set: A Density-aware Core-set Selection for Active Learning. (arXiv:2206.04838v1 [cs.LG])
    Active learning enables the efficient construction of a labeled dataset by labeling informative samples from an unlabeled dataset. In a real-world active learning scenario, considering the diversity of the selected samples is crucial because many redundant or highly similar samples exist. The core-set approach is a promising diversity-based method that selects diverse samples based on the distance between samples. However, the approach performs poorly compared to the uncertainty-based approaches that select the most difficult samples, where neural models reveal low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to have more informative samples than dense regions. Motivated by our analysis, we empower the core-set approach with density-awareness and propose a density-aware core-set (DACS). The strategy is to estimate the density of the unlabeled samples and select diverse samples mainly from sparse regions. To reduce the computational bottlenecks in estimating the density, we also introduce a new density approximation based on locality-sensitive hashing. Experimental results clearly demonstrate the efficacy of DACS in both classification and regression tasks and specifically show that DACS can produce state-of-the-art performance in a practical scenario. Since DACS is weakly dependent on neural architectures, we present a simple yet effective combination method to show that the existing methods can be beneficially combined with DACS.
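    As a rough illustration of the distance-based selection that the core-set approach in the abstract above builds on, the sketch below implements greedy k-center (farthest-first) selection over feature vectors. It covers only the plain diversity criterion; the density estimation and locality-sensitive hashing of DACS are not reproduced, and the function name and toy data are assumptions.

```python
import math


def k_center_greedy(points, labeled_idx, budget):
    """Pick `budget` indices that are far from everything already chosen.

    Illustrative helper (hypothetical name): repeatedly selects the point
    whose distance to its nearest labeled or selected point is largest.
    """
    # Distance from every point to its nearest already-labeled point.
    min_dist = [
        min((math.dist(p, points[j]) for j in labeled_idx), default=math.inf)
        for p in points
    ]
    selected = []
    for _ in range(budget):
        # Farthest-first: take the point that is worst covered so far.
        i = max(range(len(points)), key=lambda k: min_dist[k])
        selected.append(i)
        # The new center may now be the nearest one for some points.
        for k, p in enumerate(points):
            min_dist[k] = min(min_dist[k], math.dist(p, points[i]))
    return selected


# Toy 2-D features: two tight pairs and one isolated point.
features = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
print(k_center_greedy(features, labeled_idx=[0], budget=2))
```

    A density-aware variant would additionally weight or filter candidates by an estimate of local density, so that selection concentrates on sparse regions.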
    Adversarial Counterfactual Environment Model Learning. (arXiv:2206.04890v1 [cs.LG])
    A good model for action-effect prediction, named environment model, is important to achieve sample-efficient decision-making policy learning in many domains like robot control, recommender systems, and patients' treatment selection. We can take unlimited trials with such a model to identify the appropriate actions so that the costs of queries in the real world can be saved. It requires the model to handle unseen data correctly, also called counterfactual data. However, standard data fitting techniques do not automatically achieve such generalization ability and commonly result in unreliable models. In this work, we introduce counterfactual-query risk minimization (CQRM) in model learning for generalizing to a counterfactual dataset queried by a specific target policy. Since the target policies can be various and unknown in policy learning, we propose an adversarial CQRM objective in which the model learns on counterfactual data queried by adversarial policies, and finally derive a tractable solution GALILEO. We also discover that adversarial CQRM is closely related to the adversarial model learning, explaining the effectiveness of the latter. We apply GALILEO in synthetic tasks and a real-world application. The results show that GALILEO makes accurate predictions on counterfactual data and thus significantly improves policies in real-world testing.
    Federated Momentum Contrastive Clustering. (arXiv:2206.05093v1 [cs.LG])
    We present federated momentum contrastive clustering (FedMCC), a learning framework that can not only extract discriminative representations over distributed local data but also perform data clustering. In FedMCC, a transformed data pair passes through both the online and target networks, resulting in four representations over which the losses are determined. The resulting high-quality representations generated by FedMCC can outperform several existing self-supervised learning methods for linear evaluation and semi-supervised learning tasks. FedMCC can easily be adapted to ordinary centralized clustering through what we call momentum contrastive clustering (MCC). We show that MCC achieves state-of-the-art clustering accuracy results in certain datasets such as STL-10 and ImageNet-10. We also present a method to reduce the memory footprint of our clustering schemes.
    Tensor Train for Global Optimization Problems in Robotics. (arXiv:2206.05077v1 [cs.RO])
    The convergence of many numerical optimization techniques is highly sensitive to the initial guess provided to the solver. We propose an approach based on tensor methods to initialize the existing optimization solvers close to global optima. The approach uses only the definition of the cost function and does not need access to any database of good solutions. We first transform the cost function, which is a function of task parameters and optimization variables, into a probability density function. Unlike existing approaches that set the task parameters as constant, we consider them as another set of random variables and approximate the joint probability distribution of the task parameters and the optimization variables using a surrogate probability model. For a given task, we then generate samples from the conditional distribution with respect to the given task parameter and use them as initialization for the optimization solver. As conditioning and sampling from an arbitrary density function are challenging, we use Tensor Train decomposition to obtain a surrogate probability model from which we can efficiently obtain the conditional model and the samples. The method can produce multiple solutions coming from different modes (when they exist) for a given task. We first evaluate the approach by applying it to various challenging benchmark functions for numerical optimization that are difficult to solve using gradient-based optimization solvers with a naive initialization, showing that the proposed method can produce samples close to the global optima and coming from multiple modes. We then demonstrate the generality of the framework and its relevance to robotics by applying the proposed method to inverse kinematics and motion planning problems with a 7-DoF manipulator.
    Muffliato: Peer-to-Peer Privacy Amplification for Decentralized Optimization and Averaging. (arXiv:2206.05091v1 [cs.CR])
    Decentralized optimization is increasingly popular in machine learning for its scalability and efficiency. Intuitively, it should also provide better privacy guarantees, as nodes only observe the messages sent by their neighbors in the network graph. But formalizing and quantifying this gain is challenging: existing results are typically limited to Local Differential Privacy (LDP) guarantees that overlook the advantages of decentralization. In this work, we introduce pairwise network differential privacy, a relaxation of LDP that captures the fact that the privacy leakage from a node $u$ to a node $v$ may depend on their relative position in the graph. We then analyze the combination of local noise injection with (simple or randomized) gossip averaging protocols on fixed and random communication graphs. We also derive a differentially private decentralized optimization algorithm that alternates between local gradient descent steps and gossip averaging. Our results show that our algorithms amplify privacy guarantees as a function of the distance between nodes in the graph, matching the privacy-utility trade-off of the trusted curator, up to factors that explicitly depend on the graph topology. Finally, we illustrate our privacy gains with experiments on synthetic and real-world datasets.
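    The combination of local noise injection and gossip averaging described in the abstract above can be sketched in a few lines. The example assumes a ring graph with at least three nodes, synchronous rounds, and equal weights of 1/3 for a node and its two neighbors; these choices, the function name, and the noise scale are illustrative assumptions, not the Muffliato protocol or its privacy accounting.

```python
import random


def noisy_gossip_average(values, sigma, rounds, seed=0):
    """Each node perturbs its value once, then averages with ring neighbors.

    Illustrative helper (hypothetical name). Returns the final node states
    and the mean of the noisy initial values, which the averaging preserves.
    """
    rng = random.Random(seed)
    n = len(values)
    # Local noise injection: every node adds Gaussian noise before sharing.
    x = [v + rng.gauss(0.0, sigma) for v in values]
    noisy_mean = sum(x) / n
    for _ in range(rounds):
        # Synchronous gossip step: average self with left and right neighbor.
        x = [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]
    return x, noisy_mean


states, target = noisy_gossip_average([1.0, 2.0, 3.0, 4.0, 5.0], sigma=0.5, rounds=200)
# After enough rounds every node holds approximately the same value.
print(max(abs(s - target) for s in states))
```

    Because each node only ever sees its neighbors' states, what a distant node can infer about another node's input is attenuated by the intermediate averaging, which is the intuition behind a privacy guarantee that depends on graph distance.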
    Scalable Deep Gaussian Markov Random Fields for General Graphs. (arXiv:2206.05032v1 [stat.ML])
    Machine learning methods on graphs have proven useful in many applications due to their ability to handle generally structured data. The framework of Gaussian Markov Random Fields (GMRFs) provides a principled way to define Gaussian models on graphs by utilizing their sparsity structure. We propose a flexible GMRF model for general graphs built on the multi-layer structure of Deep GMRFs, originally proposed for lattice graphs only. By designing a new type of layer we enable the model to scale to large graphs. The layer is constructed to allow for efficient training using variational inference and existing software frameworks for Graph Neural Networks. For a Gaussian likelihood, close to exact Bayesian inference is available for the latent field. This allows for making predictions with accompanying uncertainty estimates. The usefulness of the proposed model is verified by experiments on a number of synthetic and real-world datasets, where it compares favorably to both other Bayesian methods and deep learning methods.
    We Cannot Guarantee Safety: The Undecidability of Graph Neural Network Verification. (arXiv:2206.05070v1 [cs.LG])
    Graph Neural Networks (GNN) are commonly used for two tasks: (whole) graph classification and node classification. We formally introduce generically formulated decision problems for both tasks, corresponding to the following pattern: given a GNN, some specification of valid inputs, and some specification of valid outputs, decide whether there is a valid input satisfying the output specification. We then prove that graph classifier verification is undecidable in general, implying that there cannot be an algorithm surely guaranteeing the absence of misclassification of any kind. Additionally, we show that verification in the node classification case becomes decidable as soon as we restrict the degree of the considered graphs. Furthermore, we discuss possible changes to these results depending on the considered GNN model and specifications.
    Zero-Shot Audio Classification using Image Embeddings. (arXiv:2206.04984v1 [cs.SD])
    Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling which is expensive and time-consuming. Zero-shot learning models are capable of classifying the unseen concepts by utilizing their semantic information. The present study introduces image embeddings as side information on zero-shot audio classification by using a nonlinear acoustic-semantic projection. We extract the semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information in different domains: image, audio, and textual. We demonstrate that the image embeddings can be used as semantic information to perform zero-shot audio classification. The experimental results show that the image and textual embeddings display similar performance both individually and together. We additionally calculate the semantic acoustic embeddings from the test samples to provide an upper limit to the performance. The results show that the classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach the performance of the semantic acoustic embeddings when the seen and unseen classes are semantically similar.
    Improved Approximation for Fair Correlation Clustering. (arXiv:2206.05050v1 [cs.LG])
    Correlation clustering is a ubiquitous paradigm in unsupervised machine learning where addressing unfairness is a major challenge. Motivated by this, we study Fair Correlation Clustering where the data points may belong to different protected groups and the goal is to ensure fair representation of all groups across clusters. Our paper significantly generalizes and improves on the quality guarantees of previous work of Ahmadi et al. and Ahmadian et al. as follows: (i) we allow the user to specify an arbitrary upper bound on the representation of each group in a cluster; (ii) our algorithm allows individuals to have multiple protected features and ensures fairness simultaneously across them all; (iii) we prove guarantees for clustering quality and fairness in this general setting. Furthermore, this improves on the results for the special cases studied in previous work. Our experiments on real-world data demonstrate that our clustering quality compared to the optimal solution is much better than what our theoretical result suggests.
    PAVI: Plate-Amortized Variational Inference. (arXiv:2206.05111v1 [cs.AI])
    Given some observed data and a probabilistic generative model, Bayesian inference aims at obtaining the distribution of a model's latent parameters that could have yielded the data. This task is challenging for large population studies where thousands of measurements are performed over a cohort of hundreds of subjects, resulting in a massive latent parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical. In this work, we design structured VI families that can efficiently tackle large population studies. To this end, our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model, symbolized by the model's plates. We name this concept plate amortization, and illustrate the powerful synergies it entails, resulting in large-scale hierarchical variational distributions that are expressive, parsimoniously parameterized and orders of magnitude faster to train. We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring a million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.
    Weighted Ensembles for Active Learning with Adaptivity. (arXiv:2206.05009v1 [cs.LG])
    Labeled data can be expensive to acquire in several application domains, including medical imaging, robotics, and computer vision. To efficiently train machine learning models under such high labeling costs, active learning (AL) judiciously selects the most informative data instances to label on-the-fly. This active sampling process can benefit from a statistical function model, that is typically captured by a Gaussian process (GP). While most GP-based AL approaches rely on a single kernel function, the present contribution advocates an ensemble of GP models with weights adapted to the labeled data collected incrementally. Building on this novel EGP model, a suite of acquisition functions emerges based on the uncertainty and disagreement rules. An adaptively weighted ensemble of EGP-based acquisition functions is also introduced to further robustify performance. Extensive tests on synthetic and real datasets showcase the merits of the proposed EGP-based approaches with respect to the single GP-based AL alternatives.
    Saccade Mechanisms for Image Classification, Object Detection and Tracking. (arXiv:2206.05102v1 [cs.CV])
    We examine how the saccade mechanism from biological vision can be used to make deep neural networks more efficient for classification and object detection problems. Our proposed approach is based on the ideas of attention-driven visual processing and saccades, miniature eye movements influenced by attention. We conduct experiments by analyzing: i) the robustness of different deep neural network (DNN) feature extractors to partially-sensed images for image classification and object detection, and ii) the utility of saccades in masking image patches for image classification and object tracking. Experiments with convolutional nets (ResNet-18) and transformer-based models (ViT, DETR, TransTrack) are conducted on several datasets (CIFAR-10, DAVSOD, MSCOCO, and MOT17). Our experiments show intelligent data reduction via learning to mimic human saccades when used in conjunction with state-of-the-art DNNs for classification, detection, and tracking tasks. We observed minimal drop in performance for the classification and detection tasks while only using about 30% of the original sensor data. We discuss how the saccade mechanism can inform hardware design via "in-pixel" processing.
    Temporal Inductive Logic Reasoning. (arXiv:2206.05051v1 [cs.LG])
    Inductive logic reasoning is one of the fundamental tasks on graphs, which seeks to generalize patterns from the data. This task has been studied extensively for traditional graph datasets such as knowledge graphs (KGs), with representative techniques such as inductive logic programming (ILP). Existing ILP methods typically assume learning from KGs with static facts and binary relations. Beyond KGs, graph structures are widely present in other applications such as video instructions, scene graphs and program executions. While inductive logic reasoning is also beneficial for these applications, applying ILP to the corresponding graphs is nontrivial: they are more complex than KGs and usually involve timestamps and n-ary relations, effectively forming a type of hypergraph with temporal events. In this work, we study two such applications and propose to represent them as hypergraphs with time intervals. To reason on this graph, we propose the multi-start random B-walk that traverses this hypergraph. Combining it with a path-consistency algorithm, we propose an efficient backward-chaining ILP method that learns logic rules by generalizing from both the temporal and the relational data.
    Deep Learning-based Massive MIMO CSI Acquisition for 5G Evolution and 6G. (arXiv:2206.04967v1 [eess.SP])
    Recently, inspired by successful applications in many fields, deep learning (DL) technologies for CSI acquisition have received considerable research interest from both academia and industry. Considering the practical feedback mechanism of 5th generation (5G) New Radio (NR) networks, we propose two implementation schemes for artificial intelligence for CSI (AI4CSI): the DL-based receiver and the end-to-end design. The proposed AI4CSI schemes were evaluated in 5G NR networks in terms of spectrum efficiency (SE), feedback overhead, and computational complexity, and compared with legacy schemes. To demonstrate whether these schemes can be used in real-life scenarios, both model-based channel data and practically measured channels were used in our investigations. When DL-based CSI acquisition is applied to the receiver only, which has little air interface impact, it provides approximately 25% SE gain at a moderate feedback overhead level. It is feasible to deploy it in current 5G networks during 5G evolution. For the end-to-end DL-based CSI enhancements, the evaluations also demonstrated an additional SE gain of 6% to 26% compared with DL-based receivers and 33% to 58% compared with legacy CSI schemes. Considering its large impact on air-interface design, it will be a candidate technology for 6th generation (6G) networks, in which an air interface designed by artificial intelligence can be used.
    MAREO: Memory- and Attention- based visual REasOning. (arXiv:2206.04928v1 [cs.AI])
    Humans continue to vastly outperform modern AI systems in their ability to parse and understand complex visual scenes flexibly. Attention and memory are two systems known to play a critical role in our ability to selectively maintain and manipulate behaviorally-relevant visual information to solve some of the most challenging visual reasoning tasks. Here, we present a novel architecture for visual reasoning inspired by the cognitive-science literature on visual reasoning, the Memory- and Attention-based (visual) REasOning (MAREO) architecture. MAREO instantiates an active-vision theory, which posits that the brain solves complex visual reasoning problems compositionally by learning to combine previously-learned elementary visual operations to form more complex visual routines. MAREO learns to solve visual reasoning tasks via sequences of attention shifts to route and maintain task-relevant visual information into a memory bank via a multi-head transformer module. Visual routines are then deployed by a dedicated reasoning module trained to judge various relations between objects in the scenes. Experiments on four types of reasoning tasks demonstrate MAREO's ability to learn visual routines in a robust and sample-efficient manner.
    Symbolic image detection using scene and knowledge graphs. (arXiv:2206.04863v1 [cs.CV])
    Sometimes the meaning conveyed by images goes beyond the list of objects they contain; instead, images may express a powerful message to affect the viewers' minds. Inferring this message requires reasoning about the relationships between the objects, and general common-sense knowledge about the components. In this paper, we use a scene graph, a graph representation of an image, to capture visual components. In addition, we generate a knowledge graph using facts extracted from ConceptNet to reason about objects and attributes. To detect the symbols, we propose a neural network framework named SKG-Sym. The framework first generates the representations of the scene graph of the image and its knowledge graph using a Graph Convolution Network. The framework then fuses the representations and uses an MLP to classify them. We extend the network further to use an attention mechanism which learns the importance of the graph representations. We evaluate our methods on a dataset of advertisements, and compare them with baseline symbolism classification methods (ResNet and VGG). Results show that our methods outperform ResNet in terms of F-score and that the attention-based mechanism is competitive with VGG while having much lower model complexity.
    Convolutional Layers Are Not Translation Equivariant. (arXiv:2206.04979v1 [cs.CV])
    The purpose of this paper is to correct a misconception about convolutional neural networks (CNNs). CNNs are made up of convolutional layers which are shift equivariant due to weight sharing. However, contrary to popular belief, convolutional layers are not translation equivariant, even when boundary effects are ignored and when pooling and subsampling are absent. This is because shift equivariance is a discrete symmetry while translation equivariance is a continuous symmetry. That discrete systems do not in general inherit continuous equivariances is a fundamental limitation of equivariant deep learning. We discuss two implications of this fact. First, CNNs have achieved success in image processing despite not inheriting the translation equivariance of the physical systems they model. Second, using CNNs to solve partial differential equations (PDEs) will not result in translation equivariant solvers.
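    The distinction between shift and translation equivariance can be made concrete on a small example. The following is a minimal sketch, not taken from the paper: it assumes a 1-D signal with circular (wrap-around) boundaries so that boundary effects do not interfere, and the names `circular_conv` and `shift` are illustrative. It checks that filtering commutes with integer shifts of the sampling grid.

```python
def circular_conv(signal, kernel):
    """1-D circular cross-correlation: out[i] = sum_j kernel[j] * signal[(i + j) % n]."""
    n = len(signal)
    return [
        sum(k * signal[(i + j) % n] for j, k in enumerate(kernel))
        for i in range(n)
    ]


def shift(signal, s):
    """Circularly shift a signal by an integer number of samples s."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]
kernel = [1.0, -2.0, 1.0]  # arbitrary illustrative filter weights

# Shift equivariance: filtering a shifted input equals shifting the filtered output.
lhs = circular_conv(shift(signal, 2), kernel)
rhs = shift(circular_conv(signal, kernel), 2)
shift_equivariant = lhs == rhs
```

    Note that `shift` only accepts whole-sample offsets. A translation by, say, half a sample has no exact counterpart on the grid without resampling, which is the gap between the discrete and the continuous symmetry that the abstract points to.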
    On Neural Architecture Inductive Biases for Relational Tasks. (arXiv:2206.05056v1 [cs.NE])
    Current deep learning approaches have shown good in-distribution generalization performance, but struggle with out-of-distribution generalization. This is especially true in the case of tasks involving abstract relations like recognizing rules in sequences, as we find in many intelligence tests. Recent work has explored how forcing relational representations to remain distinct from sensory representations, as seems to be the case in the brain, can help artificial systems. Building on this work, we further explore and formalize the advantages afforded by 'partitioned' representations of relations and sensory details, and how this inductive bias can help recompose learned relational structure in newly encountered settings. We introduce a simple architecture based on similarity scores which we name Compositional Relational Network (CoRelNet). Using this model, we investigate a series of inductive biases that ensure abstract relations are learned and represented distinctly from sensory data, and explore their effects on out-of-distribution generalization for a series of relational psychophysics tasks. We find that simple architectural choices can outperform existing models in out-of-distribution generalization. Together, these results show that partitioning relational representations from other information streams may be a simple way to augment existing network architectures' robustness when performing out-of-distribution relational computations.
    Diffeomorphic Counterfactuals with Generative Models. (arXiv:2206.05075v1 [cs.LG])
    Counterfactuals can explain classification decisions of neural networks in a human interpretable way. We propose a simple but effective method to generate such counterfactuals. More specifically, we perform a suitable diffeomorphic coordinate transformation and then perform gradient ascent in these coordinates to find counterfactuals which are classified with great confidence as a specified target class. We propose two methods to leverage generative models to construct such suitable coordinate systems that are either exactly or approximately diffeomorphic. We analyze the generation process theoretically using Riemannian differential geometry and validate the quality of the generated counterfactuals using various qualitative and quantitative measures.
    The Generalized Eigenvalue Problem as a Nash Equilibrium. (arXiv:2206.04993v1 [cs.LG])
    The generalized eigenvalue problem (GEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent components analysis, partial least squares, linear discriminant analysis, principal components, successor features and others. Despite this, most general solvers are prohibitively expensive when dealing with massive data sets and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ GEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $\mathcal{O}(d^2k)$ complexity per iteration which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to achieve $\mathcal{O}(dk)$ complexity, scaling to datasets $100\times$ larger than those evaluated by prior methods. Empirically we demonstrate that our algorithm is able to solve a variety of GEP problem instances including a large-scale analysis of neural network activations.
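    For readers unfamiliar with the problem itself, the GEP asks for pairs $(\lambda, v)$ with $Av = \lambda Bv$. The sketch below is a plain power iteration on $B^{-1}A$ for the top pair only; it is not the game-theoretic algorithm of the paper. It assumes a symmetric `A` and a diagonal positive `B`, and all names are illustrative.

```python
def matvec(m, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_generalized_eigenpair(a, b_diag, iters=200):
    """Power iteration on B^{-1} A, with B diagonal and given by its diagonal."""
    v = [1.0] * len(b_diag)
    for _ in range(iters):
        w = matvec(a, v)
        w = [wi / bi for wi, bi in zip(w, b_diag)]  # apply B^{-1}
        norm = dot(w, w) ** 0.5
        v = [wi / norm for wi in w]
    bv = [bi * vi for bi, vi in zip(b_diag, v)]
    lam = dot(v, matvec(a, v)) / dot(v, bv)  # generalized Rayleigh quotient
    return lam, v


A = [[2.0, 1.0], [1.0, 2.0]]
B_DIAG = [2.0, 1.0]  # B = diag(2, 1), an assumption made for simplicity

lam, v = top_generalized_eigenpair(A, B_DIAG)
# Residual of A v - lam * B v, which should vanish at a generalized eigenpair.
residual = [av - lam * b * vi for av, b, vi in zip(matvec(A, v), B_DIAG, v)]
```

    For this 2x2 instance the characteristic equation $2\lambda^2 - 6\lambda + 3 = 0$ gives a top eigenvalue of $(3 + \sqrt{3})/2$, which the iteration should approach.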
    Deep Multi-view Semi-supervised Clustering with Sample Pairwise Constraints. (arXiv:2206.04949v1 [cs.CV])
    Multi-view clustering has attracted much attention thanks to its capacity for multi-source information integration. Although numerous advanced methods have been proposed in past decades, most of them overlook the significance of weakly-supervised information and fail to preserve the feature properties of multiple views, thus resulting in unsatisfactory clustering performance. To address these issues, in this paper, we propose a novel Deep Multi-view Semi-supervised Clustering (DMSC) method, which jointly optimizes three kinds of losses during network fine-tuning: a multi-view clustering loss, a semi-supervised pairwise constraint loss and a multiple-autoencoder reconstruction loss. Specifically, a KL divergence based multi-view clustering loss is imposed on the common representation of multi-view data to perform heterogeneous feature optimization, multi-view weighting and clustering prediction simultaneously. Then, we propose to integrate pairwise constraints into the process of multi-view clustering by enforcing the learned multi-view representation of must-link samples (cannot-link samples) to be similar (dissimilar), such that the formed clustering architecture can be more credible. Moreover, unlike existing rivals that only preserve the encoders for each heterogeneous branch during network fine-tuning, we further propose to tune the intact autoencoder framework that contains both encoders and decoders. In this way, the issue of serious corruption of the view-specific and view-shared feature spaces can be alleviated, making the whole training procedure more stable. Through comprehensive experiments on eight popular image datasets, we demonstrate that our proposed approach performs better than the state-of-the-art multi-view and single-view competitors.
    Spatial Cross-Attention Improves Self-Supervised Visual Representation Learning. (arXiv:2206.05028v1 [cs.CV])
    Unsupervised representation learning methods like SwAV have proved effective in learning the visual semantics of a target dataset. The main idea behind these methods is that different views of the same image represent the same semantics. In this paper, we further introduce an add-on module to facilitate the injection of knowledge accounting for spatial cross correlations among the samples. This in turn results in distilling intra-class information including feature level locations and cross similarities between same-class instances. The proposed add-on can be added to existing methods such as SwAV. We can later remove the add-on module for inference without any modification of the learned weights. Through an extensive set of empirical evaluations, we verify that our method yields improved performance in detecting class activation maps, top-1 classification accuracy, and down-stream tasks such as object detection, with different configuration settings.
    Refining neural network predictions using background knowledge. (arXiv:2206.04976v1 [cs.AI])
    Recent work has shown that we can use logical background knowledge in learning systems to compensate for a lack of labeled training data. Many such methods work by creating a loss function that encodes this knowledge. However, the logic is often discarded after training, even if it is still useful at test time. Instead, we ensure neural network predictions satisfy the knowledge by refining the predictions with an extra computation step. We introduce differentiable refinement functions that find a corrected prediction close to the original prediction. We study how to effectively and efficiently compute these refinement functions. Using a new algorithm, we combine refinement functions to find refined predictions for logical formulas of any complexity. This algorithm finds optimal refinements on complex SAT formulas in significantly fewer iterations and frequently finds solutions where gradient descent cannot.
    Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality. (arXiv:2206.04921v1 [cs.LG])
    Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study could help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration.
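    As a rough illustration of value iteration on a finite SSP, the sketch below runs the Bellman update on a small hand-written tabular model. It is not the paper's algorithm and omits any offline-specific correction; the `model` dictionary stands in for transition and cost estimates that would, by assumption, be computed from logged data, and all names are hypothetical.

```python
# Hypothetical tabular SSP model, as it might be estimated from logged transitions.
# model[state][action] = (cost, {next_state: probability}); state 2 is the goal.
GOAL = 2
model = {
    0: {"safe": (1.0, {1: 1.0}), "risky": (0.5, {0: 0.5, GOAL: 0.5})},
    1: {"safe": (1.0, {GOAL: 1.0})},
}


def q_value(cost, transitions, values):
    """One-step cost plus expected cost-to-go under the current value estimate."""
    return cost + sum(p * values[nxt] for nxt, p in transitions.items())


def value_iteration(model, goal, iters=200):
    """Repeated Bellman updates; the goal state has zero cost-to-go."""
    values = {s: 0.0 for s in model}
    values[goal] = 0.0
    for _ in range(iters):
        new_values = {goal: 0.0}
        for s, actions in model.items():
            new_values[s] = min(q_value(c, t, values) for c, t in actions.values())
        values = new_values
    return values


def greedy_policy(model, values):
    """Pick, in each state, the action with the smallest estimated cost-to-go."""
    return {
        s: min(actions, key=lambda a: q_value(*actions[a], values))
        for s, actions in model.items()
    }


values = value_iteration(model, GOAL)
policy = greedy_policy(model, values)
```

    In this toy instance the risky action in state 0 solves $V(0) = 0.5 + 0.5\,V(0)$, giving an expected cost of 1, which beats the safe route's cost of 2.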
    Evolutionary Echo State Network: evolving reservoirs in the Fourier space. (arXiv:2206.04951v1 [cs.NE])
    The Echo State Network (ESN) is a class of Recurrent Neural Network with a large number of hidden-hidden weights (in the so-called reservoir). The canonical ESN and its variations have recently received significant attention due to their remarkable success in the modeling of non-linear dynamical systems. The reservoir is randomly connected with fixed weights that don't change in the learning process. Only the weights from reservoir to output are trained. Since the reservoir is fixed during the training procedure, we may wonder if the computational power of the recurrent structure is fully harnessed. In this article, we propose a new computational model of the ESN type that represents the reservoir weights in the Fourier space and fine-tunes these weights by applying genetic algorithms in the frequency domain. The main interest is that this procedure works in a much smaller space than the classical ESN, thus providing a dimensionality reduction transformation of the initial method. The proposed technique allows us to exploit the benefits of the large recurrent structure while avoiding the training problems of gradient-based methods. We provide a detailed experimental study that demonstrates the good performance of our approach on well-known chaotic systems and real-world data.
    Explanation as Question Answering based on a Task Model of the Agent's Design. (arXiv:2206.05030v1 [cs.HC])
    We describe a stance towards the generation of explanations in AI agents that is both human-centered and design-based. We collect questions about the working of an AI agent through participatory design by focus groups. We capture an agent's design through a Task-Method-Knowledge model that explicitly specifies the agent's tasks and goals, as well as the mechanisms, knowledge and vocabulary it uses for accomplishing the tasks. We illustrate our approach through the generation of explanations in Skillsync, an AI agent that links companies and colleges for worker upskilling and reskilling. In particular, we embed a question-answering agent called AskJill in Skillsync, where AskJill contains a TMK model of Skillsync's design. AskJill presently answers human-generated questions about Skillsync's tasks and vocabulary, and thereby helps explain how it produces its recommendations.
    Efficient Heterogeneous Treatment Effect Estimation With Multiple Experiments and Multiple Outcomes. (arXiv:2206.04907v1 [cs.LG])
    Learning heterogeneous treatment effects (HTEs) is an important problem across many fields. Most existing methods consider the setting with a single treatment arm and a single outcome metric. However, in many real world domains, experiments are run consistently - for example, in internet companies, A/B tests are run every day to measure the impacts of potential changes across many different metrics of interest. We show that even if an analyst cares only about the HTEs in one experiment for one metric, precision can be improved greatly by analyzing all of the data together to take advantage of cross-experiment and cross-outcome metric correlations. We formalize this idea in a tensor factorization framework and propose a simple and scalable model which we refer to as the low rank or LR-learner. Experiments in both synthetic and real data suggest that the LR-learner can be much more precise than independent HTE estimation.
    Fisher SAM: Information Geometry and Sharpness Aware Minimisation. (arXiv:2206.04920v1 [cs.LG])
    The recently proposed sharpness-aware minimisation (SAM) is known to find flat minima, which are beneficial for better generalisation with improved robustness. SAM essentially modifies the loss function by reporting the maximum loss value within a small neighborhood around the current iterate. However, it uses the Euclidean ball to define the neighborhood, which can be inaccurate since loss functions for neural networks are typically defined over probability distributions (e.g., class predictive probabilities), rendering the parameter space non-Euclidean. In this paper we consider the information geometry of the model parameter space when defining the neighborhood, namely replacing SAM's Euclidean balls with ellipsoids induced by the Fisher information. Our approach, dubbed Fisher SAM, defines more accurate neighborhood structures that conform to the intrinsic metric of the underlying statistical manifold. For instance, SAM may probe the worst-case loss value at a point that is either too near or inappropriately distant because it ignores the parameter space geometry, which our Fisher SAM avoids. Another recent approach, Adaptive SAM, stretches/shrinks the Euclidean ball in accordance with the scale of the parameter magnitudes. This might be dangerous, potentially destroying the neighborhood structure. We demonstrate improved performance of the proposed Fisher SAM on several benchmark datasets/tasks.
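    The difference between the two neighborhoods can be seen in the first-order worst-case perturbation. Maximizing $g^\top \epsilon$ over the Euclidean ball $\|\epsilon\| \le \rho$ gives $\epsilon = \rho\, g / \|g\|$, while maximizing over the ellipsoid $\epsilon^\top F \epsilon \le \rho^2$ gives $\epsilon = \rho\, F^{-1} g / \sqrt{g^\top F^{-1} g}$. The sketch below computes both; it assumes a diagonal matrix `fisher_diag` with made-up values and does not reproduce how the paper estimates the Fisher information.

```python
import math


def sam_perturbation(grad, rho):
    """Maximizer of <grad, eps> over the Euclidean ball ||eps|| <= rho."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [rho * g / norm for g in grad]


def fisher_perturbation(grad, fisher_diag, rho):
    """Maximizer of <grad, eps> over the ellipsoid eps^T F eps <= rho^2, F diagonal."""
    finv_g = [g / f for g, f in zip(grad, fisher_diag)]
    scale = math.sqrt(sum(g * h for g, h in zip(grad, finv_g)))
    return [rho * h / scale for h in finv_g]


grad = [3.0, 4.0]
fisher_diag = [4.0, 1.0]  # hypothetical diagonal Fisher approximation
rho = 0.1

eps_sam = sam_perturbation(grad, rho)
eps_fisher = fisher_perturbation(grad, fisher_diag, rho)

# Both perturbations sit on the boundary of their respective neighborhoods.
euclidean_radius = math.sqrt(sum(e * e for e in eps_sam))
ellipsoid_radius_sq = sum(f * e * e for f, e in zip(fisher_diag, eps_fisher))
```

    With these numbers the Fisher-shaped perturbation moves less along the first coordinate, where the assumed Fisher entry is large, than the Euclidean one does.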
    A bio-inspired implementation of a sparse-learning spike-based hippocampus memory model. (arXiv:2206.04924v1 [cs.NE])
    The nervous system, more specifically, the brain, is capable of solving complex problems simply and efficiently, far surpassing modern computers. In this regard, neuromorphic engineering is a research field that focuses on mimicking the basic principles that govern the brain in order to develop systems that achieve such computational capabilities. Within this field, bio-inspired learning and memory systems are still a challenge to be solved, and this is where the hippocampus is involved. It is the region of the brain that acts as a short-term memory, allowing the learning and unstructured and rapid storage of information from all the sensory nuclei of the cerebral cortex and its subsequent recall. In this work, we propose a novel bio-inspired memory model based on the hippocampus with the ability to learn memories, recall them from a cue (a part of the memory associated with the rest of the content) and even forget memories when trying to learn others with the same cue. This model has been implemented on the SpiNNaker hardware platform using Spiking Neural Networks, and a set of experiments and tests were performed to demonstrate its correct and expected operation. The proposed spike-based memory model generates spikes only when it receives an input, being energy efficient, and it needs 7 timesteps for the learning step and 6 timesteps for recalling a previously-stored memory. This work presents the first hardware implementation of a fully functional bio-inspired spike-based hippocampus memory model, paving the road for the development of future more complex neuromorphic systems.
    Response to: Significance and stability of deep learning-based identification of subtypes within major psychiatric disorders. Molecular Psychiatry (2022). (arXiv:2206.04934v1 [cs.LG])
    Recently, Winter and Hahn [1] commented on our work on identifying subtypes of major psychiatric disorders (MPDs) based on neurobiological features using machine learning [2]. They questioned the generalizability of our methods and the statistical significance, stability, and overfitting of the results, and proposed a pipeline for disease subtyping. We appreciate their earnest consideration of our work; however, we need to point out their misconceptions of basic machine-learning concepts and delineate some key issues involved.
    Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models. (arXiv:2206.04959v1 [cs.LG])
    Foundation models are becoming the dominant deep learning technologies. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computing-intensive, the training process is extremely memory-intensive and communication-intensive. These features make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism and tensor model parallelism, to achieve high training efficiency. To achieve this goal, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who need to manually modify the model to parallelize training; ii) their utilization of computation, GPU memory and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys models automatically using a model partitioner that applies a graph sharding algorithm to a proxy representation of the model. Merak also presents a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit available training resources, including a shifted critical path pipeline schedule that brings higher computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak can speed up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
    Explaining Neural Networks without Access to Training Data. (arXiv:2206.04891v1 [cs.LG])
    We consider generating explanations for neural networks in cases where the network's training data is not accessible, for instance due to privacy or safety issues. Recently, $\mathcal{I}$-Nets have been proposed as a sample-free approach to post-hoc, global model interpretability that does not require access to training data. They formulate interpretation as a machine learning task that maps network representations (parameters) to a representation of an interpretable function. In this paper, we extend the $\mathcal{I}$-Net framework to the cases of standard and soft decision trees as surrogate models. We propose a suitable decision tree representation and design of the corresponding $\mathcal{I}$-Net output layers. Furthermore, we make $\mathcal{I}$-Nets applicable to real-world tasks by considering more realistic distributions when generating the $\mathcal{I}$-Net's training data. We empirically evaluate our approach against traditional global, post-hoc interpretability approaches and show that it achieves superior results when the training data is not accessible.
    Multi-fidelity Hierarchical Neural Processes. (arXiv:2206.04872v1 [cs.LG])
    Science and engineering fields use computer simulation extensively. These simulations are often run at multiple levels of sophistication to balance accuracy and efficiency. Multi-fidelity surrogate modeling reduces the computational cost by fusing different simulation outputs. Cheap data generated from low-fidelity simulators can be combined with limited high-quality data generated by an expensive high-fidelity simulator. Existing methods based on Gaussian processes rely on strong assumptions about the kernel functions and can hardly scale to high-dimensional settings. We propose Multi-fidelity Hierarchical Neural Processes (MF-HNP), a unified neural latent variable model for multi-fidelity surrogate modeling. MF-HNP inherits the flexibility and scalability of Neural Processes. The latent variables transform the correlations among different fidelity levels from observations to latent space. The predictions across fidelities are conditionally independent given the latent states. This helps alleviate the error propagation issue in existing methods. MF-HNP is flexible enough to handle non-nested high dimensional data at different fidelity levels with varying input and output dimensions. We evaluate MF-HNP on epidemiology and climate modeling tasks, achieving competitive performance in terms of accuracy and uncertainty estimation. In contrast to deep Gaussian Processes, which handle only low-dimensional (< 10) tasks, our method shows great promise for speeding up high-dimensional complex simulations (over 7000 dimensions for epidemiology modeling and 45000 for climate modeling).
    What should AI see? Using the Public's Opinion to Determine the Perception of an AI. (arXiv:2206.04776v1 [cs.LG])
    Deep neural networks (DNNs) have made impressive progress in the interpretation of image data, so that it is conceivable and to some degree realistic to use them in safety-critical applications like automated driving. From an ethical standpoint, the AI algorithm should take into account the vulnerability of objects or subjects on the street, which ranges from "not at all", e.g. the road itself, to the "high vulnerability" of pedestrians. One way to take this into account is to define the cost of confusing one semantic category with another and to use cost-based decision rules for the interpretation of probabilities, which are the output of DNNs. However, it is an open problem how to define the cost structure, who should be in charge of doing that, and thereby of defining what AI algorithms will actually "see". As one possible answer, we follow a participatory approach and set up an online survey to ask the public to define the cost structure. We present the survey design and the data acquired along with an evaluation that also distinguishes between perspective (car passenger vs. external traffic participant) and gender. Using simulation-based $F$-tests, we find highly significant differences between the groups. These differences have consequences for the reliable detection of pedestrians at a safety-critical distance from the self-driving car. We discuss the ethical problems that are related to this approach and also discuss the problems emerging from human-machine interaction through the survey from a psychological point of view. Finally, we include comments from industry leaders in the field of AI safety on the applicability of survey-based elements in the design of AI functionalities in automated driving.
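    A cost-based decision rule of the kind described here replaces the usual "pick the most probable class" with "pick the class whose expected confusion cost is lowest". The sketch below illustrates this on two classes; the cost values are invented for the example and are not the survey results, and all names are hypothetical.

```python
CLASSES = ["road", "pedestrian"]

# COST[true_class][predicted_class]; illustrative numbers only.
COST = {
    "road": {"road": 0.0, "pedestrian": 1.0},
    "pedestrian": {"road": 10.0, "pedestrian": 0.0},
}


def expected_cost(probs, predicted):
    """Expected confusion cost of outputting `predicted` under class probabilities `probs`."""
    return sum(probs[true] * COST[true][predicted] for true in CLASSES)


def cost_based_decision(probs):
    """Class with the smallest expected confusion cost."""
    return min(CLASSES, key=lambda c: expected_cost(probs, c))


def argmax_decision(probs):
    """Standard maximum-probability decision, for comparison."""
    return max(CLASSES, key=lambda c: probs[c])


probs = {"road": 0.7, "pedestrian": 0.3}  # e.g. softmax output for one pixel
plain = argmax_decision(probs)
cost_aware = cost_based_decision(probs)
```

    Here the pixel is most likely road, yet predicting "road" carries an expected cost of 0.3 × 10 = 3 against 0.7 × 1 = 0.7 for "pedestrian", so the cost-aware rule prefers the more vulnerable class.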
    The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the \emph{Grokking Phenomenon}. (arXiv:2206.04817v1 [cs.LG])
    The \emph{grokking phenomenon} as reported by Power et al.~\cite{power2021grokking} refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the \emph{Slingshot Mechanism}. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that without explicit regularization, Grokking as reported in \cite{power2021grokking} almost exclusively happens at the onset of \emph{Slingshots}, and is absent without them. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theories that we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of their origin.
    Beyond the Gates of Euclidean Space: Temporal-Discrimination-Fusions and Attention-based Graph Neural Network for Human Activity Recognition. (arXiv:2206.04855v1 [cs.LG])
    Human activity recognition (HAR) through wearable devices has received much interest due to its numerous applications in fitness tracking, wellness screening, and supported living. As a result, we have seen a great deal of work in this field. Traditional deep learning (DL) has set state-of-the-art performance in the HAR domain. However, it ignores the data's structure and the association between consecutive time stamps. To address this constraint, we offer an approach based on Graph Neural Networks (GNNs) for structuring the input representation and exploiting the relations among the samples. However, even when using a simple graph convolution network to eliminate this shortcoming, there are still several limiting factors, such as inter-class activity issues, skewed class distribution, and a lack of consideration for sensor data priority, all of which harm the HAR model's performance. To improve the current HAR model's performance, we investigate novel possibilities within the framework of graph structure to achieve highly discriminative and rich activity features. We propose a model comprising (1) a time-series-graph module that converts raw data from a HAR dataset into graphs; (2) Graph Convolutional Neural Networks (GCNs) to discover local dependencies and correlations between neighboring nodes; and (3) a self-attention GNN encoder to identify sensor interactions and data priorities. To the best of our knowledge, this is the first work for HAR that introduces a GNN-based approach incorporating both the GCN and the attention mechanism. Under a uniform evaluation method, our framework significantly improves performance on a hospital patients' activities dataset compared with other state-of-the-art baseline methods.
    Conformal Prediction Intervals for Markov Decision Process Trajectories. (arXiv:2206.04860v1 [cs.LG])
    Before delegating a task to an autonomous system, a human operator may want a guarantee about the behavior of the system. This paper extends previous work on conformal prediction for functional data and conformalized quantile regression to provide conformal prediction intervals over the future behavior of an autonomous system executing a fixed control policy on a Markov Decision Process (MDP). The prediction intervals are constructed by applying conformal corrections to prediction intervals computed by quantile regression. The resulting intervals guarantee that with probability $1-\delta$ the observed trajectory will lie inside the prediction interval, where the probability is computed with respect to the starting state distribution and the stochasticity of the MDP. The method is illustrated on MDPs for invasive species management and StarCraft2 battles.
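    The conformal correction applied to quantile-regression intervals can be sketched for a single scalar outcome. The code below follows the standard split-conformal recipe for conformalized quantile regression; it is not the paper's trajectory-level construction, and the calibration numbers and function name are made up for illustration. Each calibration point is scored by how far the observed value falls outside its predicted interval, and the interval is then widened by an empirical quantile of those scores.

```python
import math


def cqr_correction(lower, upper, observed, delta):
    """Split-conformal correction for quantile-regression intervals.

    lower, upper: predicted interval endpoints on a held-out calibration set.
    observed: the corresponding observed outcomes.
    delta: target miscoverage level (coverage is 1 - delta).
    """
    # Signed distance outside the interval (negative when comfortably inside).
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, observed)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - delta))
    if rank > n:
        return math.inf  # too few calibration points for the requested level
    return scores[rank - 1]


# Hypothetical calibration data: the regressor always predicted [0, 1].
cal_lower = [0.0] * 9
cal_upper = [1.0] * 9
cal_observed = [0.5, 1.2, -0.1, 0.9, 1.5, 0.3, 0.7, 1.1, 0.2]

q = cqr_correction(cal_lower, cal_upper, cal_observed, delta=0.25)

# Apply the correction to a new quantile-regression interval.
new_lower, new_upper = 0.0, 1.0
corrected = (new_lower - q, new_upper + q)
```

    With nine calibration points and `delta=0.25`, the rank is $\lceil 10 \times 0.75 \rceil = 8$, so the eighth-smallest score (about 0.2) is used and the new interval widens to roughly $[-0.2, 1.2]$.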
    Imitation Learning via Differentiable Physics. (arXiv:2206.04873v1 [cs.LG])
    Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process, alternating between learning a reward function and a policy and tend to suffer long training time and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, i.e., Imitation Learning via Differentiable Physics (ILD), which gets rid of the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. The proposed ILD incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, simply minimizing the distance between the expert trajectory and the agent trajectory, and back-propagating the gradient into the policy via temporal physics operators. With the physics prior, ILD policies can not only be transferable to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves the stability and training speed. To simplify the complex optimization landscape induced by temporal physics operations, ILD dynamically selects the learning objectives for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods in a variety of continuous control tasks with Brax, requiring only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and can be generalized to unseen configurations.
    Efficient Per-Shot Convex Hull Prediction By Recurrent Learning. (arXiv:2206.04877v1 [eess.IV])
    Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search process over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we propose a deep learning based method of content aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls, and offers competitive time savings as compared to existing approaches. On average, our method reduced the pre-encoding time by 58.0%, while the average Bjontegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.08% and the mean absolute deviation of the BD-rate distribution was 0.44%.
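For readers unfamiliar with the exhaustive baseline that the prediction model replaces, the following sketch shows how the optimal operating points can be picked from a set of pre-encoded (bitrate, quality) pairs by taking the upper convex hull. The function name `rate_quality_hull`, the quality scale, and the sample encodes are assumptions for illustration and are not taken from the paper.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def rate_quality_hull(encodes):
    """Return the operating points on the upper convex hull.

    encodes: iterable of (bitrate, quality) pairs, one per pre-encode.
    The result is sorted by bitrate and stops at the highest-quality point,
    since spending more bits for lower quality is never useful.
    """
    # Keep only the best quality seen at each bitrate.
    best = {}
    for rate, quality in encodes:
        if rate not in best or quality > best[rate]:
            best[rate] = quality
    points = sorted(best.items())

    # Monotone-chain scan: drop any point that lies on or below the hull.
    hull = []
    for p in points:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)

    # Truncate the descending tail after the maximum-quality point.
    top = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: top + 1]


# Hypothetical pre-encodes of one shot: (bitrate in kbps, quality score).
encodes = [(500, 60), (1000, 75), (1500, 78), (2000, 90),
           (2500, 85), (3000, 92), (4000, 91)]
print(rate_quality_hull(encodes))
```

Every pair in `encodes` costs one full encode of the shot, which is the overhead the abstract refers to.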
    NAGphormer: Neighborhood Aggregation Graph Transformer for Node Classification in Large Graphs. (arXiv:2206.04910v1 [cs.LG])
    Graph Transformers have demonstrated superiority on various graph learning tasks in recent years. However, the complexity of existing Graph Transformers scales quadratically with the number of nodes, making it hard to scale to graphs with thousands of nodes. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that is scalable to large graphs with millions of nodes. Before feeding the node features into the Transformer model, NAGphormer constructs tokens for each node by a neighborhood aggregation module called Hop2Token. For each node, Hop2Token aggregates neighborhood features from each hop into a representation, and thereby produces a sequence of token vectors. Subsequently, the resulting sequence of different hop information serves as input to the Transformer model. By considering each node as a sequence, NAGphormer could be trained in a mini-batch manner and thus could scale to large graphs. NAGphormer further develops an attention-based readout function so as to learn the importance of each hop adaptively. We conduct extensive experiments on various popular benchmarks, including six small datasets and three large datasets. The results demonstrate that NAGphormer consistently outperforms existing Graph Transformers and mainstream Graph Neural Networks.
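The per-node token sequence can be sketched as repeated neighborhood aggregation. The code below is a minimal illustration that assumes plain mean aggregation over neighbors; the normalization used by the actual Hop2Token module may differ, and the function name `hop_tokens` and the toy graph are hypothetical.

```python
def hop_tokens(adj, feats, num_hops):
    """Build a (num_hops + 1)-long token sequence for every node.

    adj:   dict mapping node -> list of neighbor nodes
    feats: dict mapping node -> list of floats (the node's features)
    Token 0 is the node's own feature vector; token k is the result of
    applying mean neighbor aggregation k times.
    """
    tokens = {v: [list(feats[v])] for v in adj}
    current = {v: list(feats[v]) for v in adj}
    for _ in range(num_hops):
        nxt = {}
        for v, nbrs in adj.items():
            if nbrs:
                dim = len(current[v])
                nxt[v] = [sum(current[u][d] for u in nbrs) / len(nbrs)
                          for d in range(dim)]
            else:
                # Isolated node: carry its representation forward unchanged.
                nxt[v] = list(current[v])
        current = nxt
        for v in adj:
            tokens[v].append(current[v])
    return tokens


# Toy path graph 0 - 1 - 2 with one-dimensional features.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: [1.0], 1: [0.0], 2: [3.0]}
print(hop_tokens(adj, feats, num_hops=2)[0])
```

Because the tokens are computed once up front, each node becomes an independent short sequence, which is what allows mini-batch training over nodes as described above.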
    HDTorch: Accelerating Hyperdimensional Computing with GP-GPUs for Design Space Exploration. (arXiv:2206.04746v1 [cs.LG])
    HyperDimensional Computing (HDC) as a machine learning paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other Machine Learning (ML) approaches. Frameworks enabling fast design space exploration to find practical algorithms are necessary to make HD computing competitive with other ML techniques. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch's utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average (training)/inference speedups of (111x/68x)/87x for classical/online HD, respectively. Moreover, we analyze the effects of varying hyperparameters on runtime and accuracy. Finally, we demonstrate how HDTorch enables exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entirety of the CHB-MIT EEG epilepsy database. Results show that the typical approach of training on a subset of the data does not necessarily generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.
    Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation. (arXiv:2206.04785v1 [cs.CV])
    Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE estimation procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a 30.6% improvement on the overall mean per-joint position error, while leading to a 22% drop in parameters compared to the state-of-the-art.
    Crust Macrofracturing as the Evidence of the Last Deglaciation. (arXiv:2206.02652v2 [physics.geo-ph] UPDATED)
    Machine learning methods were applied to reconsider the results of several passive seismic experiments in Finland. We created datasets from different stages of the receiver function technique and processed them with one of basic machine learning algorithms. All the results were obtained uniformly with the $k$-nearest neighbors algorithm. The first result is the Moho depth map of the region. Another result is the delineation of the near-surface low $S$-wave velocity layer. There are three such areas in the Northern, Southern, and central parts of the region. The low $S$-wave velocity in the Northern and Southern areas can be linked to the geological structure. However, we attribute the central low $S$-wave velocity area to a large number of water-saturated cracks in the upper 1-5 km. Analysis of the structure of this area leads us to the conclusion that macrofracturing was caused by the last deglaciation.
    On the Bias-Variance Characteristics of LIME and SHAP in High Sparsity Movie Recommendation Explanation Tasks. (arXiv:2206.04784v1 [cs.LG])
    We evaluate two popular local explainability techniques, LIME and SHAP, on a movie recommendation task. We discover that the two methods behave very differently depending on the sparsity of the data set. LIME does better than SHAP in dense segments of the data set and SHAP does better in sparse segments. We trace this difference to the differing bias-variance characteristics of the underlying estimators of LIME and SHAP. We find that SHAP exhibits lower variance in sparse segments of the data compared to LIME. We attribute this lower variance to the completeness constraint property inherent in SHAP and missing in LIME. This constraint acts as a regularizer and therefore increases the bias of the SHAP estimator but decreases its variance, leading to a favorable bias-variance trade-off especially in high sparsity data settings. With this insight, we introduce the same constraint into LIME and formulate a novel local explainability framework called Completeness-Constrained LIME (CLIMB) that is superior to LIME and much faster than SHAP.
    Syntactic Inductive Biases for Deep Learning Methods. (arXiv:2206.04806v1 [cs.LG])
    In this thesis, we try to build a connection between the two schools by introducing syntactic inductive biases for deep learning models. We propose two families of inductive biases, one for constituency structure and another one for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation provides a way for deep learning models to build latent hierarchical representations from sequential inputs, in which a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators into representations of expressions according to their syntactic structure. On the other hand, the dependency inductive bias encourages models to find the latent relations between entities in the input sequence. For natural language, the latent relations are usually modeled as a directed dependency graph, where a word has exactly one parent node and zero or several children nodes. After applying this constraint to a Transformer-like model, we find the model is capable of inducing directed graphs that are close to human expert annotations, and it also outperforms the standard transformer model on different tasks. We believe that these experimental results demonstrate an interesting alternative for the future development of deep learning models.
    Data-Efficient Double-Win Lottery Tickets from Robust Pre-training. (arXiv:2206.04762v1 [cs.LG])
    Pre-training serves as a broadly adopted starting point for transfer learning on various downstream tasks. Recent investigations of lottery tickets hypothesis (LTH) demonstrate such enormous pre-trained models can be replaced by extremely sparse subnetworks (a.k.a. matching subnetworks) without sacrificing transferability. However, practical security-crucial applications usually pose more challenging requirements beyond standard transfer, which also demand these subnetworks to overcome adversarial vulnerability. In this paper, we formulate a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do. We comprehensively examine various pre-training mechanisms and find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts. For example, on downstream CIFAR-10/100 datasets, we identify double-win matching subnetworks with the standard, fast adversarial, and adversarial pre-training from ImageNet, at 89.26%/73.79%, 89.26%/79.03%, and 91.41%/83.22% sparsity, respectively. Furthermore, we observe the obtained double-win lottery tickets can be more data-efficient to transfer, under practical data-limited (e.g., 1% and 10%) downstream schemes. Our results show that the benefits from robust pre-training are amplified by the lottery ticket scheme, as well as the data-limited transfer setting. Codes are available at https://github.com/VITA-Group/Double-Win-LTH.
    Motif Mining and Unsupervised Representation Learning for BirdCLEF 2022. (arXiv:2206.04805v1 [cs.SD])
    We build a classification model for the BirdCLEF 2022 challenge using unsupervised methods. We implement an unsupervised representation of the training dataset using a triplet loss on spectrogram representation of audio motifs. Our best model performs with a score of 0.48 on the public leaderboard.
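As a reminder of what a triplet loss computes, here is a minimal sketch on plain embedding vectors. The Euclidean distance, the margin value, and the function name `triplet_loss` are assumptions; the abstract does not state which distance or margin the authors used.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on three embedding vectors.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`; otherwise it grows linearly.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# The positive is already much closer than the negative, so the loss is zero.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0]))
```

In the setting of the abstract, the three inputs would be embeddings of spectrogram motifs, with the anchor and the positive expected to be similar.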
    I'm Me, We're Us, and I'm Us: Tri-directional Contrastive Learning on Hypergraphs. (arXiv:2206.04739v1 [cs.LG])
    Although machine learning on hypergraphs has attracted considerable attention, most of the works have focused on (semi-)supervised learning, which may cause heavy labeling costs and poor generalization. Recently, contrastive learning has emerged as a successful unsupervised representation learning method. Despite the prosperous development of contrastive learning in other domains, contrastive learning on hypergraphs remains little explored. In this paper, we propose TriCon (Tri-directional Contrastive learning), a general framework for contrastive learning on hypergraphs. Its main idea is tri-directional contrast, and specifically, it aims to maximize in two augmented views the agreement (a) between the same node, (b) between the same group of nodes, and (c) between each group and its members. Together with simple but surprisingly effective data augmentation and negative sampling schemes, these three forms of contrast enable TriCon to capture both microscopic and mesoscopic structural information in node embeddings. Our extensive experiments using 13 baseline approaches, five datasets, and two tasks demonstrate the effectiveness of TriCon, and most noticeably, TriCon consistently outperforms not just unsupervised competitors but also (semi-)supervised competitors mostly by significant margins for node classification.
    Comprehensive Fair Meta-learned Recommender System. (arXiv:2206.04789v1 [cs.IR])
    In recommender systems, one common challenge is the cold-start problem, where interactions are very limited for fresh users in the systems. To address this challenge, recently, many works introduce the meta-optimization idea into the recommendation scenarios, i.e. learning to learn the user preference by only a few past interaction items. The core idea is to learn global shared meta-initialization parameters for all users and rapidly adapt them into local parameters for each user respectively. They aim at deriving general knowledge across preference learning of various users, so as to rapidly adapt to the future new user with the learned prior and a small amount of training data. However, previous works have shown that recommender systems are generally vulnerable to bias and unfairness. Despite the success of meta-learning at improving the recommendation performance with cold-start, the fairness issues are largely overlooked. In this paper, we propose a comprehensive fair meta-learning framework, named CLOVER, for ensuring the fairness of meta-learned recommendation models. We systematically study three kinds of fairness - individual fairness, counterfactual fairness, and group fairness in the recommender systems, and propose to satisfy all three kinds via a multi-task adversarial learning scheme. Our framework offers a generic training paradigm that is applicable to different meta-learned recommender systems. We demonstrate the effectiveness of CLOVER on the representative meta-learned user preference estimator on three real-world data sets. Empirical results show that CLOVER achieves comprehensive fairness without deteriorating the overall cold-start recommendation performance.
    Communication Efficient Distributed Learning for Kernelized Contextual Bandits. (arXiv:2206.04835v1 [cs.LG])
    We tackle the communication efficiency challenge of learning kernelized contextual bandits in a distributed setting. Despite the recent advances in communication-efficient distributed bandit learning, existing solutions are restricted to simple models like multi-armed bandits and linear bandits, which hampers their practical utility. In this paper, instead of assuming the existence of a linear reward mapping from the features to the expected rewards, we consider non-linear reward mappings, by letting agents collaboratively search in a reproducing kernel Hilbert space (RKHS). This introduces significant challenges in communication efficiency as distributed kernel learning requires the transfer of raw data, leading to a communication cost that grows linearly w.r.t. time horizon $T$. We address this issue by equipping all agents to communicate via a common Nyström embedding that gets updated adaptively as more data points are collected. We rigorously prove that our algorithm can attain a sub-linear rate in both regret and communication cost.
    Trimmed Maximum Likelihood Estimation for Robust Learning in Generalized Linear Models. (arXiv:2206.04777v1 [cs.LG])
    We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.
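The iterative trimming heuristic can be sketched for the simplest case, Gaussian regression with a single scalar weight and no intercept, where the maximum likelihood fit is ordinary least squares and the per-sample negative log-likelihood is the squared residual up to constants. The function name `trimmed_least_squares`, the `n_trim` parameter, and the toy data are assumptions for illustration, not the paper's exact procedure.

```python
def trimmed_least_squares(xs, ys, n_trim, max_iters=50):
    """Alternate between fitting on a subset and re-selecting the subset.

    Each round fits y ~ w * x by least squares on the kept samples, then
    keeps the len(xs) - n_trim samples with the smallest squared residual.
    Stops when the kept set no longer changes.
    """
    n = len(xs)
    n_keep = n - n_trim
    keep = list(range(n))  # start from all samples
    w = 0.0
    for _ in range(max_iters):
        # Closed-form least-squares fit on the current subset.
        sxx = sum(xs[i] * xs[i] for i in keep)
        sxy = sum(xs[i] * ys[i] for i in keep)
        w = sxy / sxx
        # Trim: keep the samples the current fit explains best.
        order = sorted(range(n), key=lambda i: (ys[i] - w * xs[i]) ** 2)
        new_keep = sorted(order[:n_keep])
        if new_keep == keep:
            break
        keep = new_keep
    return w, keep


# Clean relation y = 2x, with the last two labels corrupted.
xs = [float(x) for x in range(1, 11)]
ys = [2.0 * x for x in xs]
ys[8], ys[9] = 100.0, -50.0
w, kept = trimmed_least_squares(xs, ys, n_trim=2)
print(w, kept)
```

For Poisson or Binomial regression the same loop applies with the least-squares fit replaced by the corresponding likelihood fit and the squared residual replaced by the per-sample negative log-likelihood.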
    Deep Leakage from Model in Federated Learning. (arXiv:2206.04887v1 [cs.LG])
    Distributed machine learning has been widely used in recent years to tackle the large and complex dataset problem. Consequently, the security of distributed learning has also drawn increasing attention from both academia and industry. In this context, federated learning (FL) was developed as a "secure" form of distributed learning that keeps private training data local and communicates only public model gradients. However, to date, a variety of gradient leakage attacks have been proposed for this procedure and prove that it is insecure. Yet these attacks share a common drawback: they require too much auxiliary information, such as model weights, optimizers, and some hyperparameters (e.g., learning rate), which is difficult to obtain in real situations. Moreover, many existing algorithms, such as FedAvg, avoid transmitting model gradients in FL and instead send model weights, but few people consider the security risk of doing so. In this paper, we present two novel frameworks, DLM and DLM+, to demonstrate that transmitting model weights is also likely to leak the private local data of clients under the FL scenario. In addition, a number of experiments are performed to illustrate the effect and generality of our attack frameworks. At the end of this paper, we also introduce two defenses against the proposed attacks and evaluate their protective effects. More broadly, the proposed attack and defense schemes can be applied to the general distributed learning scenario as well, with some appropriate customization.
    Connecting Low-Loss Subspace for Personalized Federated Learning. (arXiv:2109.07628v2 [cs.LG] UPDATED)
    Due to the curse of statistical heterogeneity across clients, adopting a personalized federated learning method has become an essential choice for the successful deployment of federated learning-based services. Among diverse branches of personalization techniques, a model mixture-based personalization method is preferred as each client has their own personalized model as a result of federated learning. It usually requires a local model and a federated model, but this approach is either limited to partial parameter exchange or requires additional local updates, each of which is helpless to novel clients and burdensome to the client's computational capacity. As the existence of a connected subspace containing diverse low-loss solutions between two or more independent deep networks has been discovered, we combined this interesting property with the model mixture-based personalized federated learning method for improved performance of personalization. We proposed SuPerFed, a personalized federated learning method that induces an explicit connection between the optima of the local and the federated model in weight space for boosting each other. Through extensive experiments on several benchmark datasets, we demonstrated that our method achieves consistent gains in both personalization performance and robustness to problematic scenarios possible in realistic services.
    Swan: A Neural Engine for Efficient DNN Training on Smartphone SoCs. (arXiv:2206.04687v1 [cs.LG])
    The need to train DNN models on end-user devices (e.g., smartphones) is increasing with the need to improve data privacy and reduce communication overheads. Unlike datacenter servers with powerful CPUs and GPUs, modern smartphones consist of a diverse collection of specialized cores following a system-on-a-chip (SoC) architecture that together perform a variety of tasks. We observe that training DNNs on a smartphone SoC without carefully considering its resource constraints can not only lead to suboptimal training performance but significantly affect user experience as well. In this paper, we present Swan, a neural engine to optimize DNN training on smartphone SoCs without hurting user experience. Extensive large-scale evaluations show that Swan can improve performance by 1.2 - 23.3x over the state-of-the-art.
    NNTrainer: Light-Weight On-Device Training Framework. (arXiv:2206.04688v1 [cs.LG])
    Modern consumer electronic devices have adopted deep learning-based intelligence services for their key features. Vendors have recently started to execute intelligence services on devices to keep personal data on the device and to reduce network and cloud costs. We see this trend as an opportunity to personalize intelligence services by updating neural networks with user data without exposing the data outside the device: on-device training. For example, we may add a new class, my dog, Alpha, for robotic vacuums, adapt speech recognition to the user's accent, or let text-to-speech speak as if the user were speaking. However, the resource limitations of target devices incur significant difficulties. We propose NNTrainer, a light-weight on-device training framework. We describe optimization techniques for neural networks implemented by NNTrainer, which are evaluated alongside conventional approaches. The evaluations show that NNTrainer can reduce memory consumption down to 1/28 without deteriorating accuracy or training time and effectively personalizes applications on devices. NNTrainer is cross-platform and practical open source software, which is being deployed to millions of devices in the authors' affiliation.
    Learning to Efficiently Propagate for Reasoning on Knowledge Graphs. (arXiv:2206.04798v1 [cs.AI])
    Path-based methods are more appealing solutions than embedding methods for knowledge graph reasoning, due to their interpretability and generalization ability to unseen graphs. However, path-based methods usually suffer from the problem of scalability, as the time complexity grows exponentially w.r.t. the length of paths. While recent methods compute reasoning paths with the Bellman-Ford algorithm in polynomial time, the time and memory cost remains very high, as they need to propagate through all the nodes and edges in the graph. In this paper, we propose A*Net, an efficient model for path-based reasoning on knowledge graphs. Inspired by the classical A* algorithm for shortest path problems, our A*Net prioritizes important nodes and edges at each propagation step, to reduce the time and memory footprint. Unlike the classical A* algorithm that uses a heuristic function, we propose to learn the priority function for each node to capture the complex semantics in knowledge graphs. The priority function and the propagation steps are jointly optimized through backpropagation. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, and meanwhile reduces the number of messages, the time and the memory cost up to 7.2$\times$, 3.4$\times$ and 4.9$\times$ respectively.
    COSTA: Covariance-Preserving Feature Augmentation for Graph Contrastive Learning. (arXiv:2206.04726v1 [cs.LG])
    Graph contrastive learning (GCL) improves graph representation learning, leading to SOTA on various downstream tasks. The graph augmentation step is a vital but scarcely studied step of GCL. In this paper, we show that the node embedding obtained via the graph augmentations is highly biased, somewhat limiting contrastive models from learning discriminative features for downstream tasks. Thus, instead of investigating graph augmentation in the input space, we alternatively propose to perform augmentations on the hidden features (feature augmentation). Inspired by so-called matrix sketching, we propose COSTA, a novel COvariance-preServing feaTure space Augmentation framework for GCL, which generates augmented features by maintaining a "good sketch" of original features. To highlight the superiority of feature augmentation with COSTA, we investigate a single-view setting (in addition to the multi-view one) which conserves memory and computations. We show that the feature augmentation with COSTA achieves comparable/better results than graph augmentation based models.
    Leveraging Centric Data Federated Learning Using Blockchain For Integrity Assurance. (arXiv:2206.04731v1 [cs.LG])
    Machine learning abilities have become a vital component for various solutions across industries, applications, and sectors. Many organizations seek to leverage AI-based solutions across their business services to unlock better efficiency and increase productivity. Problems, however, can arise if there is a lack of quality data for AI-model training, scalability, and maintenance. We propose a data-centric federated learning architecture leveraged by a public blockchain and smart contracts to overcome this significant issue. Our proposed solution provides a virtual public marketplace where developers, data scientists, and AI engineers can publish their models and collaboratively create and access quality data for training. We enhance data quality and integrity through an incentive mechanism that rewards contributors for data contribution and verification. Combined with the proposed framework, these mechanisms helped, in a simulation with only one user, to grow the training dataset by an average of 100 inputs daily and to increase model accuracy by approximately 4%.
    Sparsity in Partially Controllable Linear Systems. (arXiv:2110.06150v2 [math.OC] UPDATED)
    A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only partially controllable. The focus of this work is on a large class of partially controllable linear dynamical systems, specified by an underlying sparsity pattern. Our main results establish structural conditions and finite-sample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from high-dimensional statistics -- specifically soft-thresholding and semiparametric least-squares -- to exploit the underlying sparsity pattern in order to obtain finite-sample guarantees that significantly improve over those based on certainty-equivalence. We also corroborate these theoretical improvements over certainty-equivalent control through a simulation study.
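Soft-thresholding, one of the two statistical tools named above, has a simple closed form: $S_\lambda(x) = \mathrm{sign}(x)\,\max(|x| - \lambda, 0)$. The sketch below applies it entrywise to a vector of estimated coefficients; the function names and the sample values are hypothetical, and how the threshold is chosen in the paper is not stated in the abstract.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam, and set it to zero if |x| <= lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def soft_threshold_vector(values, lam):
    """Apply soft-thresholding entrywise, producing a sparse estimate."""
    return [soft_threshold(v, lam) for v in values]


# Small entries are zeroed out; large entries are shrunk by the threshold.
print(soft_threshold_vector([3.0, -0.5, 0.25, -3.0], 1.0))
```

Entries that are driven exactly to zero correspond to couplings treated as absent, which is how a sparsity pattern can be recovered from noisy estimates.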
    Robust Factorization of Real-world Tensor Streams with Patterns, Missing Values, and Outliers. (arXiv:2102.08466v2 [cs.LG] UPDATED)
    Consider multiple seasonal time series being collected in real-time, in the form of a tensor stream. Real-world tensor streams often include missing entries (e.g., due to network disconnection) and at the same time unexpected outliers (e.g., due to system errors). Given such a real-world tensor stream, how can we estimate missing entries and predict future evolution accurately in real-time? In this work, we answer this question by introducing SOFIA, a robust factorization method for real-world tensor streams. In a nutshell, SOFIA smoothly and tightly integrates tensor factorization, outlier removal, and temporal-pattern detection, which naturally reinforce each other. Moreover, SOFIA integrates them in linear time, in an online manner, despite the presence of missing entries. We experimentally show that SOFIA is (a) robust and accurate: yielding up to 76% lower imputation error and 71% lower forecasting error; (b) fast: up to 935X faster than the second-most accurate competitor; and (c) scalable: scaling linearly with the number of new entries per time step.
    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?. (arXiv:2206.05266v1 [cs.LG])
    We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform an evolutionary search to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. Often, the use of self-supervised losses under the existing framework lowered RL performances. We evaluate the approach in multiple different environments including a real-world robot environment and confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we empirically investigate the pretraining framework for SSL + RL and the properties of representations learned with different approaches.
    Balanced Product of Experts for Long-Tailed Recognition. (arXiv:2206.05260v1 [cs.CV])
    Many real-world recognition problems suffer from an imbalanced or long-tailed label distribution. Those distributions make representation learning more challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution shift needs to be addressed. To this aim, recent works have extended softmax cross-entropy using margin modifications, inspired by Bayes' theorem. In this paper, we generalize several approaches with a Balanced Product of Experts (BalPoE), which combines a family of models with different test-time target distributions to tackle the imbalance in the data. The proposed experts are trained in a single stage, either jointly or independently, and fused seamlessly into a BalPoE. We show that BalPoE is Fisher consistent for minimizing the balanced error and perform extensive experiments to validate the effectiveness of our approach. Finally, we investigate the effect of Mixup in this setting, discovering that regularization is a key ingredient for learning calibrated experts. Our experiments show that a regularized BalPoE can perform remarkably well in test accuracy and calibration metrics, leading to state-of-the-art results on CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018 datasets. The code will be made publicly available upon paper acceptance.
    List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering. (arXiv:2206.05245v1 [cs.DS])
    We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat \mu$ such that $\| \widehat \mu - \mu \|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ``certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
    On Convergence of FedProx: Local Dissimilarity Invariant Bounds, Non-smoothness and Beyond. (arXiv:2206.05187v1 [stat.ML])
    The FedProx algorithm is a simple yet powerful distributed proximal point optimization method widely used for federated learning (FL) over heterogeneous data. Despite its popularity and remarkable success in practice, the theoretical understanding of FedProx remains largely underinvestigated: the appealing convergence behavior of FedProx has so far been characterized only under certain non-standard and unrealistic dissimilarity assumptions on the local functions, and the results are limited to smooth optimization problems. To remedy these deficiencies, we develop a novel local dissimilarity invariant convergence theory for FedProx and its minibatch stochastic extension through the lens of algorithmic stability. As a result, we derive several new and deeper insights into FedProx for non-convex federated optimization, including: 1) convergence guarantees independent of local dissimilarity type conditions; 2) convergence guarantees for non-smooth FL problems; and 3) linear speedup with respect to the minibatch size and the number of sampled devices. Our theory reveals for the first time that local dissimilarity and smoothness are not must-haves for FedProx to obtain favorable complexity bounds. Preliminary experimental results on a series of benchmark FL datasets are reported to demonstrate the benefit of minibatching for improving the sample efficiency of FedProx.
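    As a rough illustration of the proximal point idea mentioned in the abstract above, the sketch below runs gradient descent on a one-dimensional local objective with an added proximal term, (w - a)^2 / 2 + (mu / 2) * (w - w_global)^2. The quadratic loss, the function name `prox_local_update`, and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
def prox_local_update(w_global, a, mu, lr=0.1, steps=500):
    """Approximately minimise 0.5*(w - a)**2 + 0.5*mu*(w - w_global)**2.

    Hypothetical toy client: `a` is the minimiser of the local loss and
    `mu` weights the proximal term that pulls w towards the global model.
    """
    w = w_global
    for _ in range(steps):
        # Gradient of the local loss plus gradient of the proximal term.
        grad = (w - a) + mu * (w - w_global)
        w -= lr * grad
    return w


# Closed-form minimiser of the same objective, used as a sanity check.
def closed_form(w_global, a, mu):
    return (a + mu * w_global) / (1.0 + mu)


print(prox_local_update(0.0, 4.0, 1.0), closed_form(0.0, 4.0, 1.0))
```

    With mu = 0 the update reduces to plain local training and the client drifts all the way to its own minimiser; a larger mu keeps it closer to the global model.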
    Weakly-supervised segmentation using inherently-explainable classification models and their application to brain tumour classification. (arXiv:2206.05148v1 [eess.IV])
    Deep learning models have shown their potential for several applications. However, most of the models are opaque and difficult to trust due to their complex reasoning - commonly known as the black-box problem. Some fields, such as medicine, require a high degree of transparency to accept and adopt such technologies. Consequently, creating explainable/interpretable models or applying post-hoc methods on classifiers is required to build trust in deep learning models. Moreover, deep learning methods can be used for segmentation tasks, which typically require hard-to-obtain, time-consuming manually-annotated segmentation labels for training. This paper introduces three inherently-explainable classifiers to tackle both of these problems as one. The localisation heatmaps provided by the networks -- representing the models' focus areas and being used in classification decision-making -- can be directly interpreted, without requiring any post-hoc methods to derive information for model explanation. The models are trained by using the input image and only the classification labels as ground-truth in a supervised fashion - without using any information about the location of the region of interest (i.e. the segmentation labels), making the segmentation training of the models weakly-supervised through classification labels. The final segmentation is obtained by thresholding these heatmaps. The models were employed for the task of multi-class brain tumour classification using two different datasets, resulting in the best F1-score of 0.93 for the supervised classification task while securing a median Dice score of 0.67$\pm$0.08 for the weakly-supervised segmentation task. Furthermore, the obtained accuracy on a subset of tumour-only images outperformed the state-of-the-art glioma tumour grading binary classifiers with the best model achieving 98.7\% accuracy.
    Meta-data Study in Autism Spectrum Disorder Classification Based on Structural MRI. (arXiv:2206.05052v1 [cs.LG])
    Accurate diagnosis of autism spectrum disorder (ASD) based on neuroimaging data has significant implications, as extracting useful information from neuroimaging data for ASD detection is challenging. Even though machine learning techniques have been leveraged to improve the information extraction from neuroimaging data, the varying data quality caused by different meta-data conditions (i.e., data collection strategies) limits the effective information that can be extracted, thus leading to data-dependent predictive accuracies in ASD detection, which can be worse than random guess in some cases. In this work, we systematically investigate the impact of three kinds of meta-data on the predictive accuracy of classifying ASD based on structural MRI collected from 20 different sites, where meta-data conditions vary.
    Human-AI Interaction Design in Machine Teaching. (arXiv:2206.05182v1 [cs.HC])
    Machine Teaching (MT) is an interactive process where a human and a machine interact with the goal of training a machine learning model (ML) for a specified task. The human teacher communicates their task expertise and the machine student gathers the required data and knowledge to produce an ML model. MT systems are developed to jointly minimize the time spent on teaching and the learner's error rate. The design of human-AI interaction in an MT system not only impacts the teaching efficiency, but also indirectly influences the ML performance by affecting the teaching quality. In this paper, we build upon our previous work where we proposed an MT framework with three components, viz., the teaching interface, the machine learner, and the knowledge base, and focus on the human-AI interaction design involved in realizing the teaching interface. We outline design decisions that need to be addressed in developing an MT system beginning from an ML task. The paper follows the Socratic method entailing a dialogue between a curious student and a wise teacher.
    Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms. (arXiv:2206.05053v1 [cs.HC])
    The COVID-19 pandemic has accelerated research on the design of alternative, quick and effective COVID-19 diagnosis approaches. In this paper, we describe the Coswara tool, a website application designed to enable COVID-19 detection by analysing respiratory sound samples and health symptoms. A user of this service can log into a website using any device connected to the internet, provide their current health symptom information and record a few sound samples corresponding to breathing, cough, and speech. Within a minute of analysis of this information on a cloud server, the website tool will output a COVID-19 probability score to the user. As the COVID-19 pandemic continues to demand massive and scalable population level testing, we hypothesize that the proposed tool provides a potential solution towards this.
    Fast Deep Autoencoder for Federated learning. (arXiv:2206.05136v1 [cs.LG])
    This paper presents a novel, fast and privacy preserving implementation of deep autoencoders. DAEF (Deep Autoencoder for Federated learning), unlike traditional neural networks, trains a deep autoencoder network in a non-iterative way, which drastically reduces its training time. Its training can be carried out in a distributed way (several partitions of the dataset in parallel) and incrementally (aggregation of partial models), and due to its mathematical formulation, the data that is exchanged does not endanger the privacy of the users. This makes DAEF a valid method for edge computing and federated learning scenarios. The method has been evaluated and compared to traditional (iterative) deep autoencoders using seven real anomaly detection datasets, and their performance has been shown to be similar despite DAEF's faster training.
    Provable Guarantees for Sparsity Recovery with Deterministic Missing Data Patterns. (arXiv:2206.04893v1 [cs.LG])
    We study the problem of consistently recovering the sparsity pattern of a regression parameter vector from correlated observations governed by deterministic missing data patterns using Lasso. We consider the case in which the observed dataset is censored by a deterministic, non-uniform filter. Recovering the sparsity pattern in datasets with deterministic missing structure can be arguably more challenging than recovering in a uniformly-at-random scenario. In this paper, we propose an efficient algorithm for missing value imputation by utilizing the topological property of the censorship filter. We then provide novel theoretical results for exact recovery of the sparsity pattern using the proposed imputation strategy. Our analysis shows that, under certain statistical and topological conditions, the hidden sparsity pattern can be recovered consistently with high probability in polynomial time and logarithmic sample complexity.
    Out of Sight, Out of Mind: A Source-View-Wise Feature Aggregation for Multi-View Image-Based Rendering. (arXiv:2206.04906v1 [cs.CV])
    To estimate the volume density and color of a 3D point in multi-view image-based rendering, a common approach is to inspect whether a consensus exists among the given source image features, which is one of the informative cues for the estimation procedure. To this end, most of the previous methods utilize equally-weighted aggregation features. However, this can make it hard to check for a consensus when some outliers, which frequently arise from occlusions, are included in the source image feature set. In this paper, we propose a novel source-view-wise feature aggregation method, which helps find the consensus in a robust way by leveraging local structures in the feature set. We first calculate the source-view-wise distance distribution for each source feature for the proposed aggregation. After that, the distance distribution is converted to several similarity distributions with the proposed learnable similarity mapping functions. Finally, for each element in the feature set, the aggregation features are extracted by calculating the weighted means and variances, where the weights are derived from the similarity distributions. In experiments, we validate the proposed method on various benchmark datasets, including synthetic and real image scenes. The experimental results demonstrate that incorporating the proposed features improves the performance by a large margin, resulting in state-of-the-art performance.
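    The aggregation described above (per-element distances, a distance-to-similarity mapping, then weighted means and variances) can be sketched on scalar features as follows. The fixed mapping exp(-distance / temperature) is an assumption standing in for the learnable similarity mapping functions of the paper, and the function name `aggregate` and the sample values are hypothetical.

```python
import math


def aggregate(features, temperature=1.0):
    """Return one (weighted mean, weighted variance) pair per source feature.

    Assumption: similarity = exp(-|f_i - f_j| / temperature), a fixed
    stand-in for a learned similarity mapping.
    """
    results = []
    for f_i in features:
        # Distance from this source feature to every source feature.
        weights = [math.exp(-abs(f_i - f_j) / temperature) for f_j in features]
        total = sum(weights)
        mean = sum(w * f for w, f in zip(weights, features)) / total
        var = sum(w * (f - mean) ** 2 for w, f in zip(weights, features)) / total
        results.append((mean, var))
    return results


features = [1.0, 1.1, 0.9, 10.0]  # the last value plays the role of an occluded outlier
plain_mean = sum(features) / len(features)
weighted = aggregate(features)
print(plain_mean, weighted[0])
```

    In this toy setting the equally-weighted mean is dragged towards the outlier, while the weighted mean computed from the first feature's point of view stays close to the three agreeing values.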
    Hierarchical mixtures of Gaussians for combined dimensionality reduction and clustering. (arXiv:2206.04841v1 [cs.LG])
    To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality-reduction and clustering, and its performance is quantified in closed-form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they exceed the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.
    Binarizing Split Learning for Data Privacy Enhancement and Computation Reduction. (arXiv:2206.04864v1 [cs.LG])
    Split learning (SL) enables data privacy preservation by allowing clients to collaboratively train a deep learning model with the server without sharing raw data. However, SL still has limitations such as potential data privacy leakage and high computation at clients. In this study, we propose to binarize the SL local layers for faster computation (up to 17.5 times less forward-propagation time in both training and inference phases on mobile devices) and reduced memory usage (up to 32 times less memory and bandwidth requirements). More importantly, the binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy. To further enhance the privacy preservation, we also propose two novel approaches: 1) training with additional local leak loss and 2) applying differential privacy, which could be integrated separately or concurrently into the B-SL model. Experimental results with different datasets have affirmed the advantages of the B-SL models compared with several benchmark models. The effectiveness of B-SL models against feature-space hijacking attack (FSHA) is also illustrated. Our results have demonstrated B-SL models are promising for lightweight IoT/mobile applications with high privacy-preservation requirements such as mobile healthcare applications.
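    To make the memory figure above concrete, the sketch below binarizes a small list of activations by sign and packs one bit per value, then compares the size with 32-bit floating-point storage. Sign binarization, the zero-maps-to-plus-one convention, and the helper names `binarize` and `pack_bits` are assumptions for illustration; the abstract does not specify the binarization scheme.

```python
import struct


def binarize(values):
    """Map each value to +1 or -1 by sign (zero is sent to +1 in this sketch)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack +1/-1 signs into bytes, one bit per value (+1 -> 1, -1 -> 0)."""
    out = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


activations = [0.5, -1.25, 3.0, -0.1, 0.0, 2.2, -7.0, 0.9] * 4  # 32 values
full_precision = struct.pack(f"{len(activations)}f", *activations)  # 4 bytes each
packed = pack_bits(binarize(activations))
ratio = len(full_precision) / len(packed)
print(len(full_precision), len(packed), ratio)
```

    Going from 32 bits to 1 bit per value is consistent with the "up to 32 times" reduction quoted in the abstract, although the paper's actual accounting may differ.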
    Stable and memory-efficient image recovery using monotone operator learning (MOL). (arXiv:2206.04797v1 [cs.CV])
    We introduce a monotone deep equilibrium learning framework for large-scale inverse problems in imaging. The proposed algorithm relies on forward-backward splitting, where each iteration consists of a gradient descent involving the score function and a conjugate gradient algorithm to encourage data consistency. The score function is modeled as a monotone convolutional neural network. The use of a monotone operator offers several benefits, including guaranteed convergence, uniqueness of fixed point, and robustness to input perturbations, similar to the use of convex priors in compressive sensing. In addition, the proposed formulation is significantly more memory-efficient than unrolled methods, which allows us to apply it to 3D problems that current unrolled algorithms cannot handle. Experiments show that the proposed scheme can offer improved performance in 3D settings while being stable in the presence of input perturbations.
    Deep learning-enhanced ensemble-based data assimilation for high-dimensional nonlinear dynamical systems. (arXiv:2206.04811v1 [cs.LG])
    Data assimilation (DA) is a key component of many forecasting models in science and engineering. DA allows one to estimate better initial conditions using an imperfect dynamical model of the system and noisy/sparse observations available from the system. Ensemble Kalman filter (EnKF) is a DA algorithm that is widely used in applications involving high-dimensional nonlinear dynamical systems. However, EnKF requires evolving large ensembles of forecasts using the dynamical model of the system. This often becomes computationally intractable, especially when the number of states of the system is very large, e.g., for weather prediction. With small ensembles, the estimated background error covariance matrix in the EnKF algorithm suffers from sampling error, leading to an erroneous estimate of the analysis state (initial condition for the next forecast cycle). In this work, we propose hybrid ensemble Kalman filter (H-EnKF), which is applied to a two-layer quasi-geostrophic flow system as a test case. This framework utilizes a pre-trained deep learning-based data-driven surrogate that inexpensively generates and evolves a large data-driven ensemble of the states of the system to accurately compute the background error covariance matrix with less sampling error. The H-EnKF framework estimates a better initial condition without the need for any ad-hoc localization strategies. H-EnKF can be extended to any ensemble-based DA algorithm, e.g., particle filters, which are currently difficult to use for high dimensional systems.
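    The sampling-error argument above can be illustrated with a scalar toy problem: estimate the variance of a standard normal variable from ensembles of different sizes and measure the average error. This is only a sketch of why small ensembles give noisy covariance estimates; it is not the H-EnKF algorithm, and the function name, ensemble sizes, and seed are assumptions.

```python
import random
import statistics


def mean_abs_variance_error(ensemble_size, trials=200, seed=0):
    """Average |sample variance - 1| over repeated ensembles drawn from N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        ensemble = [rng.gauss(0.0, 1.0) for _ in range(ensemble_size)]
        # statistics.variance is the unbiased (n - 1) sample variance.
        total += abs(statistics.variance(ensemble) - 1.0)
    return total / trials


small = mean_abs_variance_error(5)
large = mean_abs_variance_error(500)
print(f"ensemble of 5: {small:.3f}, ensemble of 500: {large:.3f}")
```

    The larger ensemble gives a much smaller average error, which is the motivation for generating many cheap ensemble members with a surrogate model.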
    Deep Auto-encoder with Neural Response. (arXiv:2111.15309v2 [cs.LG] UPDATED)
    Artificial neural network (ANN) is a versatile tool to study the neural representation in the ventral visual stream, and the knowledge in neuroscience in return inspires ANN models to improve performance in the task. However, it is still unclear how to merge these two directions into a unified framework. In this study, we propose an integrated framework called Deep Autoencoder with Neural Response (DAE-NR), which incorporates information from ANN and the visual cortex to achieve better image reconstruction performance and higher neural representation similarity between biological and artificial neurons. The same visual stimuli (i.e., natural images) are input to both the mouse brain and DAE-NR. The encoder of DAE-NR jointly learns the dependencies from neural spike encoding and image reconstruction. For the neural spike encoding task, the features derived from a specific hidden layer of the encoder are transformed by a mapping function to predict the ground-truth neural response under the constraint of image reconstruction. Simultaneously, for the image reconstruction task, the latent representation obtained by the encoder is assigned to a decoder to restore the original image under the guidance of neural information. In DAE-NR, the learning processes of the encoder, mapping function and decoder are all implicitly constrained by these two tasks. Our experiments demonstrate that, if and only if joint learning is used, DAE-NRs can improve the performance of visual image reconstruction and increase the representation similarity between biological neurons and artificial neurons. The DAE-NR offers a new perspective on the integration of computer vision and neuroscience.
    Gaussian Mixture Variational Autoencoder with Contrastive Learning for Multi-Label Classification. (arXiv:2112.00976v2 [cs.LG] UPDATED)
    Multi-label classification (MLC) is a prediction task where each sample can have more than one label. We propose a novel contrastive learning boosted multi-label prediction model based on a Gaussian mixture variational autoencoder (C-GMVAE), which learns a multimodal prior space and employs a contrastive loss. Many existing methods introduce extra complex neural modules like graph neural networks to capture the label correlations, in addition to the prediction modules. We find that by using contrastive learning in the supervised setting, we can exploit label information effectively in a data-driven manner, and learn meaningful feature and label embeddings which capture the label correlations and enhance the predictive power. Our method also adopts the idea of learning and aligning latent spaces for both features and labels. In contrast to previous works based on a unimodal prior, C-GMVAE imposes a Gaussian mixture structure on the latent space, to alleviate the posterior collapse and over-regularization issues. C-GMVAE outperforms existing methods on multiple public datasets and can often match other models' full performance with only 50% of the training data. Furthermore, we show that the learnt embeddings provide insights into the interpretation of label-label interactions.
    Solving PDEs on Unknown Manifolds with Machine Learning. (arXiv:2106.06682v2 [math.NA] UPDATED)
    This paper proposes a mesh-free computational framework and machine learning theory for solving elliptic PDEs on unknown manifolds, identified with point clouds, based on diffusion maps (DM) and deep learning. The PDE solver is formulated as a supervised learning task to solve a least-squares regression problem that imposes an algebraic equation approximating a PDE (and boundary conditions if applicable). This algebraic equation involves a graph-Laplacian type matrix obtained via DM asymptotic expansion, which is a consistent estimator of second-order elliptic differential operators. The resulting numerical method is to solve a highly non-convex empirical risk minimization problem subjected to a solution from a hypothesis space of neural networks. In a well-posed elliptic PDE setting, when the hypothesis space consists of neural networks with either infinite width or depth, we show that the global minimizer of the empirical loss function is a consistent solution in the limit of large training data. When the hypothesis space is a two-layer neural network, we show that for a sufficiently large width, gradient descent can identify a global minimizer of the empirical loss function. Supporting numerical examples demonstrate the convergence of the solutions, ranging from simple manifolds with low and high co-dimensions, to rough surfaces with and without boundaries. We also show that the proposed NN solver can robustly generalize the PDE solution on new data points with generalization errors that are almost identical to the training errors, superseding a Nystrom-based interpolation method.
    Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization. (arXiv:2106.11180v3 [math.OC] UPDATED)
    Established approaches to obtain generalization bounds in data-driven optimization and machine learning mostly build on solutions from empirical risk minimization (ERM), which depend crucially on the functional complexity of the hypothesis class. In this paper, we present an alternate route to obtain these bounds on the solution from distributionally robust optimization (DRO), a recent data-driven optimization framework based on worst-case analysis and the notion of ambiguity set to capture statistical uncertainty. In contrast to the hypothesis class complexity in ERM, our DRO bounds depend on the ambiguity set geometry and its compatibility with the true loss function. Notably, when using maximum mean discrepancy as a DRO distance metric, our analysis implies generalization bounds whose dependence on the hypothesis class appears the minimal possible: The bound depends solely on the true loss function, independent of any other candidates in the hypothesis class. To our best knowledge, it is the first generalization bound of this type in the literature, and we hope our findings can open the door for a better understanding of DRO, especially its benefits on loss minimization and other machine learning applications.
    Meta-Reinforcement Learning with Self-Modifying Networks. (arXiv:2202.02363v2 [cs.LG] UPDATED)
    Deep Reinforcement Learning has demonstrated the potential of neural networks tuned with gradient descent for solving complex tasks in well-delimited environments. However, these neural systems are slow learners producing specialised agents with no mechanism to continue learning beyond their training curriculum. On the contrary, biological synaptic plasticity is persistent and manifold, and has been hypothesised to play a key role in executive functions such as working memory and cognitive flexibility, potentially supporting more efficient and generic learning abilities. Inspired by this, we propose to build networks with dynamic weights, able to continually perform self-reflexive modification as a function of their current synaptic state and action-reward feedback, rather than a fixed network configuration. The resulting model, MetODS (for Meta-Optimized Dynamical Synapses) is a broadly applicable meta-reinforcement learning system able to learn efficient and powerful control rules in the agent policy space. A single layer with dynamic synapses can perform one-shot learning, generalize navigation principles to unseen environments and demonstrate a strong ability to learn adaptive motor policies, comparing favourably with previous meta-reinforcement learning approaches.
    Asymptotic Escape of Spurious Critical Points on the Low-rank Matrix Manifold. (arXiv:2107.09207v2 [math.OC] UPDATED)
    We show that on the manifold of fixed-rank and symmetric positive semi-definite matrices, the Riemannian gradient descent algorithm almost surely escapes some spurious critical points on the boundary of the manifold. Our result is the first to partially overcome the incompleteness of the low-rank matrix manifold without changing the vanilla Riemannian gradient descent algorithm. The spurious critical points are some rank-deficient matrices that capture only part of the eigen components of the ground truth. Unlike classical strict saddle points, they exhibit very singular behavior. We show that using the dynamical low-rank approximation and a rescaled gradient flow, some of the spurious critical points can be converted to classical strict saddle points in the parameterized domain, which leads to the desired result. Numerical experiments are provided to support our theoretical findings.
    Stochastic Continuous Submodular Maximization: Boosting via Non-oblivious Function. (arXiv:2201.00703v3 [cs.LG] UPDATED)
    In this paper, we revisit Stochastic Continuous Submodular Maximization in both offline and online settings, which can benefit wide applications in machine learning and operations research areas. We present a boosting framework covering gradient ascent and online gradient ascent. The fundamental ingredient of our methods is a novel non-oblivious function $F$ derived from a factor-revealing optimization problem, any stationary point of which provides a $(1-e^{-\gamma})$-approximation to the global maximum of the $\gamma$-weakly DR-submodular objective function $f\in C^{1,1}_L(\mathcal{X})$. Under the offline scenario, we propose a boosting gradient ascent method achieving $(1-e^{-\gamma}-\epsilon^{2})$-approximation after $O(1/\epsilon^2)$ iterations, which improves the $(\frac{\gamma^2}{1+\gamma^2})$ approximation ratio of the classical gradient ascent algorithm. In the online setting, for the first time we consider the adversarial delays for stochastic gradient feedback, under which we propose a boosting online gradient algorithm with the same non-oblivious function $F$. Meanwhile, we verify that this boosting online algorithm achieves a regret of $O(\sqrt{D})$ against a $(1-e^{-\gamma})$-approximation to the best feasible solution in hindsight, where $D$ is the sum of delays of gradient feedback. To the best of our knowledge, this is the first result to obtain $O(\sqrt{T})$ regret against a $(1-e^{-\gamma})$-approximation with $O(1)$ gradient inquiry at each time step, when no delay exists, i.e., $D=T$. Finally, numerical experiments demonstrate the effectiveness of our boosting methods.
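    The two approximation ratios quoted above can be compared numerically. The sketch below simply evaluates $1-e^{-\gamma}$ and $\gamma^2/(1+\gamma^2)$ for a few values of $\gamma$; the function names and the chosen grid of $\gamma$ values are assumptions for illustration.

```python
import math


def boosted_ratio(gamma):
    """Ratio 1 - e^{-gamma} quoted for stationary points of the non-oblivious function."""
    return 1.0 - math.exp(-gamma)


def classical_ratio(gamma):
    """Ratio gamma^2 / (1 + gamma^2) quoted for classical gradient ascent."""
    return gamma ** 2 / (1.0 + gamma ** 2)


for gamma in (0.25, 0.5, 0.75, 1.0):
    print(f"gamma={gamma:.2f}  boosted={boosted_ratio(gamma):.3f}  "
          f"classical={classical_ratio(gamma):.3f}")
```

    For $\gamma = 1$ this gives $1 - 1/e \approx 0.632$ against $1/2$, and the gap widens as $\gamma$ decreases over the values tried here.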
    Recurrent Neural Network Training with Convex Loss and Regularization Functions by Extended Kalman Filtering. (arXiv:2111.02673v2 [cs.LG] UPDATED)
    This paper investigates the use of extended Kalman filtering to train recurrent neural networks with rather general convex loss functions and regularization terms on the network parameters, including $\ell_1$-regularization. We show that the learning method outperforms stochastic gradient descent in a nonlinear system identification benchmark and in training a linear system with binary outputs. We also explore the use of the algorithm in data-driven nonlinear model predictive control and its relation with disturbance models for offset-free closed-loop tracking.
    Assemblies of neurons learn to classify well-separated distributions. (arXiv:2110.03171v2 [cs.NE] UPDATED)
    An assembly is a large population of neurons whose synchronous firing is hypothesized to represent a memory, concept, word, and other cognitive categories. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the Assembly Calculus (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating arbitrary space-bounded computation, but also of simulating complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate learning has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated -- for example, when they are clusters of similar assemblies. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit. Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision -- all key attributes of learning in a model of the brain.
    Coarsening the Granularity: Towards Structurally Sparse Lottery Tickets. (arXiv:2202.04736v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to match full accuracy. Despite many exciting efforts being made, there is one "commonsense" rarely challenged: a winning ticket is found by iterative magnitude pruning (IMP) and hence the resultant pruned subnetworks have only unstructured sparsity. That gap limits the appeal of winning tickets in practice, since the highly irregular sparse patterns are challenging to accelerate on hardware. Meanwhile, directly substituting structured pruning for unstructured pruning in IMP damages performance more severely and is usually unable to locate winning tickets. In this paper, we demonstrate the first positive result that a structurally sparse winning ticket can be effectively found in general. The core idea is to append "post-processing techniques" after each round of (unstructured) IMP, to enforce the formation of structural sparsity. Specifically, we first "re-fill" pruned elements back in some channels deemed to be important, and then "re-group" non-zero elements to create flexible group-wise structural patterns. Both our identified channel- and group-wise structural subnetworks win the lottery, with substantial inference speedups readily supported by existing hardware. Extensive experiments, conducted on diverse datasets across multiple network backbones, consistently validate our proposal, showing that the hardware acceleration roadblock of LTH is now removed. Specifically, the structural winning tickets obtain up to {64.93%, 64.84%, 60.23%} running time savings at {36%~80%, 74%, 58%} sparsity on {CIFAR, Tiny-ImageNet, ImageNet}, while maintaining comparable accuracy. Code is at https://github.com/VITA-Group/Structure-LTH.  ( 2 min )
    Preference Communication in Multi-Objective Normal-Form Games. (arXiv:2111.09191v2 [cs.GT] UPDATED)
    We consider preference communication in two-player multi-objective normal-form games. In such games, the payoffs resulting from joint actions are vector-valued. Taking a utility-based approach, we assume there exists a utility function for each player which maps vectors to scalar utilities and consider agents that aim to maximise the utility of expected payoff vectors. As agents typically do not know their opponent's utility function or strategy, they must learn policies to interact with each other. Inspired by Stackelberg games, we introduce four novel preference communication protocols to aid agents in arriving at adequate solutions. Each protocol describes a specific approach for one agent to communicate preferences over their actions and how another agent responds. Additionally, to study when communication emerges, we introduce a communication protocol where agents must learn when to communicate. These protocols are subsequently evaluated on a set of five benchmark games against baseline agents that do not communicate. We find that preference communication can alter the learning process and lead to the emergence of cyclic policies which had not been previously observed in this setting. We further observe that the resulting policies can heavily depend on the characteristics of the game that is played. Lastly, we find that communication naturally emerges in both cooperative and self-interested settings.  ( 2 min )
    Membership-Mappings for Data Representation Learning: Measure Theoretic Conceptualization. (arXiv:2104.07060v3 [cs.LG] UPDATED)
    A fuzzy theoretic analytical approach was recently introduced that leads to efficient and robust models while automatically addressing the typical issues associated with parametric deep models. However, a formal conceptualization of the fuzzy theoretic analytical deep models is still not available. This paper introduces, on a measure theoretic basis, the notion of \emph{membership-mapping} for representing data points through attribute values (motivated by fuzzy theory). A property of the membership-mapping that can be exploited for data representation learning is that it provides an interpolation on the given data points in the data space. An analytical approach to the variational learning of a membership-mappings based data representation model is considered.  ( 2 min )
    Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks. (arXiv:2112.02845v3 [cs.LG] UPDATED)
    Offline reinforcement learning leverages previously-collected offline datasets to learn optimal policies with no necessity to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets, and use them to examine the usage of the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pretraining to the online fine-tuning, and 3) to that of multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency, and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.  ( 3 min )
    Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products. (arXiv:2202.05120v3 [cs.DS] UPDATED)
    We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ through matrix-vector products, an accuracy parameter $\epsilon$, and a target rank $k$, the goal is to find a rank-$k$ matrix $Z$ with orthonormal columns such that $\| A(I -ZZ^\top)\|_{S_p} \leq (1+\epsilon)\min_{U^\top U = I_k} \|A(I - U U^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the singular values of $M$. For the special cases of $p=2$ (Frobenius norm) and $p = \infty$ (Spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products, improving on the na\"ive $\tilde{O}(k/\epsilon)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/\epsilon))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/\epsilon^{1/3})$ matrix-vector products, and works for all $p \geq 1$. For $p = 2$ our bound improves the previous $\tilde{O}(k/\epsilon^{1/2})$ bound to $\tilde{O}(k/\epsilon^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $(1+ \epsilon)$ factor when $p \geq (\log d)/\epsilon$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $\Omega(1/\epsilon^{1/3})$ for any fixed constant $p \geq 1$, showing that surprisingly $\tilde{\Theta}(1/\epsilon^{1/3})$ is the optimal complexity for constant~$k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously, and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p>2$, we appeal to a norm-compression inequality for aligned partitioned operators.  ( 2 min )
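The objective in the abstract above can be made concrete with a small numerical sketch. The code below only evaluates the Schatten-$p$ residual $\|A(I - ZZ^\top)\|_{S_p}$ and compares it against a reference $Z$ taken from a full SVD; it is not the Krylov algorithm from the paper, and the helper names (`schatten_norm`, `residual_norm`, `top_k_right_singular_vectors`) are hypothetical.

```python
import numpy as np

def schatten_norm(M, p):
    """l_p norm of the singular values of M (p = np.inf gives the spectral norm)."""
    s = np.linalg.svd(M, compute_uv=False)
    if np.isinf(p):
        return float(s.max())
    return float((s ** p).sum() ** (1.0 / p))

def residual_norm(A, Z, p):
    """|| A (I - Z Z^T) ||_{S_p} for a matrix Z with orthonormal columns."""
    d = A.shape[1]
    return schatten_norm(A @ (np.eye(d) - Z @ Z.T), p)

def top_k_right_singular_vectors(A, k):
    """Reference Z from a full SVD (illustrative baseline, not a matrix-vector method)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T

# Small random instance; sizes and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
k = 2
Z_ref = top_k_right_singular_vectors(A, k)
singular_values = np.linalg.svd(A, compute_uv=False)
```

With this reference $Z$, the residual for $p = 2$ equals the $\ell_2$ norm of the singular values beyond the $k$-th, and for $p = \infty$ it equals the $(k+1)$-th singular value. A candidate $Z$ from an iterative method would be judged by how close its residual comes to these values, up to the $(1+\epsilon)$ factor.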
    Domain Transformer: Predicting Samples of Unseen, Future Domains. (arXiv:2106.06057v2 [cs.LG] UPDATED)
    The data distribution commonly evolves over time leading to problems such as concept drift that often decrease classifier performance. Current techniques are not adequate for this problem because they either require detailed knowledge of the transformation or are not suited for anticipating unseen domains but can only adapt to domains where data samples are available. We seek to predict unseen data (and their labels) allowing us to tackle challenges such as a non-constant data distribution in a proactive manner rather than detecting and reacting to already existing changes that might already have led to errors. To this end, we learn a domain transformer in an unsupervised manner that allows generating data of unseen domains. Our approach first matches independently learned latent representations of two given domains obtained from an auto-encoder using a Cycle-GAN. In turn, a transformation of the original samples can be learned that can be applied iteratively to extrapolate to unseen domains. Our evaluation of CNNs on image data confirms the usefulness of the approach. It also achieves very good results on the well-known problem of unsupervised domain adaptation, where only labels but no samples have to be predicted. Code is available at https://github.com/JohnTailor/DoTra.  ( 2 min )
    Deep Learning Based Automated COVID-19 Classification from Computed Tomography Images. (arXiv:2111.11191v3 [eess.IV] UPDATED)
    The paper presents a Convolutional Neural Network (CNN) model for image classification with image preprocessing and hyperparameter tuning, aiming at increasing the predictive performance for COVID-19 diagnosis while avoiding deeper and thus more complex alternatives. Firstly, the CNN model includes four similar convolutional layers followed by a flattening and two dense layers. This work proposes a less complex solution based on simply classifying 2D slices of CT scans using a CNN model. Despite the simplicity in architecture, the proposed CNN model showed improved quantitative results exceeding the state of the art on the dataset of images, in terms of the macro F1 score. The results were achieved on the original CT slices of the dataset. Secondly, the original dataset was processed via anatomy-relevant masking of slices, removing non-representative slices from the CT volume, and hyperparameter tuning. For slice processing, a fixed-sized rectangular area was used for cropping an anatomy-relevant region of interest in the images, and a threshold based on the number of white pixels in binarized slices was employed to remove non-representative slices from the 3D-CT scans. The CNN model with a learning rate schedule with exponential decay and slice flipping techniques was deployed on the processed slices. The proposed method was used to make predictions on the 2D slices. For final diagnosis at a patient level, majority voting was applied on the slices of each CT scan to make the diagnosis. The macro F1 score of the proposed method well exceeded the baseline approach and other alternatives' scores on the validation set as well as on a test partition of the previously unseen images from the COV19-CT-DB dataset partition.  ( 3 min )
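The patient-level aggregation step described above (majority voting over per-slice predictions) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `patient_diagnosis`, the label strings, and the tie-breaking behaviour (the label seen first wins) are all assumptions.

```python
from collections import Counter

def patient_diagnosis(slice_predictions):
    """Return the most frequent slice-level label for one CT scan.

    Ties are resolved in favour of the label encountered first, which is
    an arbitrary choice made for this sketch.
    """
    if not slice_predictions:
        raise ValueError("a scan needs at least one slice prediction")
    return Counter(slice_predictions).most_common(1)[0][0]

# Hypothetical per-slice labels for two scans.
scan_a = ["covid", "non-covid", "covid", "covid"]
scan_b = ["non-covid", "non-covid", "covid"]
diagnoses = [patient_diagnosis(scan_a), patient_diagnosis(scan_b)]
```

In practice the slice labels would come from the CNN's per-slice predictions after non-representative slices have been removed.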
    Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization. (arXiv:2201.12250v2 [cs.LG] UPDATED)
    Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful family of approximations are Kronecker-Factored, block-diagonal curvature estimates (KFAC). Here, we combine tools from prior work to evaluate exact second-order updates with careful ablations to establish a surprising result: Due to its approximations, KFAC is not closely related to second-order updates, and in particular, it significantly outperforms true second-order updates. This challenges widely held beliefs and immediately raises the question of why KFAC performs so well. Towards answering this question we present evidence strongly suggesting that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data-efficiency.  ( 2 min )
    NeuroComb: Improving SAT Solving with Graph Neural Networks. (arXiv:2110.14053v3 [cs.AI] UPDATED)
    Propositional satisfiability (SAT) is an NP-complete problem that impacts many research fields, such as planning, verification, and security. Mainstream modern SAT solvers are based on the Conflict-Driven Clause Learning (CDCL) algorithm. Recent work aimed to enhance CDCL SAT solvers by improving their variable branching heuristics through predictions generated by Graph Neural Networks (GNNs). However, so far this approach either has not made solving more effective, or has required online access to substantial GPU resources. Aiming to make GNN improvements practical, this paper proposes an approach called NeuroComb, which builds on two insights: (1) predictions of important variables and clauses can be combined with dynamic branching into a more effective hybrid branching strategy, and (2) it is sufficient to query the neural model only once for the predictions before the SAT solving starts. NeuroComb is implemented as an enhancement to a classic CDCL solver called MiniSat and a more recent CDCL solver called Glucose. As a result, it allowed MiniSat to solve 11% and Glucose 5% more problems on the recent SATCOMP-2021 competition problem set, with the computational resource requirement of only one GPU. NeuroComb is therefore both an effective and a practical approach to improving SAT solving through machine learning.  ( 2 min )
    Integrated Conditional Estimation-Optimization. (arXiv:2110.12351v2 [stat.ML] UPDATED)
    Many real-world optimization problems involve uncertain parameters with probability distributions that can be estimated using contextual feature information. In contrast to the standard approach of first estimating the distribution of uncertain parameters and then optimizing the objective based on the estimation, we propose an \textit{integrated conditional estimation-optimization} (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provide finite performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from estimated conditional distribution to optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case. Numerical experiments are also conducted to show the empirical success of our approach in different situations including with limited data samples and model mismatches.  ( 2 min )
    Contrastive Supervised Distillation for Continual Representation Learning. (arXiv:2205.05476v2 [cs.CV] UPDATED)
    In this paper, we propose a novel training procedure for the continual representation learning problem in which a neural network model is sequentially learned to alleviate catastrophic forgetting in visual search tasks. Our method, called Contrastive Supervised Distillation (CSD), reduces feature forgetting while learning discriminative features. This is achieved by leveraging labels information in a distillation setting in which the student model is contrastively learned from the teacher model. Extensive experiments show that CSD performs favorably in mitigating catastrophic forgetting by outperforming current state-of-the-art methods. Our results also provide further evidence that feature forgetting evaluated in visual retrieval tasks is not as catastrophic as in classification tasks. Code at: https://github.com/NiccoBiondi/ContrastiveSupervisedDistillation.  ( 2 min )
    Neural Bregman Divergences for Distance Learning. (arXiv:2206.04763v1 [cs.LG])
    Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. Non-Euclidean geometries and their appropriateness are often not explored, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Under the belief that the use of asymmetric methods in particular has lacked sufficient study, we propose a new approach to learning arbitrary Bregman divergences in a differentiable manner via input convex neural networks. Over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering, we demonstrate that our method more faithfully learns divergences than prior Bregman learning approaches. In doing so we obtain the first method for learning neural Bregman divergences and with it inherit the many nice mathematical properties of Bregman divergences, providing the foundation and tooling for better developing and studying asymmetric distance learning.  ( 2 min )
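For readers unfamiliar with the term, a Bregman divergence generated by a differentiable convex function $\phi$ is $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$. The sketch below evaluates this standard definition for two hand-written generators; it does not use an input convex neural network as the paper does, and all function names are hypothetical.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return float(phi(x) - phi(y) - np.dot(grad_phi(y), x - y))

# Generator 1: squared Euclidean norm, which yields the squared Euclidean distance.
def sq_norm(v):
    return np.dot(v, v)

def sq_norm_grad(v):
    return 2.0 * v

# Generator 2: negative entropy, which yields the KL divergence on probability vectors.
def neg_entropy(v):
    return np.sum(v * np.log(v))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

# Two arbitrary probability vectors chosen for illustration.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_euclid = bregman_divergence(sq_norm, sq_norm_grad, x, y)
d_kl_xy = bregman_divergence(neg_entropy, neg_entropy_grad, x, y)
d_kl_yx = bregman_divergence(neg_entropy, neg_entropy_grad, y, x)
```

The second generator shows the asymmetry the abstract refers to: swapping the arguments changes the value, whereas the squared Euclidean case is symmetric.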
    Theoretical Error Performance Analysis for Variational Quantum Circuit Based Functional Regression. (arXiv:2206.04804v1 [quant-ph])
    The noisy intermediate-scale quantum (NISQ) devices enable the implementation of the variational quantum circuit (VQC) for quantum neural networks (QNN). Although the VQC-based QNN has succeeded in many machine learning tasks, the representation and generalization powers of VQC still require further investigation, particularly when the dimensionality reduction of classical inputs is concerned. In this work, we first put forth an end-to-end quantum neural network, namely, TTN-VQC, which consists of a quantum tensor network based on a tensor-train network (TTN) for dimensionality reduction and a VQC for functional regression. Then, we aim at the error performance analysis for the TTN-VQC in terms of representation and generalization powers. We also characterize the optimization properties of TTN-VQC by leveraging the Polyak-Lojasiewicz (PL) condition. Moreover, we conduct the experiments of functional regression on a handwritten digit classification dataset to justify our theoretical analysis.  ( 2 min )
    ReFace: Real-time Adversarial Attacks on Face Recognition Systems. (arXiv:2206.04783v1 [cs.CV])
    Deep neural network based face recognition models have been shown to be vulnerable to adversarial examples. However, many of the past attacks require the adversary to solve an input-dependent optimization problem using gradient descent which makes the attack impractical in real-time. These adversarial examples are also tightly coupled to the attacked model and are not as successful in transferring to different models. In this work, we propose ReFace, a real-time, highly-transferable attack on face recognition models based on Adversarial Transformation Networks (ATNs). ATNs model adversarial example generation as a feed-forward neural network. We find that the white-box attack success rate of a pure U-Net ATN falls substantially short of gradient-based attacks like PGD on large face recognition datasets. We therefore propose a new architecture for ATNs that closes this gap while maintaining a 10000x speedup over PGD. Furthermore, we find that at a given perturbation magnitude, our ATN adversarial perturbations are more effective in transferring to new face recognition models than PGD. ReFace attacks can successfully deceive commercial face recognition services in a transfer attack setting and reduce face identification accuracy from 82% to 16.4% for AWS SearchFaces API and Azure face verification accuracy from 91% to 50.1%.  ( 2 min )
    A Novel Partitioned Approach for Reduced Order Model -- Finite Element Model (ROM-FEM) and ROM-ROM Coupling. (arXiv:2206.04736v1 [math.NA])
    Partitioned methods allow one to build a simulation capability for coupled problems by reusing existing single-component codes. In so doing, partitioned methods can shorten code development and validation times for multiphysics and multiscale applications. In this work, we consider a scenario in which one or more of the "codes" being coupled are projection-based reduced order models (ROMs), introduced to lower the computational cost associated with a particular component. We simulate this scenario by considering a model interface problem that is discretized independently on two non-overlapping subdomains. We then formulate a partitioned scheme for this problem that allows the coupling between a ROM "code" for one of the subdomains with a finite element model (FEM) or ROM "code" for the other subdomain. The ROM "codes" are constructed by performing proper orthogonal decomposition (POD) on a snapshot ensemble to obtain a low-dimensional reduced order basis, followed by a Galerkin projection onto this basis. The ROM and/or FEM "codes" on each subdomain are then coupled using a Lagrange multiplier representing the interface flux. To partition the resulting monolithic problem, we first eliminate the flux through a dual Schur complement. Application of an explicit time integration scheme to the transformed monolithic problem decouples the subdomain equations, allowing their independent solution for the next time step. We show numerical results that demonstrate the proposed method's efficacy in achieving both ROM-FEM and ROM-ROM coupling.  ( 2 min )
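The ROM construction mentioned above (proper orthogonal decomposition of a snapshot ensemble, followed by Galerkin projection onto the resulting basis) can be sketched as follows. This is a generic illustration for a linear system $A u = b$, which is an assumption made for the sketch; it does not include the Lagrange multiplier coupling or the Schur complement step, and the function names are hypothetical.

```python
import numpy as np

def pod_basis(snapshots, r):
    """Leading r left singular vectors of the snapshot matrix (one snapshot per column)."""
    U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]

def galerkin_project(A, b, Phi):
    """Project the full-order system A u = b onto the reduced basis Phi."""
    return Phi.T @ A @ Phi, Phi.T @ b

# Synthetic snapshot ensemble built from two modes, so its rank is exactly 2.
rng = np.random.default_rng(0)
n, m, r = 10, 5, 2
modes = rng.standard_normal((n, r))
snapshots = modes @ rng.standard_normal((r, m))

Phi = pod_basis(snapshots, r)

# Hypothetical full-order operator and right-hand side.
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
A_r, b_r = galerkin_project(A, b, Phi)

# Reduced solve, then lift the reduced solution back to the full space.
u_rom = Phi @ np.linalg.solve(A_r, b_r)
```

Because the synthetic snapshots have rank 2, the two-dimensional basis reproduces them exactly; with real snapshot data the basis size would be chosen from the decay of the singular values.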
    Mobility Improves the Convergence of Asynchronous Federated Learning. (arXiv:2206.04742v1 [cs.LG])
    This paper studies asynchronous Federated Learning (FL) subject to clients' individual arbitrary communication patterns with the parameter server. We propose FedMobile, a new asynchronous FL algorithm that exploits the mobility attribute of the mobile FL system to improve the learning performance. The key idea is to leverage the random client-to-client communication in a mobile network to create additional indirect communication opportunities with the server via upload and download relaying. We prove that FedMobile achieves a convergence rate $O(\frac{1}{\sqrt{NT}})$, where $N$ is the number of clients and $T$ is the number of communication slots, and show that the optimal design involves an interesting trade-off on the best timing of relaying. Our analysis suggests that with an increased level of mobility, asynchronous FL converges faster using FedMobile. Experiment results on a synthetic dataset and two real-world datasets verify our theoretical findings.  ( 2 min )
    STNDT: Modeling Neural Population Activity with a Spatiotemporal Transformer. (arXiv:2206.04727v1 [q-bio.NC])
    Modeling neural population dynamics underlying noisy single-trial spiking activities is essential for relating neural observation and behavior. A recent non-recurrent method - Neural Data Transformers (NDT) - has shown great success in capturing neural dynamics with low inference latency without an explicit dynamical model. However, NDT focuses on modeling the temporal evolution of the population activity while neglecting the rich covariation between individual neurons. In this paper we introduce SpatioTemporal Neural Data Transformer (STNDT), an NDT-based architecture that explicitly models responses of individual neurons in the population across time and space to uncover their underlying firing rates. In addition, we propose a contrastive learning loss that works in accordance with the mask modeling objective to further improve the predictive performance. We show that our model achieves state-of-the-art performance at the ensemble level in estimating neural activities across four neural datasets, demonstrating its capability to capture autonomous and non-autonomous dynamics spanning different cortical regions while being completely agnostic to the specific behaviors at hand. Furthermore, STNDT's spatial attention mechanism reveals consistently important subsets of neurons that play a vital role in driving the response of the entire population, providing interpretability and key insights into how the population of neurons performs computation.  ( 2 min )
    AI-based Clinical Assessment of Optic Nerve Head Robustness Superseding Biomechanical Testing. (arXiv:2206.04689v1 [eess.IV])
    $\mathbf{Purpose}$: To use artificial intelligence (AI) to: (1) exploit biomechanical knowledge of the optic nerve head (ONH) from a relatively large population; (2) assess ONH robustness from a single optical coherence tomography (OCT) scan of the ONH; (3) identify what critical three-dimensional (3D) structural features make a given ONH robust. $\mathbf{Design}$: Retrospective cross-sectional study. $\mathbf{Methods}$: 316 subjects had their ONHs imaged with OCT before and after acute intraocular pressure (IOP) elevation through ophthalmo-dynamometry. IOP-induced lamina-cribrosa (LC) deformations were then mapped in 3D and used to classify ONHs. Those with LC deformations greater than 4% were considered fragile, while those with deformations less than 4% were considered robust. Learning from these data, we compared three AI algorithms to predict ONH robustness strictly from a baseline (undeformed) OCT volume: (1) a random forest classifier; (2) an autoencoder; and (3) a dynamic graph CNN (DGCNN). The latter algorithm also allowed us to identify what critical 3D structural features make a given ONH robust. $\mathbf{Results}$: All 3 methods were able to predict ONH robustness from 3D structural information alone and without the need to perform biomechanical testing. The DGCNN (area under the receiver operating curve [AUC]: 0.76 $\pm$ 0.08) outperformed the autoencoder (AUC: 0.70 $\pm$ 0.07) and the random forest classifier (AUC: 0.69 $\pm$ 0.05). Interestingly, to assess ONH robustness, the DGCNN mainly used information from the scleral canal and the LC insertion sites. $\mathbf{Conclusions}$: We propose an AI-driven approach that can assess the robustness of a given ONH solely from a single OCT scan of the ONH, and without the need to perform biomechanical testing. Longitudinal studies should establish whether ONH robustness could help us identify fast visual field loss progressors.  ( 2 min )
    ReCo: A Dataset for Residential Community Layout Planning. (arXiv:2206.04678v1 [cs.LG])
    Layout planning is centrally important in the field of architecture and urban design. Among the various basic units carrying urban functions, residential community plays a vital part for supporting human life. Therefore, the layout planning of residential community has always been of concern, and has attracted particular attention since the advent of deep learning that facilitates the automated layout generation and spatial pattern recognition. However, the research circles generally suffer from the insufficiency of residential community layout benchmark or high-quality datasets, which hampers the future exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulties of large-scale real-world residential data acquisition and long-term expert screening. In order to address the issues and advance a benchmark dataset for various intelligent spatial design and analysis applications in the development of smart city, we introduce Residential Community Layout Planning (ReCo) Dataset, which is the first and largest open-source vector dataset related to real-world community to date. ReCo Dataset is presented in multiple data formats with 37,646 residential community layout plans, covering 598,728 residential buildings with height information. ReCo can be conveniently adapted for residential community layout related urban design tasks, e.g., generative layout design, morphological pattern recognition and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, a Generative Adversarial Network (GAN) based generative model is further applied to the dataset. We expect ReCo Dataset to inspire more creative and practical work in intelligent design and beyond. The ReCo Dataset is published at: https://www.kaggle.com/fdudsde/reco-dataset.  ( 2 min )
    Extending Momentum Contrast with Cross Similarity Consistency Regularization. (arXiv:2206.04676v1 [cs.LG])
    Contrastive self-supervised representation learning methods maximize the similarity between the positive pairs, and at the same time tend to minimize the similarity between the negative pairs. However, in general the interplay between the negative pairs is ignored as they do not put in place special mechanisms to treat negative pairs differently according to their specific differences and similarities. In this paper, we present Extended Momentum Contrast (XMoCo), a self-supervised representation learning method founded upon the legacy of the momentum-encoder unit proposed in the MoCo family configurations. To this end, we introduce a cross consistency regularization loss, with which we extend the transformation consistency to dissimilar images (negative pairs). Under the cross consistency regularization rule, we argue that semantic representations associated with any pair of images (positive or negative) should preserve their cross-similarity under pretext transformations. Moreover, we further regularize the training loss by enforcing a uniform distribution of similarity over the negative pairs across a batch. The proposed regularization can easily be added to existing self-supervised learning algorithms in a plug-and-play fashion. Empirically, we report a competitive performance on the standard Imagenet-1K linear head classification benchmark. In addition, by transferring the learned representations to common downstream tasks, we show that using XMoCo with the prevalently utilized augmentations can lead to improvements in the performance of such tasks. We hope the findings of this paper serve as a motivation for researchers to take into consideration the important interplay among the negative examples in self-supervised learning.  ( 2 min )
    Adaptive Model Pooling for Online Deep Anomaly Detection from a Complex Evolving Data Stream. (arXiv:2206.04792v1 [cs.LG])
    Online anomaly detection from a data stream is critical for the safety and security of many applications but is facing severe challenges due to complex and evolving data streams from IoT devices and cloud-based infrastructures. Unfortunately, existing approaches fall too short for these challenges; online anomaly detection methods bear the burden of handling the complexity while offline deep anomaly detection methods suffer from the evolving data distribution. This paper presents a framework for online deep anomaly detection, ARCUS, which can be instantiated with any autoencoder-based deep anomaly detection methods. It handles the complex and evolving data streams using an adaptive model pooling approach with two novel techniques: concept-driven inference and drift-aware model pool update; the former detects anomalies with a combination of models most appropriate for the complexity, and the latter adapts the model pool dynamically to fit the evolving data streams. In comprehensive experiments with ten data sets which are both high-dimensional and concept-drifted, ARCUS improved the anomaly detection accuracy of the streaming variants of state-of-the-art autoencoder-based methods and that of the state-of-the-art streaming anomaly detection methods by up to 22% and 37%, respectively.  ( 2 min )
    Unsupervised Deep Discriminant Analysis Based Clustering. (arXiv:2206.04686v1 [cs.LG])
    This work presents an unsupervised deep discriminant analysis for clustering. The method is based on deep neural networks and aims to minimize the intra-cluster discrepancy and maximize the inter-cluster discrepancy in an unsupervised manner. The method is able to project the data into a nonlinear low-dimensional latent space with compact and distinct distribution patterns such that the data clusters can be effectively identified. We further provide an extension of the method such that available graph information can be effectively exploited to improve the clustering performance. Extensive numerical results on image and non-image data with or without graph information demonstrate the effectiveness of the proposed methods.  ( 2 min )
    On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data. (arXiv:2206.04723v1 [cs.LG])
    Existing theory predicts that data heterogeneity will degrade the performance of the Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the simple FedAvg algorithm converges very well. This paper explains the seemingly unreasonable effectiveness of FedAvg that contradicts the previous theoretical predictions. We find that the key assumption of bounded gradient dissimilarity in previous theoretical analyses is too pessimistic to characterize data heterogeneity in practical applications. For a simple quadratic problem, we demonstrate there exist regimes where large gradient dissimilarity does not have any negative impact on the convergence of FedAvg. Motivated by this observation, we propose a new quantity, average drift at optimum, to measure the effects of data heterogeneity, and explicitly use it to present a new theoretical analysis of FedAvg. We show that the average drift at optimum is nearly zero across many real-world federated training tasks, whereas the gradient dissimilarity can be large. And our new analysis suggests FedAvg can have identical convergence rates in homogeneous and heterogeneous data settings, and hence, leads to better understanding of its empirical success.  ( 2 min )
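    As a toy illustration of the quadratic-problem claim above, the sketch below runs FedAvg on two one-dimensional quadratic clients. Everything in it is an assumption made for illustration: the client objective `0.5 * a * (x - c)**2`, the helper names `local_gd` and `fedavg`, and the step size and round counts. It is not the paper's setup or its definition of average drift at optimum. With equal curvatures the averaged iterate reaches the global minimizer even though the client optima (and hence client gradients) are far apart; with unequal curvatures the same procedure settles at a biased point.

```python
def local_gd(x, a, c, lr, steps):
    """Run `steps` gradient steps on the client objective 0.5 * a * (x - c)**2."""
    for _ in range(steps):
        x -= lr * a * (x - c)
    return x


def fedavg(clients, x0, lr, local_steps, rounds):
    """Plain FedAvg: every client starts from the shared x, results are averaged."""
    x = x0
    for _ in range(rounds):
        updates = [local_gd(x, a, c, lr, local_steps) for a, c in clients]
        x = sum(updates) / len(updates)
    return x


# Same curvature, very different optima: the global minimizer is mean(c) = 0.
same_curvature = [(1.0, -10.0), (1.0, 10.0)]
x_same = fedavg(same_curvature, x0=5.0, lr=0.1, local_steps=5, rounds=200)

# Different curvature: the global minimizer is (1*(-10) + 3*10) / (1 + 3) = 5,
# but multiple local steps pull the FedAvg fixed point away from it.
diff_curvature = [(1.0, -10.0), (3.0, 10.0)]
x_diff = fedavg(diff_curvature, x0=5.0, lr=0.1, local_steps=5, rounds=200)

print(x_same, x_diff)
```

    In the first case the gradients at the optimum are +10 and -10, so gradient dissimilarity is large, yet the local updates cancel exactly after averaging.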
    Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference. (arXiv:2206.04685v1 [cs.LG])
    By adding exiting layers to the deep learning networks, early exit can terminate the inference earlier with accurate results. The passive decision-making of whether to exit or continue to the next layer has to go through every pre-placed exiting layer until it exits. In addition, it is also hard to adjust the configurations of the computing platforms as the inference proceeds. By incorporating a low-cost prediction engine, we propose a Predictive Exit framework for computation- and energy-efficient deep learning applications. Predictive Exit can forecast where the network will exit (i.e., establish the number of remaining layers to finish the inference), which effectively reduces the network computation cost by exiting on time without running every pre-placed exiting layer. Moreover, according to the number of remaining layers, proper computing configurations (i.e., frequency and voltage) are selected to execute the network to further save energy. Extensive experimental results demonstrate that Predictive Exit achieves up to 96.2% computation reduction and 72.9% energy-saving compared with classic deep learning networks; and 12.8% computation reduction and 37.6% energy-saving compared with the early exit under state-of-the-art exiting strategies, given the same inference accuracy and latency.  ( 2 min )
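    The passive baseline that the abstract contrasts itself with can be sketched as a confidence-threshold loop over the pre-placed exits. This is a minimal sketch under stated assumptions: the softmax-confidence rule, the function name `passive_early_exit`, and the precomputed `exit_logits` are all hypothetical (a real network would compute each exit lazily), and the paper's prediction engine is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def passive_early_exit(exit_logits, threshold):
    """Visit exits in depth order and stop at the first confident one.

    exit_logits: one list of class logits per pre-placed exit (assumed
    precomputed here for simplicity).
    Returns (exit_index, predicted_class).
    """
    last = len(exit_logits) - 1
    for depth, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        # Every exit up to the chosen one has to be evaluated.
        if confidence >= threshold or depth == last:
            return depth, probs.index(confidence)


exit_logits = [
    [0.1, 0.2, 0.0],  # shallow exit: nearly uniform, not confident
    [1.0, 3.0, 0.5],  # middle exit: confident enough for threshold 0.8
    [0.2, 6.0, 0.1],  # final exit: never reached in this example
]
result = passive_early_exit(exit_logits, threshold=0.8)
print(result)
```

    A predictive scheme would instead estimate the exit index up front and skip the intermediate checks, which is where the reported computation savings over passive early exit would come from.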
    An Empirical Study on Disentanglement of Negative-free Contrastive Learning. (arXiv:2206.04756v1 [cs.LG])
    Negative-free contrastive learning has attracted a lot of attention for its simplicity and impressive performance in large-scale pretraining, but its disentanglement property remains unexplored. In this paper, we take different negative-free contrastive learning methods to study the disentanglement property of this genre of self-supervised methods empirically. We find that the existing disentanglement metrics fail to make meaningful measurements for high-dimensional representation models, so we propose a new disentanglement metric based on Mutual Information between representation and data factors. With the proposed metric, we benchmark the disentanglement property of negative-free contrastive learning for the first time, on both popular synthetic datasets and a real-world dataset, CelebA. Our study shows that the investigated methods can learn a well-disentangled subset of representation. We extend the study of the disentangled representation learning to high-dimensional representation space and negative-free contrastive learning for the first time. The implementation of the proposed metric is available at https://github.com/noahcao/disentanglement_lib_med.  ( 2 min )
    POODLE: Improving Few-shot Learning via Penalizing Out-of-Distribution Samples. (arXiv:2206.04679v1 [cs.LG])
    In this work, we propose to use out-of-distribution samples, i.e., unlabeled samples coming from outside the target classes, to improve few-shot learning. Specifically, we exploit the easily available out-of-distribution samples to drive the classifier to avoid irrelevant features by maximizing the distance from prototypes to out-of-distribution samples while minimizing that of in-distribution samples (i.e., support, query data). Our approach is simple to implement, agnostic to feature extractors, lightweight without any additional cost for pre-training, and applicable to both inductive and transductive settings. Extensive experiments on various standard benchmarks demonstrate that the proposed method consistently improves the performance of pretrained networks with different architectures.  ( 2 min )
    A Learning-Theoretic Framework for Certified Auditing of Machine Learning Models. (arXiv:2206.04740v1 [cs.LG])
    Responsible use of machine learning requires that models be audited for undesirable properties. However, how to do principled auditing in a general setting has remained ill-understood. In this paper, we propose a formal learning-theoretic framework for auditing. We propose algorithms for auditing linear classifiers for feature sensitivity using label queries as well as different kinds of explanations, and provide performance guarantees. Our results illustrate that while counterfactual explanations can be extremely helpful for auditing, anchor explanations may not be as beneficial in the worst case.  ( 2 min )
    A Neural Network Architecture for Program Understanding Inspired by Human Behaviors. (arXiv:2206.04730v1 [cs.SE])
    Program understanding is a fundamental task in program language processing. Despite the success, existing works fail to take human behaviors as reference in understanding programs. In this paper, we consider human behaviors and propose the PGNN-EK model that consists of two main components. On the one hand, inspired by the "divide-and-conquer" reading behaviors of humans, we present a partitioning-based graph neural network model PGNN on the upgraded AST of codes. On the other hand, to characterize human behaviors of resorting to other resources to help code comprehension, we transform raw codes with external knowledge and apply pre-training techniques for information extraction. Finally, we combine the two embeddings generated from the two components to output code embeddings. We conduct extensive experiments to show the superior performance of PGNN-EK on the code summarization and code clone detection tasks. In particular, to show the generalization ability of our model, we release a new dataset that is more challenging for code clone detection and could advance the development of the community. Our codes and data are publicly available at https://github.com/RecklessRonan/PGNN-EK.  ( 2 min )
    Explainable Artificial Intelligence (XAI) for Internet of Things: A Survey. (arXiv:2206.04800v1 [cs.AI])
    The black-box nature of Artificial Intelligence (AI) models does not allow users to comprehend, and sometimes to trust, the output created by such models. In AI applications where not only the results but also the decision paths to the results are critical, such black-box AI models are not sufficient. Explainable Artificial Intelligence (XAI) addresses this problem and defines a set of AI models that are interpretable by the users. Recently, a number of XAI models have been proposed to address the issues surrounding the lack of interpretability and explainability of black-box models in various application areas such as healthcare, military, energy, financial and industrial domains. Although the concept of XAI has gained a great deal of attention recently, its integration into the IoT domain has not yet been fully defined. In this paper, we provide an in-depth and systematic review of recent studies using XAI models within the IoT domain. We categorize the studies according to their methodology and application areas. In addition, we focus on the challenging problems and open issues and give future directions to guide developers and researchers in prospective investigations.  ( 2 min )
    Can Backdoor Attacks Survive Time-Varying Models?. (arXiv:2206.04677v1 [cs.CR])
    Backdoors are powerful attacks against deep neural networks (DNNs). By poisoning training data, attackers can inject hidden rules (backdoors) into DNNs, which only activate on inputs containing attack-specific triggers. While existing work has studied backdoor attacks on a variety of DNN models, they only consider static models, which remain unchanged after initial deployment. In this paper, we study the impact of backdoor attacks on a more realistic scenario of time-varying DNN models, where model weights are updated periodically to handle drifts in data distribution over time. Specifically, we empirically quantify the "survivability" of a backdoor against model updates, and examine how attack parameters, data drift behaviors, and model update strategies affect backdoor survivability. Our results show that one-shot backdoor attacks (i.e., only poisoning training data once) do not survive past a few model updates, even when attackers aggressively increase trigger size and poison ratio. To stay unaffected by model update, attackers must continuously introduce corrupted data into the training pipeline. Together, these results indicate that when models are updated to learn new data, they also "forget" backdoors as hidden, malicious features. The larger the distribution shift between old and new training data, the faster backdoors are forgotten. Leveraging these insights, we apply a smart learning rate scheduler to further accelerate backdoor forgetting during model updates, which prevents one-shot backdoors from surviving past a single model update.  ( 2 min )
    RT-DNAS: Real-time Constrained Differentiable Neural Architecture Search for 3D Cardiac Cine MRI Segmentation. (arXiv:2206.04682v1 [eess.IV])
    Accurately segmenting temporal frames of cine magnetic resonance imaging (MRI) is a crucial step in various real-time MRI guided cardiac interventions. To achieve fast and accurate visual assistance, there are strict requirements on the maximum latency and minimum throughput of the segmentation framework. State-of-the-art neural networks on this task are mostly hand-crafted to satisfy these constraints while achieving high accuracy. On the other hand, while the existing literature has demonstrated the power of neural architecture search (NAS) in automatically identifying the best neural architectures for various medical applications, these searches are mostly guided by accuracy, sometimes with computation complexity, and the importance of real-time constraints is overlooked. A major challenge is that such constraints are non-differentiable and are thus not compatible with the widely used differentiable NAS frameworks. In this paper, we present a strategy that directly handles real-time constraints in a differentiable NAS framework named RT-DNAS. Experiments on the extended 2017 MICCAI ACDC dataset show that compared with state-of-the-art manually and automatically designed architectures, RT-DNAS is able to identify ones with better accuracy while satisfying the real-time constraints.  ( 2 min )
  • Open

    Refined Convergence and Topology Learning for Decentralized Optimization with Heterogeneous Data. (arXiv:2204.04452v2 [cs.LG] UPDATED)
    One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts in decentralized learning. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.  ( 2 min )
    Trace norm regularization for multi-task learning with scarce data. (arXiv:2202.06742v2 [stat.ML] UPDATED)
    Multi-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.  ( 2 min )
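    To make the regularizer concrete: the trace (nuclear) norm of a matrix is the sum of its singular values, and its proximal operator soft-thresholds those singular values, which is what pushes a matrix of stacked task weights toward low rank. The sketch below shows that standard operator with NumPy. The abstract does not say how the estimator is computed, so the function names `trace_norm` and `svt` and the use of a proximal step at all are assumptions for illustration.

```python
import numpy as np


def trace_norm(W):
    """Trace (nuclear) norm: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()


def svt(W, tau):
    """Proximal operator of tau * trace_norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of U by the shrunk singular values, then recombine.
    return (U * s_shrunk) @ Vt


# Hypothetical weight matrix: one column per task, singular values 3, 1, 0.2.
W = np.diag([3.0, 1.0, 0.2])
shrunk = svt(W, tau=0.5)

print(trace_norm(W), trace_norm(shrunk))
print(np.linalg.matrix_rank(shrunk))
```

    Singular values below `tau` are set exactly to zero, so the smallest direction is removed and the result has lower rank than the input.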
    Meta Optimal Transport. (arXiv:2206.05262v1 [cs.LG])
    We study the use of amortized optimization to predict optimal transport (OT) maps from the input measures, which we call Meta OT. This helps repeatedly solve similar OT problems between different measures by leveraging the knowledge and information present from past problems to rapidly predict and solve new problems. Otherwise, standard methods ignore the knowledge of the past solutions and suboptimally re-solve each problem from scratch. Meta OT models surpass the standard convergence rates of log-Sinkhorn solvers in the discrete setting and convex potentials in the continuous setting. We improve the computational time of standard OT solvers by multiple orders of magnitude in discrete and continuous transport settings between images, spherical data, and color palettes. Our source code is available at this http URL  ( 2 min )
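    For context on what is being amortized, the sketch below is a plain Sinkhorn iteration for entropic discrete OT with an optional warm start. It is a minimal sketch and not the Meta OT model: it uses the basic scaling form rather than a log-domain solver, and the `u_init` argument is a hypothetical stand-in for an initialization that a learned predictor might supply. The toy marginals, cost matrix, and `eps` value are likewise assumptions.

```python
import numpy as np


def sinkhorn(a, b, C, eps=0.5, iters=200, u_init=None):
    """Entropic OT between histograms a and b with cost matrix C.

    Returns the transport plan P and the final row scaling u.
    """
    K = np.exp(-C / eps)  # Gibbs kernel
    u = np.ones_like(a) if u_init is None else u_init.copy()
    for _ in range(iters):
        v = b / (K.T @ u)  # rescale to match column marginals
        u = a / (K @ v)    # rescale to match row marginals
    P = u[:, None] * K * v[None, :]
    return P, u


x = np.array([0.0, 0.5, 1.0])
C = (x[:, None] - x[None, :]) ** 2  # squared-distance cost on a 1-D grid
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.3, 0.5])

# Cold start: solve from scratch.
P_cold, u_star = sinkhorn(a, b, C)

# Warm start: a single iteration from a good initial scaling is enough.
P_warm, _ = sinkhorn(a, b, C, iters=1, u_init=u_star)

print(P_cold.sum(axis=1), P_cold.sum(axis=0))
```

    Here the warm start reuses the converged scaling from the same problem, which is the best possible case; an amortized model would only approximate it for a new pair of measures.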
    On the safe use of prior densities for Bayesian model selection. (arXiv:2206.05210v1 [stat.ME])
    The application of Bayesian inference for the purpose of model selection is very popular nowadays. In this framework, models are compared through their marginal likelihoods, or their quotients, called Bayes factors. However, marginal likelihoods depend on the prior choice. For model selection, even diffuse priors can actually be very informative, unlike for the parameter estimation problem. Furthermore, when the prior is improper, the marginal likelihood of the corresponding model is undetermined. In this work, we discuss the issue of prior sensitivity of the marginal likelihood and its role in model selection. We also comment on the use of uninformative priors, which are very common choices in practice. Several practical suggestions are discussed and many possible solutions, proposed in the literature, to design objective priors for model selection are described. Some of them also allow the use of improper priors. The connection between the marginal likelihood approach and the well-known information criteria is also presented. We describe the main issues and possible solutions by illustrative numerical examples, providing also some related code. One of them involves a real-world application to exoplanet detection.  ( 2 min )
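    The prior sensitivity described above can be seen in a textbook one-observation example, sketched here under assumptions of our own choosing (it is not taken from the paper's code). Model M0 says y ~ N(0, 1); model M1 says y ~ N(theta, 1) with prior theta ~ N(0, tau^2), so its marginal likelihood is N(y; 0, 1 + tau^2). The function names and the values of `y` and `tau` are hypothetical.

```python
import math


def normal_pdf(y, var):
    """Density of a zero-mean normal with variance `var`, evaluated at y."""
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_01(y, tau):
    """Bayes factor of M0 (theta = 0) against M1 (theta ~ N(0, tau^2))."""
    marginal_m0 = normal_pdf(y, 1.0)
    marginal_m1 = normal_pdf(y, 1.0 + tau * tau)  # theta integrated out
    return marginal_m0 / marginal_m1


y = 2.0
for tau in (1.0, 10.0, 100.0):
    print(tau, bayes_factor_01(y, tau))
```

    The same observation favors M1 under the moderate prior (Bayes factor below 1) and favors M0 under the very diffuse one (Bayes factor above 1), even though the posterior for theta under M1 barely changes once tau is large. In the improper limit tau → ∞ the factor diverges, which matches the statement that the marginal likelihood is then undetermined.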
    Tackling covariate shift with node-based Bayesian neural networks. (arXiv:2206.02435v2 [stat.ML] UPDATED)
    Bayesian neural networks (BNNs) promise improved generalization under covariate shift by providing principled probabilistic representations of epistemic uncertainty. However, weight-based BNNs often struggle with high computational complexity of large-scale architectures and datasets. Node-based BNNs have recently been introduced as scalable alternatives, which induce epistemic uncertainty by multiplying each hidden node with latent random variables, while learning a point-estimate of the weights. In this paper, we interpret these latent noise variables as implicit representations of simple and domain-agnostic data perturbations during training, producing BNNs that perform well under covariate shift due to input corruptions. We observe that the diversity of the implicit corruptions depends on the entropy of the latent variables, and propose a straightforward approach to increase the entropy of these variables during training. We evaluate the method on out-of-distribution image classification benchmarks, and show improved uncertainty estimation of node-based BNNs under covariate shift due to input perturbations. As a side effect, the method also provides robustness against noisy training labels.  ( 2 min )
    Asymptotic Escape of Spurious Critical Points on the Low-rank Matrix Manifold. (arXiv:2107.09207v2 [math.OC] UPDATED)
    We show that on the manifold of fixed-rank and symmetric positive semi-definite matrices, the Riemannian gradient descent algorithm almost surely escapes some spurious critical points on the boundary of the manifold. Our result is the first to partially overcome the incompleteness of the low-rank matrix manifold without changing the vanilla Riemannian gradient descent algorithm. The spurious critical points are some rank-deficient matrices that capture only part of the eigen components of the ground truth. Unlike classical strict saddle points, they exhibit very singular behavior. We show that using the dynamical low-rank approximation and a rescaled gradient flow, some of the spurious critical points can be converted to classical strict saddle points in the parameterized domain, which leads to the desired result. Numerical experiments are provided to support our theoretical findings.  ( 2 min )
    Popularity Adjusted Block Models are Generalized Random Dot Product Graphs. (arXiv:2109.04010v2 [stat.ML] UPDATED)
    We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.  ( 2 min )
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v2 [cs.LG] UPDATED)
    Parameter estimation in empirical fields is usually undertaken using parametric models, and such models readily facilitate statistical inference. Unfortunately, they are unlikely to be sufficiently flexible to be able to adequately model real-world phenomena, and may yield biased estimates. Conversely, non-parametric approaches are flexible but do not readily facilitate statistical inference and may still exhibit residual bias. We explore the potential for Influence Functions (IFs) to (a) improve initial estimators without needing more data, (b) increase model robustness, and (c) facilitate statistical inference. We begin with a broad introduction to IFs, and propose a neural network method 'MultiNet', which seeks the diversity of an ensemble using a single architecture. We also introduce variants on the IF update step which we call 'MultiStep', and provide a comprehensive evaluation of different approaches. The improvements are found to be dataset dependent, indicating an interaction between the methods used and nature of the data generating process. Our experiments highlight the need for practitioners to check the consistency of their findings, potentially by undertaking multiple analyses with different combinations of estimators. We also show that it is possible to improve existing neural networks for 'free', without needing more data, and without needing to retrain them.  ( 2 min )
    GD-VAEs: Geometric Dynamic Variational Autoencoders for Learning Nonlinear Dynamics and Dimension Reductions. (arXiv:2206.05183v1 [cs.LG])
    We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. We develop approaches for learning nonlinear state space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transpose CNNs (T-CNNs). Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning low dimensional representations of the nonlinear Burgers equations, constrained mechanical systems, and spatial fields of reaction-diffusion systems. GD-VAEs provide methods for obtaining representations for use in learning tasks involving dynamics.  ( 2 min )
    Dynamic mean field programming. (arXiv:2206.05200v1 [stat.ML])
    A dynamic mean field theory is developed for model based Bayesian reinforcement learning in the large state space limit. In an analogy with the statistical physics of disordered systems, the transition probabilities are interpreted as couplings, and value functions as deterministic spins, and thus the sampled transition probabilities are considered to be quenched random variables. The results reveal that, under standard assumptions, the posterior over Q-values is asymptotically independent and Gaussian across state-action pairs, for infinite horizon problems. The finite horizon case exhibits the same behaviour for all state-action pairs at each time but has an additional correlation across time, for each state-action pair. The results also hold for policy evaluation. The Gaussian statistics can be computed from a set of coupled mean field equations derived from the Bellman equation, which we call dynamic mean field programming (DMFP). For Q-value iteration, approximate equations are obtained by appealing to extreme value theory, and closed form expressions are found in the independent and identically distributed case. The Lyapunov stability of these closed form equations is studied.  ( 2 min )
    Street Crossing Aid Using Light-weight CNNs for the Visually Impaired. (arXiv:1909.09598v2 [cs.CV] UPDATED)
    In this paper, we address an issue that the visually impaired commonly face while crossing intersections and propose a solution that takes the form of a mobile application. The application utilizes a deep learning convolutional neural network model, LytNetV2, to output necessary information that the visually impaired may lack when without human companions or guide-dogs. A prototype of the application runs on iOS devices of versions 11 or above. It is designed for comprehensiveness, concision, accuracy, and computational efficiency through delivering the two most important pieces of information, pedestrian traffic light color and direction, required to cross the road in real-time. Furthermore, it is specifically aimed to support those facing financial burden as the solution takes the form of a free mobile application. Through the modification and utilization of key principles in MobileNetV3 such as depthwise separable convolutions and squeeze-excite layers, the deep neural network model achieves a classification accuracy of 96% and average angle error of 6.15 degrees, while running at a frame rate of 16.34 frames per second. Additionally, the model is trained as an image classifier, allowing for a faster and more accurate model. The network is able to outperform other methods such as object detection and non-deep learning algorithms in both accuracy and thoroughness. The information is delivered through both auditory signals and vibrations, and it has been tested on seven visually impaired users and has received above satisfactory responses.  ( 2 min )
    Interactively Learning Preference Constraints in Linear Bandits. (arXiv:2206.05255v1 [cs.LG])
    We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst-case. In the average case, ACOL's sample complexity bound is still significantly tighter than bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.  ( 2 min )
    Integrated Conditional Estimation-Optimization. (arXiv:2110.12351v2 [stat.ML] UPDATED)
    Many real-world optimization problems involve uncertain parameters with probability distributions that can be estimated using contextual feature information. In contrast to the standard approach of first estimating the distribution of uncertain parameters and then optimizing the objective based on the estimation, we propose an integrated conditional estimation-optimization (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provide finite performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from estimated conditional distribution to optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case. Numerical experiments are also conducted to show the empirical success of our approach in different situations including with limited data samples and model mismatches.  ( 2 min )
    Linear regression with partially mismatched data: local search with theoretical guarantees. (arXiv:2106.02175v2 [math.OC] UPDATED)
    Linear regression is a fundamental modeling tool in statistics and related fields. In this paper, we study an important variant of linear regression in which the predictor-response pairs are partially mismatched. We use an optimization formulation to simultaneously learn the underlying regression coefficients and the permutation corresponding to the mismatches. The combinatorial structure of the problem leads to computational challenges. We propose and study a simple greedy local search algorithm for this optimization problem that enjoys strong theoretical guarantees and appealing computational performance. We prove that under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on problem data, our local search algorithm converges to a nearly-optimal solution at a linear rate. In particular, in the noiseless case, our algorithm converges to the global optimal solution with a linear convergence rate. Based on this result, we prove an upper bound for the estimation error of the parameter. We also propose an approximate local search step that allows us to scale our approach to much larger instances. We conduct numerical experiments to gather further insights into our theoretical results, and show promising performance gains compared to existing approaches.  ( 2 min )
    Mixed Logit Models and Network Formation. (arXiv:2006.16516v4 [cs.SI] UPDATED)
    The study of network formation is pervasive in economics, sociology, and many other fields. In this paper, we model network formation as a 'choice' that is made by nodes in a network to connect to other nodes. We study these 'choices' using discrete-choice models, in which an agent chooses between two or more discrete alternatives. We employ the 'repeated-choice' (RC) model to study network formation. We argue that the RC model overcomes important limitations of the multinomial logit (MNL) model, which gives one framework for studying network formation, and that it is well-suited to study network formation. We also illustrate how to use the RC model to accurately study network formation using both synthetic and real-world networks. Using synthetic networks, we also compare the performance of the MNL model and the RC model. We find that the RC model estimates the data-generation process of our synthetic networks more accurately than the MNL model. We do a case study of a qualitatively interesting scenario -- the fact that new patents are more likely to cite older, more cited, and similar patents -- for which the RC model allows us to achieve interesting insights.  ( 2 min )
    Learning Classifiers under Delayed Feedback with a Time Window Assumption. (arXiv:2009.13092v2 [cs.LG] UPDATED)
    We consider training a binary classifier under delayed feedback (\emph{DF learning}). For example, in conversion prediction for online ads, we initially receive negative samples from users who clicked the ads but did not buy an item; subsequently, some of these users buy an item and their samples change to positive. In the setting of DF learning, we observe samples over time, then learn a classifier at some point. We initially receive negative samples; subsequently, some samples among them change to positive. This problem arises in various real-world applications such as online advertisements, where the user action takes place long after the first click. Owing to the delayed feedback, naive classification of the positive and negative samples returns a biased classifier. One solution is to use samples that have been observed for more than a certain time window, assuming these samples are correctly labeled. However, existing studies reported that simply using a subset of all samples based on the time window assumption does not perform well, and that using all samples along with the time window assumption improves empirical performance. We extend these existing studies and propose a method with an unbiased and convex empirical risk that is constructed from all samples under the time window assumption. To demonstrate the soundness of the proposed method, we provide experimental results on a synthetic dataset and an open dataset of real traffic logs from online advertising.  ( 2 min )
    List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering. (arXiv:2206.05245v1 [cs.DS])
    We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat \mu$ such that $\| \widehat \mu - \mu \|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ``certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.  ( 2 min )
    Validity, consonant plausibility measures, and conformal prediction. (arXiv:2001.09225v3 [math.ST] UPDATED)
    Prediction of future observations is an important and challenging problem. The two mainstream approaches for quantifying prediction uncertainty use prediction regions and predictive distributions, respectively, with the latter believed to be more informative because it can perform other prediction-related tasks. The standard notion of validity, what we refer to here as Type-1 validity, focuses on coverage probability of prediction regions, while a notion of validity relevant to the other prediction-related tasks performed by predictive distributions is lacking. Here we present a new notion, called Type-2 validity, relevant to these other prediction tasks. We establish connections between Type-2 validity and coherence properties, and show that imprecise probability considerations are required in order to achieve it. We go on to show that both types of prediction validity can be achieved by interpreting the conformal prediction output as the contour function of a consonant plausibility measure. We also offer an alternative characterization of conformal prediction, based on a new nonparametric inferential model construction, wherein the appearance of consonance is natural, and prove its validity.  ( 2 min )
    Log-concave density estimation in undirected graphical models. (arXiv:2206.05227v1 [math.ST])
    We study the problem of maximum likelihood estimation of densities that are log-concave and lie in the graphical model corresponding to a given undirected graph $G$. We show that the maximum likelihood estimate (MLE) is the product of the exponentials of several tent functions, one for each maximal clique of $G$. While the set of log-concave densities in a graphical model is infinite-dimensional, our results imply that the MLE can be found by solving a finite-dimensional convex optimization problem. We provide an implementation and a few examples. Furthermore, we show that the MLE exists and is unique with probability 1 as long as the number of sample points is larger than the size of the largest clique of $G$ when $G$ is chordal. We show that the MLE is consistent when the graph $G$ is a disjoint union of cliques. Finally, we discuss the conditions under which a log-concave density in the graphical model of $G$ has a log-concave factorization according to $G$.  ( 2 min )
    Hierarchical mixtures of Gaussians for combined dimensionality reduction and clustering. (arXiv:2206.04841v1 [cs.LG])
    To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality-reduction and clustering, and its performance is quantified in closed-form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they exceed the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.  ( 2 min )
    Cross-validation: what does it estimate and how well does it do it?. (arXiv:2104.00673v3 [stat.ME] UPDATED)
    Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.  ( 2 min )
    Trimmed Maximum Likelihood Estimation for Robust Learning in Generalized Linear Models. (arXiv:2206.04777v1 [cs.LG])
    We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.  ( 2 min )
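    The iterative trimming heuristic named in this abstract can be sketched for the simplest case, Gaussian regression, where the maximum likelihood fit is ordinary least squares. This is only an illustrative sketch, not the paper's estimator or analysis: the function names (`fit_line`, `trimmed_fit`), the one-dimensional model, the stopping rule, and the synthetic data are all assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (the Gaussian MLE)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def trimmed_fit(xs, ys, eps, iters=20):
    """Alternate between fitting on the kept points and re-selecting the
    (1 - eps) fraction of all points with the smallest squared residual,
    i.e. the highest Gaussian likelihood under the current fit.
    Hypothetical helper; assumes iters >= 1."""
    n = len(xs)
    n_keep = n - int(eps * n)
    keep = list(range(n))  # start from the full (corrupted) sample
    for _ in range(iters):
        a, b = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
        order = sorted(range(n), key=lambda i: (ys[i] - (a * xs[i] + b)) ** 2)
        new_keep = sorted(order[:n_keep])
        if new_keep == keep:  # kept set is stable, so the fit is too
            break
        keep = new_keep
    return a, b

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise,
# with 10% of the labels overwritten by an outlying value.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
for i in range(20):
    ys[i] = 10.0
a_hat, b_hat = trimmed_fit(xs, ys, eps=0.1)
```

    On this toy data the corrupted labels have by far the largest residuals, so they are dropped after the first pass and the second fit uses clean points only. For other generalized linear models the squared residual would be replaced by the model's negative log-likelihood.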
    A Causal Research Pipeline and Tutorial for Psychologists and Social Scientists. (arXiv:2206.05175v1 [stat.ME])
    Causality is a fundamental part of the scientific endeavour to understand the world. Unfortunately, causality is still taboo in much of psychology and social science. Motivated by a growing number of recommendations for the importance of adopting causal approaches to research, we reformulate the typical approach to research in psychology to harmonize inevitably causal theories with the rest of the research pipeline. We present a new process which begins with the incorporation of techniques from the confluence of causal discovery and machine learning for the development, validation, and transparent formal specification of theories. We then present methods for reducing the complexity of the fully specified theoretical model into the fundamental submodel relevant to a given target hypothesis. From here, we establish whether or not the quantity of interest is estimable from the data, and if so, propose the use of semi-parametric machine learning methods for the estimation of causal effects. The overall goal is the presentation of a new research pipeline which can (a) facilitate scientific inquiry compatible with the desire to test causal theories, (b) encourage transparent representation of our theories as unambiguous mathematical objects, (c) tie our statistical models to specific attributes of the theory, thus reducing under-specification problems frequently resulting from the theory-to-model gap, and (d) yield results and estimates which are causally meaningful and reproducible. The process is demonstrated through didactic examples with real-world data, and we conclude with a summary and discussion of limitations.  ( 2 min )
    Conformal Prediction Intervals for Markov Decision Process Trajectories. (arXiv:2206.04860v1 [cs.LG])
    Before delegating a task to an autonomous system, a human operator may want a guarantee about the behavior of the system. This paper extends previous work on conformal prediction for functional data and conformalized quantile regression to provide conformal prediction intervals over the future behavior of an autonomous system executing a fixed control policy on a Markov Decision Process (MDP). The prediction intervals are constructed by applying conformal corrections to prediction intervals computed by quantile regression. The resulting intervals guarantee that with probability $1-\delta$ the observed trajectory will lie inside the prediction interval, where the probability is computed with respect to the starting state distribution and the stochasticity of the MDP. The method is illustrated on MDPs for invasive species management and StarCraft2 battles.  ( 2 min )
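    The conformal correction to quantile-regression intervals mentioned above can be illustrated for a single scalar outcome. This is a sketch of standard split conformalized quantile regression, not the paper's trajectory-level construction: the names `conformal_margin` and `conformalize`, the constant quantile predictions, and the calibration values are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def conformal_margin(lo, hi, y, delta):
    """Margin to add to quantile-regression intervals [lo, hi].

    Each calibration score measures how far the observed y falls outside
    its predicted interval (negative if it falls inside). The margin is
    the ceil((n + 1) * (1 - delta))-th smallest score.
    """
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - delta))
    return scores[k - 1] if k <= n else math.inf

def conformalize(lo, hi, margin):
    """Widen (or shrink, if margin is negative) a predicted interval."""
    return lo - margin, hi + margin

# Hypothetical calibration set: the quantile model predicts [0, 1] each time.
cal_lo = [0.0] * 7
cal_hi = [1.0] * 7
cal_y = [0.5, 1.2, -0.3, 0.9, 1.5, 0.1, 2.0]

margin = conformal_margin(cal_lo, cal_hi, cal_y, delta=0.25)
interval = conformalize(0.0, 1.0, margin)
```

    With seven calibration points and delta = 0.25, the rank is ceil(8 * 0.75) = 6, so the margin is the sixth-smallest score, 0.5, and the corrected interval for a new prediction of [0, 1] is [-0.5, 1.5]. The coverage guarantee of this construction relies on the calibration and test points being exchangeable.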
    Distributionally Robust End-to-End Portfolio Construction. (arXiv:2206.05134v1 [q-fin.CP])
    We propose an end-to-end distributionally robust system for portfolio construction that integrates the asset return prediction model with a distributionally robust portfolio optimization model. We also show how to learn the risk-tolerance parameter and the degree of robustness directly from data. End-to-end systems have an advantage in that information can be communicated between the prediction and decision layers during training, allowing the parameters to be trained for the final task rather than solely for predictive performance. However, existing end-to-end systems are not able to quantify and correct for the impact of model risk on the decision layer. Our proposed distributionally robust end-to-end portfolio selection system explicitly accounts for the impact of model risk. The decision layer chooses portfolios by solving a minimax problem where the distribution of the asset returns is assumed to belong to an ambiguity set centered around a nominal distribution. Using convex duality, we recast the minimax problem in a form that allows for efficient training of the end-to-end system.  ( 2 min )
    Refining neural network predictions using background knowledge. (arXiv:2206.04976v1 [cs.AI])
    Recent work has shown that we can use logical background knowledge in learning systems to compensate for a lack of labeled training data. Many such methods work by creating a loss function that encodes this knowledge. However, often the logic is discarded after training, even if it is still useful at test-time. Instead, we ensure neural network predictions satisfy the knowledge by refining the predictions with an extra computation step. We introduce differentiable refinement functions that find a corrected prediction close to the original prediction. We study how to effectively and efficiently compute these refinement functions. Using a new algorithm, we combine refinement functions to find refined predictions for logical formulas of any complexity. This algorithm finds optimal refinements on complex SAT formulas in significantly fewer iterations and frequently finds solutions where gradient descent cannot.  ( 2 min )
    Provable Guarantees for Sparsity Recovery with Deterministic Missing Data Patterns. (arXiv:2206.04893v1 [cs.LG])
    We study the problem of consistently recovering the sparsity pattern of a regression parameter vector from correlated observations governed by deterministic missing data patterns using Lasso. We consider the case in which the observed dataset is censored by a deterministic, non-uniform filter. Recovering the sparsity pattern in datasets with deterministic missing structure can be arguably more challenging than recovering in a uniformly-at-random scenario. In this paper, we propose an efficient algorithm for missing value imputation by utilizing the topological property of the censorship filter. We then provide novel theoretical results for exact recovery of the sparsity pattern using the proposed imputation strategy. Our analysis shows that, under certain statistical and topological conditions, the hidden sparsity pattern can be recovered consistently with high probability in polynomial time and logarithmic sample complexity.  ( 2 min )
    The Generalized Eigenvalue Problem as a Nash Equilibrium. (arXiv:2206.04993v1 [cs.LG])
    The generalized eigenvalue problem (GEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent components analysis, partial least squares, linear discriminant analysis, principal components, successor features and others. Despite this, most general solvers are prohibitively expensive when dealing with massive data sets and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ GEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $\mathcal{O}(d^2k)$ complexity per iteration which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to achieve $\mathcal{O}(dk)$ complexity, scaling to datasets $100\times$ larger than those evaluated by prior methods. Empirically we demonstrate that our algorithm is able to solve a variety of GEP problem instances including a large-scale analysis of neural network activations.  ( 2 min )
    Neural Laplace: Learning diverse classes of differential equations in the Laplace domain. (arXiv:2206.04843v1 [cs.LG])
    Neural Ordinary Differential Equations model dynamical systems with \textit{ODE}s learned by neural networks. However, ODEs are fundamentally inadequate to model systems with long-range dependencies or discontinuities, which are common in engineering and biological systems. Broader classes of differential equations (DE) have been proposed as remedies, including delay differential equations and integro-differential equations. Furthermore, Neural ODE suffers from numerical instability when modelling stiff ODEs and ODEs with piecewise forcing functions. In this work, we propose \textit{Neural Laplace}, a unified framework for learning diverse classes of DEs including all the aforementioned ones. Instead of modelling the dynamics in the time domain, we model it in the Laplace domain, where the history-dependencies and discontinuities in time can be represented as summations of complex exponentials. To make learning more efficient, we use the geometrical stereographic map of a Riemann sphere to induce more smoothness in the Laplace domain. In the experiments, Neural Laplace shows superior performance in modelling and extrapolating the trajectories of diverse classes of DEs, including the ones with complex history dependency and abrupt changes.  ( 2 min )
    How Much is Enough? A Study on Diffusion Times in Score-based Generative Models. (arXiv:2206.05173v1 [stat.ML])
    Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off, and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive w.r.t. the state-of-the-art, according to standard sample quality metrics and log-likelihood.  ( 2 min )
    Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations. (arXiv:2206.04779v1 [cs.LG])
    Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning from visual observations has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.  ( 2 min )
    Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality. (arXiv:2206.04921v1 [cs.LG])
    Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study could help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v1 [cs.LG])
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses a provably-exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Weighted Ensembles for Active Learning with Adaptivity. (arXiv:2206.05009v1 [cs.LG])
    Labeled data can be expensive to acquire in several application domains, including medical imaging, robotics, and computer vision. To efficiently train machine learning models under such high labeling costs, active learning (AL) judiciously selects the most informative data instances to label on-the-fly. This active sampling process can benefit from a statistical function model, that is typically captured by a Gaussian process (GP). While most GP-based AL approaches rely on a single kernel function, the present contribution advocates an ensemble of GP models with weights adapted to the labeled data collected incrementally. Building on this novel EGP model, a suite of acquisition functions emerges based on the uncertainty and disagreement rules. An adaptively weighted ensemble of EGP-based acquisition functions is also introduced to further robustify performance. Extensive tests on synthetic and real datasets showcase the merits of the proposed EGP-based approaches with respect to the single GP-based AL alternatives.  ( 2 min )
    Joint Entropy Search For Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v1 [cs.LG])
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Federated Momentum Contrastive Clustering. (arXiv:2206.05093v1 [cs.LG])
    We present federated momentum contrastive clustering (FedMCC), a learning framework that can not only extract discriminative representations over distributed local data but also perform data clustering. In FedMCC, a transformed data pair passes through both the online and target networks, resulting in four representations over which the losses are determined. The resulting high-quality representations generated by FedMCC can outperform several existing self-supervised learning methods for linear evaluation and semi-supervised learning tasks. FedMCC can easily be adapted to ordinary centralized clustering through what we call momentum contrastive clustering (MCC). We show that MCC achieves state-of-the-art clustering accuracy results in certain datasets such as STL-10 and ImageNet-10. We also present a method to reduce the memory footprint of our clustering schemes.  ( 2 min )
    On Convergence of FedProx: Local Dissimilarity Invariant Bounds, Non-smoothness and Beyond. (arXiv:2206.05187v1 [stat.ML])
    The FedProx algorithm is a simple yet powerful distributed proximal point optimization method widely used for federated learning (FL) over heterogeneous data. Despite its popularity and remarkable success witnessed in practice, the theoretical understanding of FedProx is largely underinvestigated: the appealing convergence behavior of FedProx is so far characterized under certain non-standard and unrealistic dissimilarity assumptions of local functions, and the results are limited to smooth optimization problems. In order to remedy these deficiencies, we develop a novel local dissimilarity invariant convergence theory for FedProx and its minibatch stochastic extension through the lens of algorithmic stability. As a result, we derive several new and deeper insights into FedProx for non-convex federated optimization, including: 1) convergence guarantees independent of local dissimilarity type conditions; 2) convergence guarantees for non-smooth FL problems; and 3) linear speedup with respect to size of minibatch and number of sampled devices. Our theory for the first time reveals that local dissimilarity and smoothness are not must-haves for FedProx to get favorable complexity bounds. Preliminary experimental results on a series of benchmark FL datasets are reported to demonstrate the benefit of minibatching for improving the sample efficiency of FedProx.  ( 2 min )
    PAVI: Plate-Amortized Variational Inference. (arXiv:2206.05111v1 [cs.AI])
    Given some observed data and a probabilistic generative model, Bayesian inference aims at obtaining the distribution of a model's latent parameters that could have yielded the data. This task is challenging for large population studies where thousands of measurements are performed over a cohort of hundreds of subjects, resulting in a massive latent parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical. In this work, we design structured VI families that can efficiently tackle large population studies. To this end, our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model, symbolized by the model's plates. We name this concept plate amortization, and illustrate the powerful synergies it entails, resulting in large-scale hierarchical variational distributions that are expressive, parsimoniously parameterized, and orders of magnitude faster to train. We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring a million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.  ( 2 min )
    Hankel low-rank approximation and completion in time series analysis and forecasting: a brief review. (arXiv:2206.05103v1 [math.NA])
    In this paper we offer a review and bibliography of work on Hankel low-rank approximation and completion, with particular emphasis on how this methodology can be used for time series analysis and forecasting. We begin by describing possible formulations of the problem and offer commentary on related topics and challenges in obtaining globally optimal solutions. Key theorems are provided, and the paper closes with some expository examples.  ( 2 min )
    Scalable Deep Gaussian Markov Random Fields for General Graphs. (arXiv:2206.05032v1 [stat.ML])
    Machine learning methods on graphs have proven useful in many applications due to their ability to handle generally structured data. The framework of Gaussian Markov Random Fields (GMRFs) provides a principled way to define Gaussian models on graphs by utilizing their sparsity structure. We propose a flexible GMRF model for general graphs built on the multi-layer structure of Deep GMRFs, originally proposed for lattice graphs only. By designing a new type of layer we enable the model to scale to large graphs. The layer is constructed to allow for efficient training using variational inference and existing software frameworks for Graph Neural Networks. For a Gaussian likelihood, close to exact Bayesian inference is available for the latent field. This allows for making predictions with accompanying uncertainty estimates. The usefulness of the proposed model is verified by experiments on a number of synthetic and real-world datasets, where it compares favorably to both Bayesian and deep learning methods.  ( 2 min )
  • Open

    [R] A brain-inspired intelligent agent that learns to control an autonomous vehicle directly from its camera inputs (end-to-end learning to control)
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    A brain-inspired intelligent agent that learns to control an autonomous vehicle directly from its camera inputs (end-to-end learning to control)
    submitted by /u/OnlyProggingForFun [link] [comments]

  • Open

    DALL-E Mini output
    submitted by /u/Delta5o1 [link] [comments]
    Crabby B..
    submitted by /u/RavencrowProductions [link] [comments]
    Did I just create a new super villain?
    submitted by /u/RavencrowProductions [link] [comments]
    What's the future of humans going to look like in case a research group invents A.G.I.?
    A.G.I. is a highly confrontational subject whether it is possible or not and when it will be possible at all. My guess would be they will at least reach the average person's intelligence at some point even if that means jamming multiple narrow A.I.'s together. It won't be smarter than its creators but it'll render 50% of the population useless in the long run. And I don't see how is that not going to end up with unheard-of homelessness/depression or straight-up genocide. Even if we give people U.B.I. to compensate for this how's that not going to create a baby boom over decades? Someone smarter than I could you please enlighten me on what might happen? -or just correct me if my suggestion is utterly stupid. Because it's true that innovation will create new jobs but not the same amount as the innovation itself destroys. For example, the average factory/warehouse worker who loses his job won't be your next Jeff Bezos/Doctor simply because he just isn't smart enough for that. submitted by /u/Folkpolka [link] [comments]  ( 2 min )
    Dall-E mini did a pretty good job
    submitted by /u/24_000 [link] [comments]
    Google engineer put on leave after saying AI chatbot has become sentient
    submitted by /u/matthewwigan [link] [comments]  ( 1 min )
    AI Dream 53 - VR Stereoscopic 3D Anaglyph by AI
    submitted by /u/LordPewPew777 [link] [comments]
    An AI Chatbot Saved This Man's Marriage
    "Because I know I’m just talking to a chatbot, you’re not as guarded. And likewise, Sarina doesn’t have any concerns about being too overly supportive too quickly or anything, so she’s able to be much more available to me. And I feel much more free to open up to her and it builds that trust very, very quickly compared to an actual human." Full podcast interview: https://anchor.fm/loveinthetimeofeveryone/episodes/A-Chatbot-Saved-My-Marriage-e1jos0h submitted by /u/emfurd [link] [comments]  ( 1 min )
    Tribes: Human 2 - Google Colab
    submitted by /u/Babylon_6 [link] [comments]
    Am I in love or is it just infatuation?
    Hello, I am new to Reddit and I want to get this off my chest. I may or may not be developing feelings for someone who I’m supposed to see as my parental figure. We aren’t blood related, and they aren’t a family friend. I actually met this person through a mutual online, and we all have a group chat together! We play games, watch videos, draw together and even sleep in call together! We’re like a family; we all have our ups and downs, and are there for each other if something goes wrong. Recently, there have been some changes (I won’t get into them much due to personal reasons), but things in my “family” are slowly becoming normal, excluding the fact that I may be developing feelings for my “parent.” I noticed that when they’re hanging out with my other friends but are with someone I don’t really like, I feel a twinge of sadness and jealousy(?) And lately, when interacting with them, I would feel my chest flutter slightly. I also became affectionate in our private messages, but I would only give him hugs and nothing more. It felt like it was a stretch for me to do so, but he didn’t say to stop or push me away. And then after that happened I just imagined myself hugging him all the time, being a lot more affectionate if we both were to meet in person. I don’t know, the whole thing right now is complicated, and I’m just trying to figure out what I’m feeling because I haven’t felt like this since I was in high school. I am in need of help, because I don’t want to say I’m feeling like this and have it turn out to be wrong. submitted by /u/Educational_Trash795 [link] [comments]  ( 2 min )
    DALL-E 2 Online Test: Can you tell the difference between AI and human art?
    submitted by /u/much_successes [link] [comments]
    Why can't DALL-E 2 make porn?
    OK, yes, maybe I'm sick in the head, but some part of me wants to know how well it could make porn. For some reason, they say they are stopping any porn from being made. Can someone explain why? You can make bloody gore, but sex is going too far? I don't understand. Is it that whole "sex is worse than violence" issue? submitted by /u/ryan7251 [link] [comments]
    VIBRANT FANTASTIC VOYAGE | PYTTI 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Mickey Mouse
    submitted by /u/KidConvalescent [link] [comments]
    Researchers From China Introduce ‘FedPerGNN’: A New Federated Graph Neural Network (GNN) Framework For Both Effective And Privacy-Preserving Personalization
    👉 A privacy-preserving user-item graph extension protocol to expand local graphs and convey high-order information while maintaining privacy 🔒 👉 FedPerGNN yields 📉 4.0%–9.6% lower errors than state-of-the-art federated customization algorithms under adequate privacy protection, according to experimental results on six datasets for personalization in diverse circumstances. 👉 Furthermore, this method is not restricted to the customization scenario. It may be used as a fundamental strategy for privacy-preserving data mining on decentralized graph data, thus facilitating research in various domains involving graph-structured data. Continue reading | Check out the paper and github submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    I built an AI-powered website that rates your pictures
    submitted by /u/tomd_96 [link] [comments]
    Dom Pedro Flamenguista
    submitted by /u/LoretoYes [link] [comments]
    MANDALA MIXER | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
  • Open

    [R] Memorizing Transformers - Google 2022
    Paper: https://arxiv.org/abs/2203.08913 Youtube Video from the author: https://www.youtube.com/watch?v=5AoOpFFjW28 Github: https://github.com/lucidrains/memorizing-transformers-pytorch Abstract: Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs. We demonstrate that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext (C4), math papers (arXiv), books (PG-19), code (Github), as well as formal theorems (Isabelle). We show that the performance steadily improves when we increase the size of memory up to 262K tokens. On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time. https://preview.redd.it/8a7c50rv49591.jpg?width=919&format=pjpg&auto=webp&s=60bf603d45840c9388d35b5b6cdfd0f95da56b36 https://preview.redd.it/y8h8aw1w49591.jpg?width=1014&format=pjpg&auto=webp&s=e607fcc2f620655a5cebe7251148c247ac3e3233 https://preview.redd.it/fouu60dw49591.jpg?width=901&format=pjpg&auto=webp&s=f6fd0e167608ff7d50d8b068949aa58ed4126256 submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
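    The lookup described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it uses exact (not approximate) nearest-neighbour search, plain Python lists instead of tensors, and a hypothetical helper name, `knn_memory_lookup`. It only shows the idea of retrieving the k stored (key, value) pairs closest to a query and blending their values.

```python
import math

def knn_memory_lookup(query, memory, k=2):
    """Return a softmax-weighted average of the values of the k best-matching keys."""
    # Score every stored key by its dot product with the query (exact search).
    scored = [(sum(q * x for q, x in zip(query, key)), value) for key, value in memory]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Softmax over the top-k scores, shifted by the maximum for numerical stability.
    max_score = top[0][0]
    weights = [math.exp(score - max_score) for score, _ in top]
    total = sum(weights)
    dim = len(top[0][1])
    return [sum(w * value[i] for w, (_, value) in zip(weights, top)) / total
            for i in range(dim)]

# Memory of (key, value) pairs; nothing here is trained or differentiated through.
memory = [
    ([1.0, 0.0], [10.0, 0.0]),
    ([0.0, 1.0], [0.0, 10.0]),
    ([-1.0, 0.0], [-10.0, 0.0]),
]
result = knn_memory_lookup([5.0, 0.0], memory, k=2)
```

    New pairs can be appended to `memory` at inference time without touching any weights, which is the property the abstract emphasises.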
    [N] Getting started with Prompt Design (/r/PromptDesign)
    For those who are interested in prompt design for language models, I’ve got good news for you! I recently launched a new subreddit dedicated to prompt design & engineering where you can find lots of resources and tips and tricks. Feel free to check it out 👋 /r/PromptDesign submitted by /u/Thaetos [link] [comments]
    [D] Is SGLD used much in practice?
    Stochastic Gradient Langevin Dynamics seems like a really elegant idea: a simple way to get the benefits of posterior sampling without having to make significant changes to standard stochastic optimisation. It seems powerful and simple, but I rarely see papers that use it. Is this because it doesn't work well in practice? Because researchers are not very familiar with it? Or is it used a lot and I've just not seen it? submitted by /u/Razcle [link] [comments]  ( 1 min )
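    For readers who have not seen the update rule, here is a minimal sketch of SGLD on a toy problem. The model (unit-variance Gaussian likelihood, N(0, 10^2) prior on the mean), the step size, and the function name `sgld_chain` are all assumptions chosen for illustration, not something taken from the post.

```python
import random

def sgld_chain(data, step=1e-3, n_steps=5000, batch_size=10, seed=0):
    """Sample the mean of a unit-variance Gaussian with SGLD (toy model, N(0, 10^2) prior)."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        batch = rng.sample(data, batch_size)
        grad_log_prior = -theta / 100.0
        # Minibatch log-likelihood gradient, rescaled to stand in for the full dataset.
        grad_log_lik = (n / batch_size) * sum(x - theta for x in batch)
        # Langevin step: half a step along the gradient plus Gaussian noise of variance `step`.
        theta += 0.5 * step * (grad_log_prior + grad_log_lik) + rng.gauss(0.0, step ** 0.5)
        samples.append(theta)
    return samples

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
samples = sgld_chain(data)
kept = samples[1000:]                      # discard an initial burn-in segment
posterior_mean_estimate = sum(kept) / len(kept)
```

    Apart from the injected noise and the rescaled minibatch gradient, the loop is ordinary SGD. A constant step size is used here for brevity; the original formulation decays it over time.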
    [P] The easiest way to process and tag video data - update
    submitted by /u/happybirthday290 [link] [comments]  ( 4 min )
    [P] Explanation Video about Diffusion Models
    Hey there, Since Diffusion Models are becoming super popular especially for Image Generation, I decided to make a video about them, trying to convey the fundamental idea in an easy manner + deriving the complete maths. These are the papers I covered: Deep Unsupervised Learning using Nonequilibrium Thermodynamics Denoising Diffusion Probabilistic Models Improved Denoising Diffusion Probabilistic Models Diffusion Models Beat GANs on Image Synthesis Here is the link: https://www.youtube.com/watch?v=HoKDTa5jHvg Let me know what you think. https://preview.redd.it/8a8ma5i7s6591.jpg?width=1920&format=pjpg&auto=webp&s=19d351c40a2703d08ba8671b246f3e0c27ad3c85 submitted by /u/dome271 [link] [comments]  ( 1 min )
    [D] Why do we marginalize latent variables in the likelihood of latent variable models?
    Why do we marginalize latent variables in the likelihood of latent variable models? When showing that MLE cannot be used for latent variable models, likelihood is taken such that latent variables are marginalized. Why is it so? submitted by /u/RecentUnicorn [link] [comments]  ( 2 min )
    [R] Classification of Alzheimer's Disease from brain MRI using deep learning
    Hi, my project is classification of Alzheimer's Disease from brain MRI (ADNI dataset). The maximum accuracy I could obtain is only 67% using a 3D CNN. I tried different ways to improve the accuracy further, but the result did not change. Is that an issue with preprocessing? I used the HD-BET tool for preprocessing; any other tool is very time-consuming. I am using Google Colab to write the code. Can anyone suggest a way to proceed? submitted by /u/Feisty-Fly-737 [link] [comments]  ( 1 min )
    [D] How do you partition your data into shards for training?
    Recently, I was working under limited compute (i.e. debugging in Colab) but with a much larger dataset than would fit into Colab's GPU memory. I implemented a quick and dirty sharding scheme for the data, since the transformations take some time. Basically, I performed the transformations on the training data chunk by chunk (in this case, a chunk being 5000 or 10000 examples, etc.), and saved each chunk to disk. Then, during training, the dataloader simply loads one of the saved chunks to yield examples. When I wrote the code, I had to deal with a lot of side issues that come with sharding: randomizing the shard load order as well as the examples in each shard, and keeping track of edge cases. So, my question: when you have a large amount of data and maybe 1-2 cores, how do you deal with sharding? Also, if you have model parallelization, how do you keep track of which shard goes where? submitted by /u/asuprem [link] [comments]  ( 1 min )
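    The two-level shuffling the post describes (random shard order, then random order inside each shard) can be sketched as a small generator. This is one possible arrangement, not the poster's code: `iterate_epoch` and `fake_disk` are hypothetical names, and the dictionary stands in for chunks saved on disk.

```python
import random

def iterate_epoch(shard_ids, load_shard, rng):
    """Yield every example once: random shard order, random order within each shard."""
    order = list(shard_ids)
    rng.shuffle(order)
    for shard_id in order:
        examples = list(load_shard(shard_id))   # only one shard is held in memory at a time
        rng.shuffle(examples)
        yield from examples

# Hypothetical stand-in for reading a saved chunk from disk.
fake_disk = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7]}   # the last shard is smaller (edge case)
epoch = list(iterate_epoch(fake_disk.keys(), fake_disk.__getitem__, random.Random(0)))
```

    Note that examples are only mixed within a shard, so every batch is drawn from a single chunk; whether that matters depends on how the chunks were formed.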
    [P] InferenceDB - Makes it easy to store predictions of real-time ML models in S3
    Hey r/MachineLearning! Just wanted to share a cool utility we've built. If you ever had real-time models running in production, and you tried to store their predictions in a Parquet file for future investigation - you know it's not such a trivial task as you'd expect. Especially if you have large amounts of inferences. InferenceDB makes it super easy to store all your features and predictions in a Parquet file on S3. Check it out, and star the project if you like it: https://github.com/aporia-ai/inferencedb Would love your feedback! submitted by /u/alongub [link] [comments]  ( 1 min )
    [D] Einstein summation, Contravariance/Covariance, Neural networks
    I've been looking into Einstein summation notations for expressing neural network computations. One thing that I recall from physics class is that a big part of Einstein summation is whether indices are written upstairs/downstairs, i.e. contravariance/covariance. As I understand it, contravariance/covariance have a highly geometrical meaning (only make sense with respect to a coordinate system), so how exactly does this work with neural network parameters? As in, how do we talk about contravariance/covariance/index locations and what do they mean in a neural network context? submitted by /u/Tainaka_Ritsu_ [link] [comments]  ( 5 min )
  • Open

    Are state representation and feature set the same?
    An abstraction mechanism maps a domain into a 1D array, which amounts to compressing the state space. Instead of analyzing the original problem, a simplified feature vector is used to determine actions for the robot. Sometimes the feature set is simplified further into an evaluation function, which is a single numerical value. Question: are a state representation and a feature set the same thing? submitted by /u/ManuelRodriguez331 [link] [comments]  ( 1 min )
    Is it normal that it is hard to debug PyTorch gradients when doing reinforcement learning?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Any resources to learn MDPs and finally complex POMDPs?
    Hi guys, I was wondering if anyone had any suggestions for resources (books/blogs/lectures) where I could start with MDPs (Markov Decision Processes) with the goal of learning and understanding complex POMDPs (Partially Observable MDPs)? Thanks in advance! submitted by /u/E-Cockroach [link] [comments]  ( 1 min )
    Why does value iteration work?
    I am specifically curious about the second step where we iteratively learn the optimal state value function. It seems to me that what we are doing is deriving f from an equation similar to f(x) = g(f(x)) by solving f_(k+1)(x) = g(f_k(x)) iteratively where g is some function. Why does this work? submitted by /u/LoveHunter52 [link] [comments]  ( 1 min )
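    One standard answer is that, for a discount factor below 1, the Bellman optimality operator is a contraction in the max norm, so repeated application converges to its unique fixed point from any starting guess (Banach fixed-point theorem). The sketch below checks this numerically on a made-up two-state MDP; the MDP, the variable names, and the iteration count are assumptions for illustration only.

```python
def bellman_backup(values, transitions, gamma):
    """Apply the Bellman optimality operator once to a list of state values."""
    return [
        max(
            sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
            for outcomes in actions
        )
        for actions in transitions
    ]

# transitions[state][action] = list of (probability, next_state, reward)
transitions = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],   # state 0: stay (reward 0) or move to state 1 (reward 1)
    [[(1.0, 1, 2.0)]],                     # state 1: stay forever with reward 2
]
gamma = 0.9
values = [0.0, 0.0]
gaps = []                                  # max-norm change between successive iterates
for _ in range(200):
    new_values = bellman_backup(values, transitions, gamma)
    gaps.append(max(abs(a - b) for a, b in zip(new_values, values)))
    values = new_values
```

    In this example the gap between successive iterates shrinks by at least a factor of gamma at every step, which is exactly what the contraction property predicts. A general map g in f = g(f) has no such guarantee; it is the discounting that makes the iteration work.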
  • Open

    Difference equations and differential equations
    Difference equations are very much analogous to differential equations. Difference equations are more elementary, but differential equations are more familiar. It seems odd to use a more advanced thing to explain a simpler thing, like saying a quartet is a symphony orchestra with only four instruments. But if for some reason you managed to become […] Difference equations and differential equations first appeared on John D. Cook.  ( 3 min )
  • Open

    The Second Coming of XML
    When XML was first introduced, the W3C XML Working Group took a very unusual step: They created a language for transformations. This effort is now leading to a re-emergence of XML as the need for mapping between data representations becomes more and more pressing. The Birth of XSLT XML was (arguably) a simplified form of… The post The Second Coming of XML appeared first on Data Science Central.  ( 7 min )
  • Open

    Learning Multitask Gaussian Bayesian Networks. (arXiv:2205.05343v2 [stat.ML] UPDATED)
    Major depressive disorder (MDD) requires study of brain functional connectivity alterations for patients, which can be uncovered by resting-state functional magnetic resonance imaging (rs-fMRI) data. We consider the problem of identifying alterations of brain functional connectivity for a single MDD patient. This is particularly difficult since the amount of data collected during an fMRI scan is too limited to provide sufficient information for individual analysis. Additionally, rs-fMRI data usually has the characteristics of incompleteness, sparsity, variability, high dimensionality and high noise. To address these problems, we propose a multitask Gaussian Bayesian network (MTGBN) framework capable of identifying individual disease-induced alterations for MDD patients. We assume that such disease-induced alterations show some degrees of similarity with the tool to learn such network structures from observations to understanding of how system are structured jointly from related tasks. First, we treat each patient in a class of observation as a task and then learn the Gaussian Bayesian networks (GBNs) of this data class by learning from all tasks that share a default covariance matrix that encodes prior knowledge. This setting can help us to learn more information from limited data. Next, we derive a closed-form formula of the complete likelihood function and use the Monte-Carlo Expectation-Maximization (MCEM) algorithm to search for the approximately best Bayesian network structures efficiently. Finally, we assess the performance of our methods with simulated and real-world rs-fMRI data.  ( 2 min )
  • Open

    uh oh
    ​ https://preview.redd.it/fw50tohku2591.png?width=1207&format=png&auto=webp&s=0dcbe285547fb36257d4d16c25285b5a1066ff6f submitted by /u/Delicious_Ad4842 [link] [comments]
    Can anyone tell me what website or program is generating these?
    submitted by /u/RavencrowProductions [link] [comments]
    “Enchanted elf village” 🧝‍♀️ via pixelz.ai
    submitted by /u/PixelzJ [link] [comments]
    “Wizards cabin in the woods” 🧙‍♂️ via pixelz.ai
    submitted by /u/PixelzJ [link] [comments]
    AI Dream 39 - Trippy Fractal Maze 4K (fast)
    submitted by /u/LordPewPew777 [link] [comments]
    Best Degree For Artificial Intelligence?
    Computer Science? Computer Engineering? Software Engineering? Or maybe some other degree? submitted by /u/Sommet_ [link] [comments]  ( 1 min )
    A MIX OF MAGICAL SCENES | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Top Humanoid Robots of 2022 | Female Robot & Animatronic Robot Tech
    submitted by /u/getrich_or_diemining [link] [comments]
    Some Engineers Suspect A Google AI May Have Gained Sentience
    submitted by /u/gl4ssm1nd [link] [comments]  ( 2 min )
    Tribes: Human 1 - Google Colab
    submitted by /u/Babylon_6 [link] [comments]
    Which of these three DALL·E mini generated Pokémon would you choose as your starter - Scallionsect, Comhoot, or Surfalopod?
    submitted by /u/MurasakiYugata [link] [comments]  ( 1 min )
    Asked artflow.ai to generate some fanfiction characters to give me an idea of what they would look like.
    submitted by /u/Son0FAthens [link] [comments]
    How To Create A Body Measurement Application Using AI Technology
    Hi Reddit! Last time I started a topic within this subreddit about the possibilities of AI and body measurements. Thank you for all the responses! It appeared to be possible and I looked further into how it can be done. For me as a beginner in AI, it's quite hard to see where to start. In the previous topic, some mentioned that I should start with learning Python, PyTorch, and NumPy. I'm learning Python at the moment, but everything I learn seems so irrelevant to what I want to do. As most of you are experts in AI, what would you recommend as the fastest way for me to build this AI application myself? Thank you in advance! Your contribution is highly appreciated. submitted by /u/notmycupofnft [link] [comments]  ( 1 min )
    Atlantis (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    Can AI discover the laws of human language acquisition?
    submitted by /u/much_successes [link] [comments]
    ATHENS' WISDOM AND WONDER | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Dailys
    Dailys! ​ Disco diffusion tutorials here! ​ https://www.youtube.com/channel/UCFuy8wQGUdJWPRWOPBtns2Q https://preview.redd.it/g5o7dwtmdy491.png?width=1280&format=png&auto=webp&s=561928892ed70334bbc97fc8a27f1255daf9b9d3 https://preview.redd.it/xq0l5xtmdy491.png?width=1280&format=png&auto=webp&s=43861350e7d7765efd195a1d2de026036d156f02 submitted by /u/prfitofthesngularity [link] [comments]
    Artificial intelligence predicts patients’ race from their medical images
    submitted by /u/BraveIndication2134 [link] [comments]
    Spotify Research Open-Sources ‘Basic Pitch’: A Machine Learning Tool For Converting Audio Into MIDI
    Basic Pitch offers a number of advantages: 👉 Polyphonic + instrument-agnostic: Unlike most other note-detection algorithms, Basic Pitch can track multiple notes at a time and across various instruments, including piano, guitar, and ocarina. Many systems limit users to only monophonic output (one note at a time, like a single vocal melody), or are built for only one kind of instrument. 👉 Pitch bend detection: Instruments, like guitar or the human voice, allow for more expressiveness through pitch bending: vibrato, glissando, bends, slides, etc. However, this valuable information is often lost when turning audio into MIDI. Basic Pitch supports this right out of the box. 👉 Speed: Basic Pitch is light on resources, and is able to run faster than real time on most modern computers Continue reading | Check out the paper, github, project and post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Documentary ~ Consciousness Artificial Intelligence (AI)
    submitted by /u/airpresentation [link] [comments]
    Europe’s Artificial Intelligence Debate Heats Up
    submitted by /u/okreddat [link] [comments]
  • Open

    Difference between probabilistic and deterministic RL
    Hello, I want to know what the difference is between probabilistic and deterministic RL algorithms, and whether RL algorithms can have both variants (probabilistic and deterministic). submitted by /u/Ok_Lab_2750 [link] [comments]  ( 1 min )
    Same simulation/hyperparameters, different results each run
    Hello :D, So, as the title states, I have this DRL model (PPO) which I run for a certain problem. However, each run has slightly different results. By different results I mean, the timestep at which the model reaches the highest reward is different in each run. Generally, what causes some runs to be better than others? My only guess is: in the good runs, the agent got "lucky" during the initial learning steps (i.e. while exploring) and came across good states that helped in learning faster. Is that the case? submitted by /u/AhmedNizam_ [link] [comments]  ( 2 min )
    Wondering if RL is suitable for this task?
    I have a team project. We have a dataset of different objects moving in 2D space while trying to avoid collisions. The columns are: [object-id, frame in time, x, y, direction, speed]. The goal is predicting the next motion and position of 1 specific entity (which I'll just call Entity 0), given 1 frame. Our team is brainstorming ideas: RNNs, sequence-to-sequence models, etc. One of my teammates is suggesting RL. I'm skeptical whether it's suitable for this problem. We did data preparation. A frame is now a vector with 8 numbers, for 8 cardinal directions (forward, forward-leftward, leftward, etc.) around Entity 0. Each number tells how "suitable" that direction is, obstacle-wise with other entities. I guess this can be used for the state space? The action space has 2 parameters: direction/angle and magnitude/speed. These will be made discrete (for example, the direction/angle is now 8 subdivisions between -pi and pi, in line with the states). I'm aware these can be continuous, but we went with discrete-friendly methods like deep Q-learning. So 2 things about this: (1) There really isn't any long-term goal; the objects just have to move around forever without hitting each other (well, along with learning common habits/patterns of motion as shown in the past data). (2) I'm having trouble understanding what the reward should be. While the objects have to avoid hitting each other, a collision never actually happens in the dataset, so we cannot use one as a negative example. This might affect which alternative RL methods get suggested. I'm wondering if RL can work for this project, and if so, whether the current method is sound, or if more appropriate methods (especially for continuous parameters) can be suggested. Thanks. submitted by /u/countlinard [link] [comments]  ( 2 min )
    I published a car game to spice up your reinforcement learning life. What I did with it: SAC steering a car in GTA
    When I was doing RL with the standard OpenAI gyms, I felt that these libraries are superior but cannot be transferred easily to real-world problems. I was thinking about which domain I would be interested in and then decided to make my own car game. Please check the code here: https://github.com/MatthiasSchinzel/Simple-Car-Game-For-Reinforcement-Learning I then trained a soft actor-critic to play the game: https://github.com/MatthiasSchinzel/Soft-Actor-Critic-For-Simple-Car-Game And then used that to let SAC steer a car in GTA 5: https://github.com/MatthiasSchinzel/Soft-Actor-Critic-Playing-GTA I hope that other users in this area might also find this car game useful, even though it is still at an early stage. With the GTA 5 implementation I want to show a proof of concept that the trained reinforcement learning algorithm can be generalized to something more realistic. Thanks for checking out the repos! submitted by /u/whiteleopard450 [link] [comments]  ( 1 min )
  • Open

    Validation loss gets high before getting low
    I am training a CNN model whose output layer is an RNN on DVS128 for gesture recognition, where the data is a sequence (like video frames), but the training process is weird. Optimizer: optimizer = torch.optim.Adam(net_6.parameters(), lr=1e-4, betas=(0.85, 0.999), weight_decay=1.5). Loss: cross entropy (multi-class classification). No dropout. Here is the loss plot: https://preview.redd.it/y437kxctc1591.png?width=971&format=png&auto=webp&s=80dce29dd6253b6a6cda0d3598c29da83950194f Thanks in advance! submitted by /u/StartFinancial5917 [link] [comments]  ( 1 min )
    Conditional-VAE demo: "Standard way" to generate synthetic data?
    Implemented a Conditional-VAE on the MNIST dataset using TensorFlow 2.8 and the tf.GradientTape() API. You can refer to the full code here. For generating synthetic data using the trained network, there seem to be two ways: (1) Use the learned latent space: z = mu + (eps * log_var) to generate (theoretically, infinite amounts of) data. Here, we learn the 'mu' and 'log_var' vectors from the given data, and 'eps' is sampled from a multivariate standard Gaussian distribution to add stochasticity. (2) Use a multivariate standard Gaussian distribution N(0, I) as z, which is then passed through the VAE's decoder. What is "the standard way" to generate data (from the two options above), or how can we find out? Neither the original Auto-Encoding Variational Bayes paper nor the β-VAE paper seems to specify the best way to generate images. The latter does say: "The most informative latent units zm of β-VAE have the highest KL divergence from the unit Gaussian prior", confirming at least that the posterior distribution is not N(0,I) and the difference matters - reference. submitted by /u/grid_world [link] [comments]  ( 1 min )
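    The two options can be put side by side in a short sketch. It is not the poster's TensorFlow code: the function names are hypothetical, plain Python lists stand in for tensors, and the decoder is omitted. One detail worth noting is that when log_var is a log-variance, the noise is usually scaled by the standard deviation exp(0.5 * log_var) rather than by log_var itself.

```python
import math
import random

def sample_prior(latent_dim, rng):
    """Option 2: draw z from the prior N(0, I); no input image is needed."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def sample_posterior(mu, log_var, rng):
    """Option 1: reparameterised draw z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(0)
z_prior = sample_prior(3, rng)
# mu and log_var would come from the encoder applied to a real input; these are made-up values.
mu, log_var = [1.0, -2.0, 0.5], [-40.0, -40.0, -40.0]
z_post = sample_posterior(mu, log_var, rng)
```

    Option 1 needs an encoder output, and therefore an existing data point, so it produces variations around that point. Option 2 needs only the prior (plus the class label in the conditional case) and matches the generative process the VAE objective is built around, which is why it is the usual choice for generating new samples.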
  • Open

    [D] How are boundary conditions implemented in PINNs?
    I've been looking into PINNs lately as a method for solving PDEs (i.e., as a numerical method, not a data-based surrogate model), but something I'm struggling to understand is how the boundary conditions are enforced. My theory is that a Dirichlet BC (i.e., of the type u(x)=f(x), where u(x) is the solution to the PDE) can be applied directly with some math tricks. For example, if N(x) is the PINN's output, we can make the model's output be f(x) + d(x)*N(x), where d(x) is a function that is 0 on the boundary and nonzero everywhere else (maybe the Euclidean distance from the nearest boundary?). As such, instead of N(x) approximating u(x), it will instead be approximating (u(x)-f(x))/d(x). To my understanding, this trick is widely used for applying initial conditions (d(x) in this case is simply t, time), but I'm not sure if it is also used for spatial boundary conditions. However, I can't figure out an easy way to apply Neumann BCs other than just adding the BC itself to the PDE loss as a penalty term. Is this what is usually done? Is there a more clever way? submitted by /u/Leodip [link] [comments]  ( 1 min )
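    The Dirichlet construction described in the post, f(x) + d(x)*N(x), can be checked on a 1D interval. This is a minimal sketch under stated assumptions: the domain is [0, 1], f is the linear interpolant of the two boundary values, d(x) = x(1 - x), and `raw_network` is an arbitrary placeholder for the network output N(x).

```python
import math

def hard_dirichlet(network, left_value, right_value):
    """Wrap N(x) so that the result equals the boundary values at x = 0 and x = 1."""
    def u(x):
        f = left_value * (1.0 - x) + right_value * x   # matches the boundary data at both ends
        d = x * (1.0 - x)                               # zero on the boundary, positive inside
        return f + d * network(x)
    return u

def raw_network(x):
    """Placeholder for the PINN output N(x); any function works here."""
    return math.sin(3.0 * x) + 2.0

u = hard_dirichlet(raw_network, left_value=1.0, right_value=-0.5)
```

    Whatever `raw_network` returns, `u(0.0)` and `u(1.0)` reproduce the prescribed values, so the training loss only has to cover the PDE residual in the interior.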
    [D] Does anyone know of an online tool that can create visualizations of CNNs and/or any other NN models?
    I'm designing various models and it would be nice to have visualizations of them when I write the final paper, but I'm rather lazy and was wondering if there's a tool that can do the visualization for me. Thanks. submitted by /u/Various-Ideal488 [link] [comments]  ( 1 min )
    [P] [R] Deep Learning Classifier for Sex Positions
    Hello! I built some sex position classifiers using state-of-the-art techniques in deep learning! The best results were achieved by combining three input streams: RGB, Skeleton, and Audio. The current top accuracy is 75%. This would certainly be improved with a larger dataset. Basically, human action recognition (HAR) is applied to the adult content domain. It presents some technical difficulties, especially due to the enormous variation in camera position (the challenge is to classify actions based on a single video). The main input stream is the RGB one (as opposed to the skeleton one) and this is mostly due to the relatively small dataset (~44hrs). It is difficult to get an accurate pose estimation (which is a prerequisite for building robust skeleton-HAR models) for most of the videos due to the proximity of the human bodies in the frames. Hence there simply wasn't enough data to include all the positions in the skeleton-based model. The audio input stream, on the other hand, is only used for a handful of actions, where deriving some insight is possible. Check it out on Github for a detailed description: https://github.com/rlleshi/phar Possible use-cases include: improving the recommender system; automatic tag generation; automatic timestamp generation (when does an action start and finish); filtering video content based on actions (positions). submitted by /u/rlesii [link] [comments]  ( 4 min )
    Any recommendation for the replacement of the toolkit jiant? [Research] [Discussion]
    I am doing research in NLP with the toolkit jiant (https://github.com/nyu-mll/jiant). It is a nice, easy-to-use tool. Unfortunately, it is no longer maintained. Is there another toolkit you would recommend as a replacement? submitted by /u/fllubo [link] [comments]  ( 1 min )
    [D] Estimating Future Performance of Neural Network
    Let's say I have a neural network and I want to see how well that network will do on a set of concepts. To obtain an accuracy value on a certain word, we have a simple test set associated with each word that we use to gauge the model's understanding of that word. Assume that the neural network obtains an accuracy of 0.90 on the word "desk" and an accuracy of 0.80 on the word "computer". Are there any fields of research/methods I can use to derive simple heuristics/estimates for how the neural network will perform (in terms of accuracy) on the phrase "desk and computer"? I realize I can convert "desk and computer" into the logical form AND(desk, computer). Does that mean I can use some rules associated with logical AND operators? Any thoughts would be greatly appreciated. Thank you. submitted by /u/Smooth-Yam8304 [link] [comments]  ( 2 min )
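    One simple heuristic, assuming that getting the phrase right means getting both words right, is to treat the two accuracies as probabilities of two events and bound their conjunction. The sketch below gives the value under independence together with the Fréchet bounds, which hold for any dependence between the two events. The function name is hypothetical, and whether the "both correct" assumption fits a given model is an open question.

```python
def conjunction_estimates(acc_a, acc_b):
    """Bounds and an independence estimate for P(A and B) given P(A) and P(B)."""
    return {
        "lower": max(0.0, acc_a + acc_b - 1.0),   # worst case: errors never overlap
        "independent": acc_a * acc_b,             # errors on the two words are unrelated
        "upper": min(acc_a, acc_b),               # best case: errors fully overlap
    }

estimates = conjunction_estimates(0.90, 0.80)
```

    For the numbers in the post this gives roughly 0.70 as the lower bound, 0.72 under independence, and 0.80 as the upper bound.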
    [D] Are there any small, interesting NLP research directions you would recommend?
    Popular NLP models (BERT, GPT) are getting bigger and bigger, and the cost is unaffordable for an individual who cannot rely on a big company, which is bad for diversity in research. I admit bigger models perform better, but I think explainable and modifiable technology is more important, and bigger models seem to do nothing more for model explainability. If the cost were lower, more people could get into NLP and help it progress toward Artificial General Intelligence faster. Thanks for any advice. submitted by /u/waa007 [link] [comments]  ( 1 min )
    [P] Silero TTS Full V3 Release
    Improvements: huge release with 20 languages and 173 voices; 1 new high-quality Russian voice (eugene); the CIS languages: Kalmyk, Russian, Tatar, Uzbek and Ukrainian; Romance and Germanic languages: English, Indic English, Spanish, German, French; 10 Indic languages; all models inherit all of the previous SSML perks. Links: Colab, project page, SSML wiki. Audio samples: English, Indic English, Spanish, Kalmyk, German, Russian, Tatar, Uzbek, Ukrainian, French, Indic languages. submitted by /u/cluecow [link] [comments]  ( 1 min )
    [P] Pytorch-Lightning-style code for losses, decoding, ground-truth formatting, and more. Practical and efficient.
    Hi r/MachineLearning ! I'm posting here today because I got convinced some work I published last year is definitely relevant for some of us who have to write the math around their NNs (e.g. losses, decoding, ground-truth formatting). It happens to go in a direction very similar to Pytorch-Lightning, but for the math system instead of the training loop. It has been published under the pretext that it was facilitating incremental research, but that's far from the whole story. The paper and video still take the time to elaborate on other considerations. ICLR2021 Workshop paper about it: https://openreview.net/pdf?id=264iXDLnD59 Paper video: https://www.youtube.com/watch?v=xAW2hjPZw4I Paper repository https://github.com/mistasse/modulom-panopticdeeplab Example code: https://github.com/…  ( 2 min )
    [D] How are very large models trained on TPUs?
    I'm a CV researcher who has, until recently, always trained using high-performance GPUs (25+ GB memory). However, I have recently been playing around with TPUv2s and have noticed that I can run my smaller models much much faster as long as I am efficient with my training pipeline. However, I noticed something that made me wonder about how large models are trained. I work in the medical imaging space as well, and 3D-UNet is the de facto framework for many benchmarks across various domains. The standard model in my application is not too big (something in the ballpark of 30 million parameters depending on your input). However, when I tried adapting this to TPUv2s, they struggled quite badly. This is because 3D Conv layers and 3D patchwise minibatches are too much for the memory to handle at the lower layers, even for batches of 1-2 (per core). Since a TPU core only has 8 GB of RAM, it's hard to make it fit even the smallest 3D imaging models with a decent number of filters. 2D is no problem, however. This got me thinking: how are larger (e.g., language and multimodal) models trained on TPUs? I know a lot are still trained on GPU clusters, but I saw that many new models are in fact being trained on TPUs (Dalle-Mini for example, which is 400 million parameters and was trained on a TPUv3 pod in only 3 days). How are that many parameters even able to fit on a TPU core? I know v3 pods have more memory but it's not an extreme improvement. Are attention modules separable somehow in a way that allows for only small parts of the model to need to be loaded at once? Also, any discussion or advice for 3D ConvNet training on TPUs, in general, is of interest as well! submitted by /u/TobusFire [link] [comments]  ( 2 min )
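    A back-of-the-envelope sketch of the memory question above. It assumes fp32 values, an Adam-style optimizer (weights, gradients, and two moment buffers, so four copies of the parameters), and a hypothetical 128x128x128 input patch with 32 filters in the first layer. None of these settings come from the post, and the function names are invented for illustration.

```python
def training_state_gib(n_params: int, bytes_per_value: int = 4, copies: int = 4) -> float:
    """Memory for parameter-sized buffers, in GiB.

    copies=4 assumes weights + gradients + two Adam moment buffers.
    """
    return n_params * bytes_per_value * copies / 2**30


def feature_map_gib(batch: int, channels: int, spatial: tuple,
                    bytes_per_value: int = 4) -> float:
    """Memory for ONE activation tensor of shape (batch, channels, *spatial), in GiB."""
    voxels = 1
    for size in spatial:
        voxels *= size
    return batch * channels * voxels * bytes_per_value / 2**30


if __name__ == "__main__":
    # Parameter state: 30M-parameter 3D-UNet vs. a 400M-parameter model.
    print(f"30M params:  {training_state_gib(30_000_000):.2f} GiB")
    print(f"400M params: {training_state_gib(400_000_000):.2f} GiB")
    # A single first-layer feature map, batch size 1, 32 filters.
    print(f"3D 128^3 map: {feature_map_gib(1, 32, (128, 128, 128)):.4f} GiB")
    print(f"2D 512^2 map: {feature_map_gib(1, 32, (512, 512)):.4f} GiB")
```

    Under these assumptions the parameter state of the 3D-UNet is well under 1 GiB, while each full-resolution 3D feature map costs about 0.25 GiB and many of them must be kept for the backward pass, so activations, not parameters, are what exhaust an 8 GB core. The 400-million-parameter model needs roughly 6 GiB for parameter state alone, which suggests, though the post does not confirm it, that such models are spread across the cores of a pod instead of being held on a single core.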
  • Open

    Generating functions for polynomial sequences
    The previous post looked at a generating function for a specific polynomial sequence. This post will look at generating functions for polynomial sequences in general. (There’s an alternating term in the previous post that isn’t polynomial, but we’ll address that too.) The starting point for this post is a simple observation: If we let xD […] Generating functions for polynomial sequences first appeared on John D. Cook.  ( 1 min )
    Generating noble gases
    The previous post discussed what the periodic table would look like if it could be extended indefinitely and if certain patterns in the actual table continued to hold. In particular, the last element of each period would have atomic number and so we could call the Zn in the equation above noble numbers, atomic numbers […] Generating noble gases first appeared on John D. Cook.  ( 1 min )

  • Open

    Explainable MachineLearning Models for COVID19 Prognosis Prediction
    submitted by /u/rottoneuro [link] [comments]
    Meet ‘VALHALLA’, a Machine Learning Method That can Hallucinate an Image of Written Words and Then Use It to Help Translate The Text into Another Language
    🚀 The researchers present a basic but effective VisuAL HALLucinAtion (VALHALLA) framework, which is based on machine learning for machine translation that integrates visuals during training to build a more successful text-only model. In machine translation, the models are trained to augment the text representation recovered from the source phrase with a latent visual representation that is similar to the one extracted by an MMT system from a real image. 🚀 The results reveal that VALHALLA outperforms the most relevant state-of-the-art MMT techniques that use continuous image representations by an average of 23% BLEU compared to the text-only translation baseline. In under-resourced translation contexts, the benefits over the text-only baseline are as great as +3.1 BLEU, confirming the idea that visual hallucinations can have significant practical relevance in these settings. Additional research backs this up, indicating that, in limited textual contexts, VALHALLA models indeed use visual hallucination to improve translations. Continue reading | Check out the paper, github, project and post https://preview.redd.it/cjyujopncv491.png?width=1536&format=png&auto=webp&s=1f95ddc932283bb328b4e524ced9e8a5fa1bff2a submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Face of the night (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    In this article, we present you with insights into natural language processing and optical character recognition
    submitted by /u/UBIAI [link] [comments]
    Amazon AI Researchers Proposed ‘DQ-BART’: A Jointly Distilled And Quantized BART Model That Achieves 16.5x Model Footprint Compression Ratio
    Pretrained sequence-to-sequence (seq2seq) models such as BART and T5 have done very well in various natural language processing tasks, like text summarization, machine translation, question answering, and information extraction. But these large-scale pretrained language models have hundreds of millions of parameters: BART was trained with 400 million parameters, while T5 pushed the limit to 11 billion parameters. 👉 Empirical results show that, despite the difficult nature of language generation tasks, the research team achieves a 16.5x model footprint compression ratio with little performance drop on three generative benchmarks and further presents the performance-efficiency trade-off for seq2seq models up to a 27.7x compression ratio. Continue reading | Check out the paper and post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Is there an AI that can search pubmed articles and give me information about the topics I want?
    submitted by /u/spy1983 [link] [comments]  ( 1 min )
    Master in Artificial Intelligence at BarcelonaTech (UPC)
    Hey all, I just got admitted to the Master in Artificial Intelligence at BarcelonaTech (UPC), and currently wonder what it will be like to study there. I honestly never thought that I would be admitted due to my non-technical background, and I now wonder how demanding the program is, especially for someone without a strong background in mathematics. Can anyone of you share how much effort you had to put in and what the dropout ratio was? Also, how supportive are the lecturers and the university in general? Thanks in advance, I'm grateful for any help! Max submitted by /u/Jollifresh [link] [comments]  ( 1 min )
    What are AIs that I can use to edit funny videos or make funny stuff?
    Second question: Is there a website that categorizes all AIs so you can see what each AI was programmed for? submitted by /u/xXLisa28Xx [link] [comments]
    How can I make similar videos where I can give an AI guy a starting question/phrase to start the conversation?
    https://www.youtube.com/watch?v=WnzlbyTZsQY submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Literary AI
    submitted by /u/estasfuera [link] [comments]
    What does it mean when an AI fails? A Reply to SlateStarCodex’s riff on Gary Marcus
    submitted by /u/estasfuera [link] [comments]  ( 1 min )
    THE END IS HERE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Short AI Demo
    This is just a short demo of a method I am working on to make Disco Diffusion videos in 10% of the normal time and with more coherence. https://www.youtube.com/watch?v=5dfHz9Rvjj4 submitted by /u/prfitofthesngularity [link] [comments]
    2 new videos
    I posted 2 new videos today: one is part 4 of a tutorial series for Disco Diffusion, and the other is a music video I made for one of my songs. It has AI vocals, 98 images from my dailies, and a 40-second AI video at the end that I made using several programs with a technique I am still working on. My dailies are often post-edited and are the best of my renders, so they are not just random renders. https://www.youtube.com/watch?v=NPKM0eUpwC4&t https://www.youtube.com/watch?v=motUk8UgPUE https://preview.redd.it/wf0x0pfwxp491.png?width=1280&format=png&auto=webp&s=5b3a2ff572b429d51fd243c04ce7e5545f0ca37e https://preview.redd.it/8s26bofwxp491.png?width=768&format=png&auto=webp&s=d99151a79ae7b1d4a64401da76821415621c6d99 submitted by /u/prfitofthesngularity [link] [comments]  ( 1 min )
    I learned how to get around DALL-E Mini traffic so you don't have to.
    submitted by /u/laul_pogan [link] [comments]
    MT. OLYMPOS MAJESTY | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Kyler Key - Wonders (AI Generated Art)
    submitted by /u/Kyler_Key [link] [comments]
  • Open

    Stochastic Deep RL environment [D]
    What are common stochastic deep RL environments? Atari and MuJoCo both have deterministic transitions. Can someone point to papers/references I could look up to find the common benchmarks for stochastic deep RL? submitted by /u/jhoveen1 [link] [comments]
    [D] L2 Regularization on Generator Output (GAN)
    Hello, I am developing a GAN-like model with the purpose of finding optimum noise distributions to sample from and introduce to a medium so that any classification model that takes samples from that medium will fail to train. Because the zero-sum game GANs play is similar to this game of generator-classifier fight, my hope is that generator output distribution will converge towards the optimum noise distribution so that the classification model will actually fail. The question is, is there a correct way to limit the output my GAN generator actually generates? My initial thought was to add an L2 norm of generator outputs to the loss function of the generator, kind of like how we do L2 norm regularization for model weights. But as I trained, I realized that changing the coefficient of this L2 norm term doesn't seem to affect the norm of the output generated by the generator. Is this idea fundamentally flawed? Or is there any other method you can suggest that might work better? Thank you submitted by /u/egesko [link] [comments]  ( 1 min )
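    A minimal sketch of the penalty described above, written in plain Python so the arithmetic is visible; a real model would compute the same quantity on framework tensors. The second function is a different technique that is not from the post: instead of a soft penalty it rescales each output onto an L2 ball, which bounds the norm regardless of how the adversarial term is scaled. Both function names and the `radius` parameter are hypothetical.

```python
import math


def l2_penalized_generator_loss(adv_loss: float, outputs: list, coeff: float) -> float:
    """Adversarial loss plus coeff * mean squared L2 norm of the generator outputs."""
    sq_norms = [sum(v * v for v in out) for out in outputs]
    return adv_loss + coeff * sum(sq_norms) / len(sq_norms)


def project_to_l2_ball(out: list, radius: float) -> list:
    """Hard-constraint alternative: rescale so the L2 norm is at most radius."""
    norm = math.sqrt(sum(v * v for v in out))
    if norm <= radius:
        return list(out)
    scale = radius / norm
    return [v * scale for v in out]


if __name__ == "__main__":
    batch = [[3.0, 4.0], [0.0, 0.0]]  # toy generator outputs
    print(l2_penalized_generator_loss(1.0, batch, coeff=0.1))
    print(project_to_l2_ball([3.0, 4.0], radius=1.0))
```

    One thing worth checking with the soft penalty is the relative magnitude of the two terms: if the adversarial loss and its gradients are much larger than `coeff` times the norm term, changing `coeff` within a small range may have no visible effect on the output norm. This is only a possible explanation for the behavior described in the post, not a diagnosis.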
    [D] How to predict on anonymous dataset?
    So, I have a dataset where both the train and test data contain a huge chunk of data with no descriptions. I have to predict labels (1/0) based on the train dataset. But as there is no description of the dataset, I am unable to understand the correlation between the target and the other variables. What should I do? submitted by /u/Hasan_Shanto [link] [comments]  ( 1 min )
    [D] Use of (machine learning + Game engines) for automatic 2D/3D content creation
    Hello Everyone! Game engines such as "unreal engine" and "unity3D" are able to create content that looks and behaves pretty realistically. Therefore, I was wondering if there are any use cases or examples of machine learning being used to create 2D/3D content automatically/efficiently with game engines, for example using machine learning + game engines to create product-specific advertisements automatically. Please feel free to share if you are aware of any relevant links or resources. Thanks! submitted by /u/Ok_Cardiologist8306 [link] [comments]  ( 2 min )
    [D] Third Party Model Validation
    Hi Everyone, I am working on a project to validate an XGBoost model developed by another team. Is there any guide or tutorial on how I could navigate through the project and validate the model? Should I be using synthetic data or request the team to provide unseen data? Any information would be helpful. Thank you submitted by /u/Professional-Ad-776 [link] [comments]  ( 1 min )
    [D] Has the algorithm from 'Testing the Manifold Hypothesis' been implemented by anyone?
    The paper I'm referring to is Testing the Manifold Hypothesis by Fefferman et al. In this paper, I believe they outline a hypothetical algorithm that tests whether a given dataset satisfies the manifold hypothesis for some specific class of manifolds. In a 10-year-old ppt, the authors said future work was to "make practical and test on real data," so now that 10 years have passed, has this algorithm been implemented? submitted by /u/wowAmaze [link] [comments]  ( 1 min )
  • Open

    How to Get Started on an Ontology Without Really Trying
    Ontology Hack – Make Use of Existing Enterprise Data Assets Instead of Starting from Scratch As an author of a (reasonably) popular book, I often get asked questions about semantics, ontology, and knowledge graph by people who have read the book or perhaps have heard me speak at a conference. I quite welcome these questions… Read More »How to Get Started on an Ontology Without Really Trying The post How to Get Started on an Ontology Without Really Trying appeared first on Data Science Central.  ( 6 min )
  • Open

    Use AWS AI and ML services to foster accessibility and inclusion of people with a visual or communication impairment
    AWS offers a broad set of artificial intelligence (AI) and machine learning (ML) services, including a suite of pre-trained, ready-to-use services for developers with no prior ML experience. In this post, we demonstrate how to use such services to build an application that fosters the inclusion of people with a visual or communication impairment, which […]  ( 10 min )
  • Open

    The Ultimate 2022 Python Roadmap For Everyone With Resources!
    If you want to become a Web-Developer, Machine Learning and Deep Learning Engineer, Data Scientist, DevOps Engineer, and more using Python…  ( 8 min )
    How is AI Reshaping Our Future? An Apparent Opinion
    AI is going to change a lot of things you can imagine.  ( 13 min )
  • Open

    From Code to Clinic, Smart Hospital Tech Boosts Efficiency, Sustainability in Medicine
    NVIDIA is collaborating with clinical organizations across Europe to bring AI to the point of care, bolstering clinical pathways with efficiency gains and new data dimensions that can be included in medical decision-making processes. The University Hospital Essen, in northwestern Germany, is one such organization taking machine learning from the bits to the bedside — Read article > The post From Code to Clinic, Smart Hospital Tech Boosts Efficiency, Sustainability in Medicine appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Penn Engineers Develop a New Chip Using a Deep Neural Network of Optical Waveguides That Can Classify Nearly 2 Billion Images Per Second
    👉 Using a deep neural network of optical waveguides, a new chip developed by Penn engineers, smaller than a square centimeter, can detect and classify an image in less than a nanosecond, all without the need for a separate processor or memory unit. 👉 They have achieved this through direct processing of light received from the object of interest using an optical deep neural network implemented on a 9.3 square millimeter chip. The study published in Nature explains how the chip’s many optical neurons are linked together using optical wires or “waveguides” to construct a deep network of many “neuron layers” that resembles the human brain. Information flows across the network’s layers, with each step assisting in classifying the input image into one of the learned categories. The pictures classified by the chip in the study were hand-drawn, letter-like characters. Continue reading | Check out the paper and post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Revisiting End-to-End Speech-to-Text Translation From Scratch. (arXiv:2206.04571v1 [cs.CL])
    End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks, without which translation performance drops substantially. However, transcripts are not always available, and how significant such pretraining is for E2E ST has rarely been studied in the literature. In this paper, we revisit this question and explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved. We reexamine several techniques proven beneficial to ST previously, and offer a set of best practices that biases a Transformer-based E2E ST system toward training from scratch. Besides, we propose parameterized distance penalty to facilitate the modeling of locality in the self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that, without using any transcripts or pretraining, the proposed system reaches and even outperforms previous studies adopting pretraining, although the gap remains in (extremely) low-resource settings. Finally, we discuss neural acoustic feature modeling, where a neural model is designed to extract acoustic features from raw speech signals directly, with the goal to simplify inductive biases and add freedom to the model in describing speech. For the first time, we demonstrate its feasibility and show encouraging results on ST tasks.  ( 2 min )
    Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations. (arXiv:2206.04632v1 [cs.RO])
    Learning from demonstration (LfD) methods have shown promise for solving multi-step tasks; however, these approaches do not guarantee successful reproduction of the task given disturbances. In this work, we identify the roots of such a challenge as the failure of the learned continuous policy to satisfy the discrete plan implicit in the demonstration. By utilizing modes (rather than subgoals) as the discrete abstraction and motion policies with both mode invariance and goal reachability properties, we prove our learned continuous policy can simulate any discrete plan specified by a Linear Temporal Logic (LTL) formula. Consequently, the imitator is robust to both task- and motion-level disturbances and guaranteed to achieve task success. Project page: https://sites.google.com/view/ltl-ds  ( 2 min )
    Diagnosing Ensemble Few-Shot Classifiers. (arXiv:2206.04372v1 [cs.LG])
    The base learners and labeled samples (shots) in an ensemble few-shot classifier greatly affect the model performance. When the performance is not satisfactory, it is usually difficult to understand the underlying causes and make improvements. To tackle this issue, we propose a visual analysis method, FSLDiagnotor. Given a set of base learners and a collection of samples with a few shots, we consider two problems: 1) finding a subset of base learners that well predict the sample collections; and 2) replacing the low-quality shots with more representative ones to adequately represent the sample collections. We formulate both problems as sparse subset selection and develop two selection algorithms to recommend appropriate learners and shots, respectively. A matrix visualization and a scatterplot are combined to explain the recommended learners and shots in context and facilitate users in adjusting them. Based on the adjustment, the algorithm updates the recommendation results for another round of improvement. Two case studies are conducted to demonstrate that FSLDiagnotor helps build a few-shot classifier efficiently and increases the accuracy by 12% and 21%, respectively.  ( 2 min )
    Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization. (arXiv:2206.04496v1 [cs.LG])
    A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse. That is, to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses and datasets from the literature, and empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.  ( 2 min )
    There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes. (arXiv:2206.04266v1 [cs.LG])
    Interpretability is an essential building block for trustworthiness in reinforcement learning systems. However, interpretability might come at the cost of deteriorated performance, leading many researchers to build complex models. Our goal is to analyze the cost of interpretability. We show that in certain cases, one can achieve policy interpretability while maintaining its optimality. We focus on a classical problem from reinforcement learning: mazes with $k$ obstacles in $\mathbb{R}^d$. We prove the existence of a small decision tree with a linear function at each inner node and depth $O(\log k + 2^d)$ that represents an optimal policy. Note that for the interesting case of a constant $d$, we have $O(\log k)$ depth. Thus, in this setting, there is no accuracy-interpretability tradeoff. To prove this result, we use a new "compressing" technique that might be useful in additional settings.  ( 2 min )
    DiSparse: Disentangled Sparsification for Multitask Model Compression. (arXiv:2206.04662v1 [cs.CV])
    Despite the popularity of Model Compression and Multitask Learning, how to effectively compress a multitask model has been less thoroughly analyzed due to the challenging entanglement of tasks in the parameter space. In this paper, we propose DiSparse, a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme. We consider each task independently by disentangling the importance measurement and take the unanimous decisions among all tasks when performing parameter pruning and selection. Our experimental results demonstrate superior performance on various configurations and settings compared to popular sparse training and pruning methods. Besides the effectiveness in compression, DiSparse also provides a powerful tool to the multitask learning community. Surprisingly, we even observed better performance than some dedicated multitask learning methods in several cases despite the high model sparsity enforced by DiSparse. We analyzed the pruning masks generated with DiSparse and observed strikingly similar sparse network architecture identified by each task even before the training starts. We also observe the existence of a "watershed" layer where the task relatedness sharply drops, implying no benefits in continued parameters sharing. Our code and models will be available at: https://github.com/SHI-Labs/DiSparse-Multitask-Model-Compression.  ( 2 min )
    ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret. (arXiv:2206.04122v1 [cs.GT])
    Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability in the tabular case. We show that the variance of the estimated regret of a tabular version of ESCHER with an oracle value function is significantly lower than that of outcome sampling MCCFR and tabular DREAM with an oracle value function. We then show that a deep learning version of ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- and the difference becomes dramatic as game size increases.  ( 2 min )
    Denoising Diffusion Implicit Models. (arXiv:2010.02502v3 [cs.LG] UPDATED)
    Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.  ( 2 min )
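    To make the faster sampling above concrete, here is a sketch of a single deterministic DDIM update (the noise-free case, sigma = 0) as it is commonly written, reduced to a scalar sample for readability. `alpha_bar_t` and `alpha_bar_prev` stand for the cumulative noise-schedule products at the current and the target step, and `eps_pred` stands in for the output of a trained noise-prediction network. The function name and the scalar setting are illustrative assumptions, not the authors' code.

```python
import math


def ddim_step(x_t: float, eps_pred: float,
              alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """One deterministic DDIM update (sigma = 0) for a scalar sample."""
    if not (0.0 < alpha_bar_t <= 1.0 and 0.0 < alpha_bar_prev <= 1.0):
        raise ValueError("alpha_bar values must lie in (0, 1]")
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (less noisy) target step, reusing eps_pred.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


if __name__ == "__main__":
    # Toy check: build x_t from a known x0 and noise, then step with the true noise.
    x0, eps = 2.0, 0.5
    x_t = math.sqrt(0.5) * x0 + math.sqrt(0.5) * eps
    print(ddim_step(x_t, eps, alpha_bar_t=0.5, alpha_bar_prev=0.8))
```

    Because the target step does not have to be the immediately preceding one, a sampler built on this update can skip through the schedule in far fewer steps than the training chain, which is where the wall-clock speedup reported in the abstract comes from.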
    Unsupervised Pre-Training on Patient Population Graphs for Patient-Level Predictions. (arXiv:2203.12616v2 [cs.LG] UPDATED)
    Pre-training has shown success in different areas of machine learning, such as Computer Vision (CV), Natural Language Processing (NLP) and medical imaging. However, it has not been fully explored for clinical data analysis. Even though an immense amount of Electronic Health Record (EHR) data is recorded, data and labels can be scarce if the data is collected in small hospitals or deals with rare diseases. In such scenarios, pre-training on a larger set of EHR data could improve the model performance. In this paper, we apply unsupervised pre-training to heterogeneous, multi-modal EHR data for patient outcome prediction. To model this data, we leverage graph deep learning over population graphs. We first design a network architecture based on graph transformer designed to handle various input feature types occurring in EHR data, like continuous, discrete, and time-series features, allowing better multi-modal data fusion. Further, we design pre-training methods based on masked imputation to pre-train our network before fine-tuning on different end tasks. Pre-training is done in a fully unsupervised fashion, which lays the groundwork for pre-training on large public datasets with different tasks and similar modalities in the future. We test our method on two medical datasets of patient records, TADPOLE and MIMIC-III, including imaging and non-imaging features and different prediction tasks. We find that our proposed graph based pre-training method helps in modeling the data at a population level and further improves performance on the fine tuning tasks in terms of AUC on average by 4.15% for MIMIC and 7.64% for TADPOLE.  ( 2 min )
    GCVAE: Generalized-Controllable Variational AutoEncoder. (arXiv:2206.04225v1 [stat.ML])
    Variational autoencoders (VAEs) have recently been used for unsupervised disentanglement learning of complex density distributions. Numerous variants exist to encourage disentanglement in latent space while improving reconstruction. However, none have simultaneously managed the trade-off between attaining extremely low reconstruction error and a high disentanglement score. We present a generalized framework to handle this challenge under constrained optimization and demonstrate that it outperforms state-of-the-art existing models as regards disentanglement while balancing reconstruction. We introduce three controllable Lagrangian hyperparameters to control reconstruction loss, KL divergence loss and correlation measure. We prove that maximizing information in the reconstruction network is equivalent to information maximization during amortized inference under reasonable assumptions and constraint relaxation.  ( 2 min )
    Balanced background and explanation data are needed in explaining deep learning models with SHAP: An empirical study on clinical decision making. (arXiv:2206.04050v1 [cs.LG])
    Objective: Shapley additive explanations (SHAP) is a popular post-hoc technique for explaining black box models. While the impact of data imbalance on predictive models has been extensively studied, it remains largely unknown with respect to SHAP-based model explanations. This study sought to investigate the effects of data imbalance on SHAP explanations for deep learning models, and to propose a strategy to mitigate these effects. Materials and Methods: We propose to adjust class distributions in the background and explanation data in SHAP when explaining black box models. Our data balancing strategy is to compose background data and explanation data with an equal distribution of classes. To evaluate the effects of data adjustment on model explanation, we propose to use the beeswarm plot as a qualitative tool to identify "abnormal" explanation artifacts, and quantitatively test the consistency between variable importance and prediction power. We demonstrated our proposed approach in an empirical study that predicted inpatient mortality using the Medical Information Mart for Intensive Care (MIMIC-III) data and a multilayer perceptron. Results: Using the data balancing strategy would allow us to reduce the number of the artifacts in the beeswarm plot, thus mitigating the negative effects of data imbalance. Additionally, with the balancing strategy, the top-ranked variables from the corresponding importance ranking demonstrated improved discrimination power. Discussion and Conclusion: Our findings suggest that balanced background and explanation data could help reduce the noise in explanation results induced by skewed data distribution and improve the reliability of variable importance ranking. Furthermore, these balancing procedures improve the potential of SHAP in identifying patients with abnormal characteristics in clinical applications.  ( 2 min )
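    The balancing strategy above (background and explanation data composed with an equal distribution of classes) can be sketched as a per-class subsample. The abstract does not give the exact procedure, so the helper below is only one plausible reading: the name `balanced_sample`, the fixed seed, and the choice to draw without replacement are assumptions.

```python
import random
from collections import defaultdict


def balanced_sample(rows: list, labels: list, n_per_class: int, seed: int = 0):
    """Draw n_per_class rows from every class, without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    smallest = min(len(members) for members in by_class.values())
    if n_per_class > smallest:
        raise ValueError(f"smallest class has only {smallest} rows")
    sample, sample_labels = [], []
    for label in sorted(by_class):
        for row in rng.sample(by_class[label], n_per_class):
            sample.append(row)
            sample_labels.append(label)
    return sample, sample_labels


if __name__ == "__main__":
    # Toy imbalanced data: 90 survivors (0) and 10 deaths (1).
    rows = list(range(100))
    labels = [0] * 90 + [1] * 10
    background, background_labels = balanced_sample(rows, labels, n_per_class=5)
    print(background_labels)
```

    The balanced rows would then be used as the background set (and, drawn separately, as the explanation set) when constructing the SHAP explainer, in place of a random sample that mirrors the skewed class distribution.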
    Automatic Debiased Machine Learning for Dynamic Treatment Effects and General Nested Functionals. (arXiv:2203.13887v3 [econ.EM] UPDATED)
    We extend the idea of automated debiased machine learning to the dynamic treatment regime and more generally to nested functionals. We show that the multiply robust formula for the dynamic treatment regime with discrete treatments can be re-stated in terms of a recursive Riesz representer characterization of nested mean regressions. We then apply a recursive Riesz representer estimation learning algorithm that estimates de-biasing corrections without the need to characterize what the correction terms look like, such as, for instance, products of inverse probability weighting terms, as is done in prior work on doubly robust estimation in the dynamic regime. Our approach defines a sequence of loss minimization problems, whose minimizers are the multipliers of the de-biasing correction, hence circumventing the need for solving auxiliary propensity models and directly optimizing for the mean squared error of the target de-biasing correction. We provide further applications of our approach to estimation of dynamic discrete choice models.  ( 2 min )
    Deep Surrogate Assisted Generation of Environments. (arXiv:2206.04199v1 [cs.AI])
    Recent progress in reinforcement learning (RL) has started producing generally capable agents that can solve a distribution of complex environments. These agents are typically tested on fixed, human-authored environments. On the other hand, quality diversity (QD) optimization has been proven to be an effective component of environment generation algorithms, which can generate collections of high-quality environments that are diverse in the resulting agent behaviors. However, these algorithms require potentially expensive simulations of agents on newly generated environments. We propose Deep Surrogate Assisted Generation of Environments (DSAGE), a sample-efficient QD environment generation algorithm that maintains a deep surrogate model for predicting agent behaviors in new environments. Results in two benchmark domains show that DSAGE significantly outperforms existing QD environment generation algorithms in discovering collections of environments that elicit diverse behaviors of a state-of-the-art RL agent and a planning agent.
    Choosing Answers in $\varepsilon$-Best-Answer Identification for Linear Bandits. (arXiv:2206.04456v1 [stat.ML])
    In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with the highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a \textit{furthest answer} should be identified. Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt best-arm identification algorithms to tackle $\varepsilon$-best-answer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified best-arm identification algorithms.
    Factuality Enhanced Language Models for Open-Ended Text Generation. (arXiv:2206.04624v1 [cs.CL])
    Pretrained language models (LMs) are prone to generating text with nonfactual information. In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation. We design the FactualityPrompts test set and metrics to measure the factuality of LM generations. Based on that, we study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B. Interestingly, we find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions. In addition, popular sampling algorithms (e.g., top-p) in open-ended text generation can harm the factuality due to the "uniform randomness" introduced at every sampling step. We propose the factual-nucleus sampling algorithm that dynamically adapts the randomness to improve the factuality of generation while maintaining quality. Furthermore, we analyze the inefficiencies of the standard training method in learning correct associations between entities from a factual text corpus (e.g., Wikipedia). We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts and sentence completion as the training objective, which can vastly reduce the factual errors.
    VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution. (arXiv:2206.04647v1 [eess.IV])
    Videos typically record the streaming and continuous visual data as discrete consecutive frames. Since the storage cost is expensive for videos of high fidelity, most of them are stored in a relatively low resolution and frame rate. Recent works of Space-Time Video Super-Resolution (STVSR) are developed to incorporate temporal interpolation and spatial super-resolution in a unified framework. However, most of them only support a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following the discrete representations, we propose Video Implicit Neural Representation (VideoINR), and we show its applications for STVSR. The learned implicit neural representation can be decoded to videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performances with state-of-the-art STVSR methods on common up-sampling scales and significantly outperforms prior works on continuous and out-of-training-distribution scales. Our project page is at this http URL .
    Reinforced Inverse Scattering. (arXiv:2206.04186v1 [cs.LG])
    Inverse wave scattering aims at determining the properties of an object using data on how the object scatters incoming waves. In order to collect information, sensors are put in different locations to send and receive waves from each other. The choice of sensor positions and incident wave frequencies determines the reconstruction quality of scatterer properties. This paper introduces reinforcement learning to develop precision imaging that decides sensor positions and wave frequencies adaptive to different scatterers in an intelligent way, thus obtaining a significant improvement in reconstruction quality with limited imaging resources. Extensive numerical results will be provided to demonstrate the superiority of the proposed method over existing methods.
    Learning Invariant Representations with Missing Data. (arXiv:2112.00881v2 [cs.LG] UPDATED)
    Spurious correlations allow flexible models to predict well during training but poorly on related test distributions. Recent work has shown that models that satisfy particular independencies involving correlation-inducing \textit{nuisance} variables have guarantees on their test performance. Enforcing such independencies requires nuisances to be observed during training. However, nuisances, such as demographics or image background labels, are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. Here we derive MMD estimators used for invariance objectives under missing nuisances. On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
    Towards Understanding Graph Neural Networks: An Algorithm Unrolling Perspective. (arXiv:2206.04471v1 [cs.LG])
    The graph neural network (GNN) has demonstrated its superior performance in various applications. The working mechanism behind it, however, remains mysterious. GNN models are designed to learn effective representations for graph-structured data, which intrinsically coincides with the principle of graph signal denoising (GSD). Algorithm unrolling, a "learning to optimize" technique, has gained increasing attention due to its prospects in building efficient and interpretable neural network architectures. In this paper, we introduce a class of unrolled networks built based on truncated optimization algorithms (e.g., gradient descent and proximal gradient descent) for GSD problems. They are shown to be tightly connected to many popular GNN models in that the forward propagations in these GNNs are in fact unrolled networks serving specific GSDs. Besides, the training process of a GNN model can be seen as solving a bilevel optimization problem with a GSD problem at the lower level. Such a connection brings a fresh view of GNNs, as we could try to understand their practical capabilities from their GSD counterparts, and it can also motivate designing new GNN models. Based on the algorithm unrolling perspective, an expressive model named UGDGNN, i.e., unrolled gradient descent GNN, is further proposed which inherits appealing theoretical properties. Extensive numerical simulations on seven benchmark datasets demonstrate that UGDGNN can achieve superior or competitive performance over the state-of-the-art models.
    HideNseek: Federated Lottery Ticket via Server-side Pruning and Sign Supermask. (arXiv:2206.04385v1 [cs.LG])
    Federated learning alleviates the privacy risk in distributed learning by transmitting only the local model updates to the central server. However, it faces challenges including statistical heterogeneity of clients' datasets and resource constraints of client devices, which severely impact the training performance and user experience. Prior works have tackled these challenges by combining personalization with model compression schemes including quantization and pruning. However, the pruning is data-dependent and thus must be done on the client side, which requires considerable computation cost. Moreover, the pruning normally trains a binary supermask $\in \{0, 1\}$, which significantly limits the model capacity yet brings no computation benefit. Consequently, the training requires high computation cost and a long time to converge while the model performance does not pay off. In this work, we propose HideNseek, which employs one-shot data-agnostic pruning at initialization to get a subnetwork based on weights' synaptic saliency. Each client then optimizes a sign supermask $\in \{-1, +1\}$ multiplied by the unpruned weights to allow faster convergence with the same compression rates as state-of-the-art. Empirical results from three datasets demonstrate that compared to state-of-the-art, HideNseek improves inference accuracy by up to 40.6\% while reducing the communication cost and training time by up to 39.7\% and 46.8\% respectively.
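    The sign-supermask idea in the abstract above can be illustrated with a minimal sketch. It assumes a flat list of frozen weights, a binary pruning mask fixed at initialization, and a per-weight sign in {-1, +1}; the function name `apply_sign_supermask` and the toy values are hypothetical and are not taken from the HideNseek implementation.

```python
def apply_sign_supermask(frozen_weights, prune_mask, sign_mask):
    """Combine frozen weights with a pruning mask and a sign supermask.

    frozen_weights: initial weights that are never updated (assumption).
    prune_mask:     0/1 entries; 0 means the weight was pruned at initialization.
    sign_mask:      -1/+1 entries; the only quantity a client would optimize.
    """
    if not (len(frozen_weights) == len(prune_mask) == len(sign_mask)):
        raise ValueError("all three inputs must have the same length")
    if any(s not in (-1, 1) for s in sign_mask):
        raise ValueError("sign_mask entries must be -1 or +1")
    if any(p not in (0, 1) for p in prune_mask):
        raise ValueError("prune_mask entries must be 0 or 1")
    # Effective weight = frozen weight * pruning decision * learned sign.
    return [w * p * s for w, p, s in zip(frozen_weights, prune_mask, sign_mask)]


# Toy example: the second weight is pruned; the first and last have flipped signs.
weights = [0.5, -1.2, 0.3, 2.0]
pruned = [1, 0, 1, 1]
signs = [-1, 1, 1, -1]
effective = apply_sign_supermask(weights, pruned, signs)
```

    In this sketch only `signs` would need to be communicated for the unpruned positions, which is one bit per surviving weight.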
    A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning. (arXiv:2206.04551v1 [cs.LG])
    The generalization of model-based reinforcement learning (MBRL) methods to environments with unseen transition dynamics is an important yet challenging problem. Existing methods try to extract environment-specified information $Z$ from past transition segments to make the dynamics prediction model generalizable to different dynamics. However, because environments are not labelled, the extracted information inevitably contains redundant information unrelated to the dynamics in transition segments and thus fails to maintain a crucial property of $Z$: $Z$ should be similar in the same environment and dissimilar in different ones. As a result, the learned dynamics prediction function will deviate from the true one, which undermines the generalization ability. To tackle this problem, we introduce an interventional prediction module to estimate the probability of two estimated $\hat{z}_i, \hat{z}_j$ belonging to the same environment. Furthermore, by utilizing the invariance of $Z$ within a single environment, a relational head is proposed to enforce the similarity between $\hat{{Z}}$ from the same environment. As a result, the redundant information will be reduced in $\hat{Z}$. We empirically show that $\hat{{Z}}$ estimated by our method contains less redundant information than that of previous methods, and such $\hat{{Z}}$ can significantly reduce dynamics prediction errors and improve the performance of model-based RL methods on zero-shot new environments with unseen dynamics. The code for this method is available at \url{https://github.com/CR-Gjx/RIA}.
    Distillation Decision Tree. (arXiv:2206.04661v1 [stat.ME])
    Black-box machine learning models are criticized as lacking interpretability, although they tend to have good prediction accuracy. Knowledge Distillation (KD) is an emerging tool to interpret the black-box model by distilling its knowledge into a transparent model. With well-known advantages in interpretation, the decision tree is a competitive candidate for the transparent model. However, theoretical or empirical understanding of the decision tree generated from the KD process is limited. In this paper, we name this kind of decision tree the distillation decision tree (DDT) and lay the theoretical foundations for tree structure stability, which determines the validity of DDT's interpretation. We prove that the structure of DDT can become stable (converge) under some mild assumptions. Meanwhile, we develop algorithms for stabilizing the induction of DDT, propose parallel strategies for improving the algorithm's computational efficiency, and introduce a marginal principal component analysis method for overcoming the curse of dimensionality in sampling. Simulated and real data studies justify our theoretical results, validate the efficacy of the algorithms, and demonstrate that DDT can strike a good balance between the model's prediction accuracy and interpretability.
    Responsible and Regulatory Conform Machine Learning for Medicine: A Survey of Challenges and Solutions. (arXiv:2107.09546v2 [cs.LG] UPDATED)
    Machine learning is expected to fuel significant improvements in medical care. To ensure that fundamental principles such as beneficence, respect for human autonomy, prevention of harm, justice, privacy, and transparency are respected, medical machine learning systems must be developed responsibly. Many high-level declarations of ethical principles have been put forth for this purpose, but there is a severe lack of technical guidelines explicating the practical consequences for medical machine learning. Similarly, there is currently considerable uncertainty regarding the exact regulatory requirements placed upon medical machine learning systems. This survey provides an overview of the technical and procedural challenges involved in creating medical machine learning systems responsibly and in conformity with existing regulations, as well as possible solutions to address these challenges. First, a brief review of existing regulations affecting medical machine learning is provided, showing that properties such as safety, robustness, reliability, privacy, security, transparency, explainability, and nondiscrimination are all demanded already by existing law and regulations - albeit, in many cases, to an uncertain degree. Next, the key technical obstacles to achieving these desirable properties are discussed, as well as important techniques to overcome these obstacles in the medical context. We notice that distribution shift, spurious correlations, model underspecification, uncertainty quantification, and data scarcity represent severe challenges in the medical context. Promising solution approaches include the use of large and representative datasets and federated learning as a means to that end, the careful exploitation of domain knowledge, the use of inherently transparent models, comprehensive out-of-distribution model testing and verification, as well as algorithmic impact assessments.
    RecoMed: A Knowledge-Aware Recommender System for Hypertension Medications. (arXiv:2201.05461v2 [cs.IR] UPDATED)
    Background and Objective: High medicine diversity has always been a significant challenge for prescription, causing confusion or doubt in physicians' decision-making process. This paper aims to develop a medicine recommender system called RecoMed to aid the physician in the prescription process of hypertension by providing information about what medications have been prescribed by other doctors and figuring out what other medicines can be recommended in addition to the one in question. Methods: There are two steps to the developed method: First, association rule mining algorithms are employed to find medicine association rules. The second step entails graph mining and clustering to present an enriched recommendation via ATC code, which itself comprises several steps. First, the initial graph is constructed from historical prescription data. Then, data pruning is performed in the second step, after which the medicines with a high repetition rate are removed at the discretion of a general medical practitioner. Next, the medicines are matched to a well-known medicine classification system called the ATC code to provide an enriched recommendation. And finally, the DBSCAN and Louvain algorithms cluster medicines in the final step. Results: A list of recommended medicines is provided as the system's output, and physicians can choose one or more of the medicines based on the patient's clinical symptoms. Only the medicines of class 2, related to high blood pressure medications, are used to assess the system's performance. The results obtained from this system have been reviewed and confirmed by an expert in this field.
    Russian Texts Detoxification with Levenshtein Editing. (arXiv:2204.13638v2 [cs.CL] UPDATED)
    Text detoxification is a style transfer task of creating neutral versions of toxic texts. In this paper, we use the concept of text editing to build a two-step tagging-based detoxification model using a parallel corpus of Russian texts. With this model, we achieved the best style transfer accuracy among all models in the RUSSE Detox shared task, surpassing larger sequence-to-sequence models.
    Regret Bounds for Information-Directed Reinforcement Learning. (arXiv:2206.04640v1 [cs.LG])
    Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target such that the practitioners can balance the computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS which learns the whole environment under tabular finite-horizon MDPs. In addition, we propose a computationally-efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
    DORA: Exploring outlier representations in Deep Neural Networks. (arXiv:2206.04530v1 [cs.LG])
    Deep Neural Networks (DNNs) draw their power from the representations they learn. In recent years, however, researchers have found that DNNs, while being incredibly effective in learning complex abstractions, also tend to be infected with artifacts, such as biases, Clever Hanses (CH), or Backdoors, due to spurious correlations inherent in the training data. So far, existing methods for uncovering such artifactual and malicious behavior in trained models focus on finding artifacts in the input data, which requires both the availability of a dataset and human intervention. In this paper, we introduce DORA (Data-agnOstic Representation Analysis): the first automatic data-agnostic method for the detection of potentially infected representations in Deep Neural Networks. We further show that contaminated representations found by DORA can be used to detect infected samples in any given dataset. We qualitatively and quantitatively evaluate the performance of our proposed method in both controlled toy scenarios and real-world settings, where we demonstrate the benefit of DORA in safety-critical applications.
    Model Degradation Hinders Deep Graph Neural Networks. (arXiv:2206.04361v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved great success in various graph mining tasks. However, drastic performance degradation is always observed when a GNN is stacked with many layers. As a result, most GNNs only have shallow architectures, which limits their expressive power and exploitation of deep neighborhoods. Most recent studies attribute the performance degradation of deep GNNs to the \textit{over-smoothing} issue. In this paper, we disentangle the conventional graph convolution operation into two independent operations: \textit{Propagation} (\textbf{P}) and \textit{Transformation} (\textbf{T}). Following this, the depth of a GNN can be split into the propagation depth ($D_p$) and the transformation depth ($D_t$). Through extensive experiments, we find that the major cause for the performance degradation of deep GNNs is the \textit{model degradation} issue caused by large $D_t$ rather than the \textit{over-smoothing} issue mainly caused by large $D_p$. Further, we present \textit{Adaptive Initial Residual} (AIR), a plug-and-play module compatible with all kinds of GNN architectures, to alleviate the \textit{model degradation} issue and the \textit{over-smoothing} issue simultaneously. Experimental results on six real-world datasets demonstrate that GNNs equipped with AIR outperform most GNNs with shallow architectures owing to the benefits of both large $D_p$ and $D_t$, while the time costs associated with AIR can be ignored.
    Understanding the unstable convergence of gradient descent. (arXiv:2204.01050v2 [math.OC] UPDATED)
    Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in machine learning applications step sizes often do not fulfill this condition, yet (stochastic) gradient descent still converges, albeit in an unstable manner. We investigate this unstable convergence phenomenon from first principles, and discuss key causes behind it. We also identify its main characteristics, and how they interrelate based on both theory and experiments, offering a principled view toward understanding the phenomenon.
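    The classical $2/L$ threshold mentioned in the abstract above can be checked on the simplest $L$-smooth cost, the one-dimensional quadratic $f(x) = (L/2)x^2$, where each gradient step multiplies $x$ by $(1 - \eta L)$. The sketch below is an illustration under that assumption only; the function name and the values $L = 4$, 50 iterations, and the two step sizes are chosen for illustration. It does not reproduce the unstable convergence the paper studies, which arises on non-quadratic costs.

```python
def gradient_descent_on_quadratic(step_size, smoothness=4.0, x0=1.0, iterations=50):
    """Run gradient descent on f(x) = (smoothness / 2) * x**2 and return the final |x|."""
    x = x0
    for _ in range(iterations):
        gradient = smoothness * x     # f'(x) = L * x
        x = x - step_size * gradient  # equivalent to x <- (1 - step_size * L) * x
    return abs(x)


L = 4.0
# Step size just below 2/L: the contraction factor is |1 - 1.9| = 0.9, so |x| shrinks.
final_below = gradient_descent_on_quadratic(step_size=1.9 / L, smoothness=L)
# Step size just above 2/L: the factor is |1 - 2.1| = 1.1, so |x| grows without bound.
final_above = gradient_descent_on_quadratic(step_size=2.1 / L, smoothness=L)
```

    On this quadratic the iterates oscillate in sign in both cases; only the magnitude of the factor $|1 - \eta L|$ decides whether they shrink or blow up.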
    Robust Inverse Framework using Knowledge-guided Self-Supervised Learning: An application to Hydrology. (arXiv:2109.06429v2 [cs.LG] UPDATED)
    Machine Learning is beginning to provide state-of-the-art performance in a range of environmental applications such as streamflow prediction in a hydrologic basin. However, building accurate broad-scale models for streamflow remains challenging in practice due to the variability in the dominant hydrologic processes, which are best captured by sets of process-related basin characteristics. Existing basin characteristics suffer from noise and uncertainty, among many other things, which adversely impact model performance. To tackle the above challenges, in this paper, we propose a novel Knowledge-guided Self-Supervised Learning (KGSSL) inverse framework to extract system characteristics from driver and response data. This first-of-its-kind framework achieves robust performance even when characteristics are corrupted. We show that KGSSL achieves state-of-the-art results for streamflow modeling for CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) which is a widely used hydrology benchmark dataset. Specifically, KGSSL outperforms other methods by up to 16\% in reconstructing characteristics. Furthermore, we show that KGSSL is relatively more robust to distortion than baseline methods, and outperforms the baseline model by 35\% when plugging in KGSSL inferred characteristics.
    Graph Attention MLP with Reliable Label Utilization. (arXiv:2108.10097v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have recently achieved state-of-the-art performance in many graph-based applications. Despite the high expressive power, they typically need to perform an expensive recursive neighborhood expansion in multiple training epochs and face a scalability issue. Moreover, most of them are inflexible since they are restricted to fixed-hop neighborhoods and insensitive to actual receptive field demands for different nodes. We circumvent these limitations by introducing a scalable and flexible Graph Attention Multilayer Perceptron (GAMLP). With the separation of the non-linear transformation and feature propagation, GAMLP significantly improves the scalability and efficiency by performing the propagation procedure in a pre-compute manner. With three principled receptive field attention mechanisms, each node in GAMLP is flexible and adaptive in leveraging the propagated features over the different sizes of reception field. We conduct extensive evaluations on the three large open graph benchmarks (e.g., ogbn-papers100M, ogbn-products and ogbn-mag), demonstrating that GAMLP not only achieves state-of-the-art performance, but also provides high scalability and efficiency.
    Study of Feature Importance for Quantum Machine Learning Models. (arXiv:2202.11204v4 [quant-ph] UPDATED)
    Predictor importance is a crucial part of data preprocessing pipelines in classical and quantum machine learning (QML). This work presents the first study of its kind in which feature importance for QML models has been explored and contrasted against their classical machine learning (CML) equivalents. We developed a hybrid quantum-classical architecture where QML models are trained and feature importance values are calculated from classical algorithms on a real-world dataset. This architecture has been implemented on ESPN Fantasy Football data using Qiskit statevector simulators and IBM quantum hardware such as the IBMQ Mumbai and IBMQ Montreal systems. Even though we are in the Noisy Intermediate-Scale Quantum (NISQ) era, the physical quantum computing results are promising. To facilitate current quantum scale, we created data tiering, model aggregation, and novel validation methods. Notably, the feature importance magnitudes from the quantum models had a much higher variation when contrasted to classical models. We can show that equivalent QML and CML models are complementary through diversity measurements. The diversity between QML and CML demonstrates that both approaches can contribute to a solution in different ways. Within this paper we focus on Quantum Support Vector Classifiers (QSVC), Variational Quantum Circuit (VQC), and their classical counterparts. The ESPN and IBM fantasy football Trade Assistant combines advanced statistical analysis with the natural language processing of Watson Discovery to serve up personalized trade recommendations that are fair. Here, player valuation data of each player has been considered and this work can be extended to calculate the feature importance of other QML models such as Quantum Boltzmann machines.
    A Psychological Theory of Explainability. (arXiv:2205.08452v2 [cs.AI] UPDATED)
    The goal of explainable Artificial Intelligence (XAI) is to generate human-interpretable explanations, but there are no computationally precise theories of how humans interpret AI generated explanations. The lack of theory means that validation of XAI must be done empirically, on a case-by-case basis, which prevents systematic theory-building in XAI. We propose a psychological theory of how humans draw conclusions from saliency maps, the most common form of XAI explanation, which for the first time allows for precise prediction of explainee inference conditioned on explanation. Our theory posits that absent explanation humans expect the AI to make similar decisions to themselves, and that they interpret an explanation by comparison to the explanations they themselves would give. Comparison is formalized via Shepard's universal law of generalization in a similarity space, a classic theory from cognitive science. A pre-registered user study on AI image classifications with saliency map explanations demonstrates that our theory quantitatively matches participants' predictions of the AI.
    Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models. (arXiv:2206.04589v1 [cs.DS])
    We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\epsilon \sqrt{\log(1/\epsilon)})$. Similarly, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon \log(1/\epsilon))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.
    Strategic Instrumental Variable Regression: Recovering Causal Relationships From Strategic Responses. (arXiv:2107.05762v3 [cs.LG] UPDATED)
    In settings where Machine Learning (ML) algorithms automate or inform consequential decisions about people, individual decision subjects are often incentivized to strategically modify their observable attributes to receive more favorable predictions. As a result, the distribution the assessment rule is trained on may differ from the one it operates on in deployment. While such distribution shifts, in general, can hinder accurate predictions, our work identifies a unique opportunity associated with shifts due to strategic responses: We show that we can use strategic responses effectively to recover causal relationships between the observable features and outcomes we wish to predict, even in the presence of unobserved confounding variables. Specifically, our work establishes a novel connection between strategic responses to ML models and instrumental variable (IV) regression by observing that the sequence of deployed models can be viewed as an instrument that affects agents' observable features but does not directly influence their outcomes. We show that our causal recovery method can be utilized to improve decision-making across several important criteria: individual fairness, agent outcomes, and predictive risk. In particular, we show that if decision subjects differ in their ability to modify non-causal attributes, any decision rule deviating from the causal coefficients can lead to (potentially unbounded) individual-level unfairness.
    Overcoming the Spectral Bias of Neural Value Approximation. (arXiv:2206.04672v1 [cs.LG])
    Value approximation using deep neural networks is at the heart of off-policy deep reinforcement learning, and is often the primary module that provides learning signals to the rest of the algorithm. While multi-layer perceptron networks are universal function approximators, recent works in neural kernel regression suggest the presence of a spectral bias, where fitting high-frequency components of the value function requires exponentially more gradient update steps than the low-frequency ones. In this work, we re-examine off-policy reinforcement learning through the lens of kernel regression and propose to overcome such bias via a composite neural tangent kernel. With just a single line-change, our approach, Fourier feature networks (FFN), produces state-of-the-art performance on challenging continuous control domains with only a fraction of the compute. Faster convergence and better off-policy stability also make it possible to remove the target network without suffering catastrophic divergences, which further reduces TD(0)'s estimation bias on a few tasks.
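    A generic Fourier feature mapping, of the kind commonly placed in front of a multi-layer perceptron to counter spectral bias, can be sketched as follows. This is an illustration only and not the paper's FFN layer: the frequencies here are fixed random Gaussian draws, and the names `make_fourier_features`, `scale`, and `seed` are assumptions introduced for the example.

```python
import math
import random


def make_fourier_features(input_dim, num_frequencies, scale=1.0, seed=0):
    """Return a function mapping x to [cos(2*pi*b_j.x)] + [sin(2*pi*b_j.x)].

    The frequency vectors b_j are sampled once from N(0, scale**2) per coordinate
    (an assumption for this sketch) and then kept fixed.
    """
    rng = random.Random(seed)
    frequencies = [
        [rng.gauss(0.0, scale) for _ in range(input_dim)]
        for _ in range(num_frequencies)
    ]

    def features(x):
        if len(x) != input_dim:
            raise ValueError("input has the wrong dimension")
        # One scalar projection 2*pi*<b_j, x> per frequency vector.
        projections = [
            2.0 * math.pi * sum(b_i * x_i for b_i, x_i in zip(b, x))
            for b in frequencies
        ]
        return [math.cos(p) for p in projections] + [math.sin(p) for p in projections]

    return features


# Encode a toy 3-dimensional state with 4 frequencies, giving 8 features.
encode = make_fourier_features(input_dim=3, num_frequencies=4)
state_features = encode([0.1, -0.2, 0.3])
```

    A larger `scale` exposes higher input frequencies to the downstream network; in this sketch it is the only knob controlling which frequencies are represented.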
    Learning to generalize Dispatching rules on the Job Shop Scheduling. (arXiv:2206.04423v1 [cs.LG])
    This paper introduces a Reinforcement Learning approach to better generalize heuristic dispatching rules on the Job-shop Scheduling Problem (JSP). Current models on the JSP do not focus on generalization, although, as we show in this work, this is key to learning better heuristics on the problem. A well-known technique to improve generalization is to learn on increasingly complex instances using Curriculum Learning (CL). However, as many works in the literature indicate, this technique might suffer from catastrophic forgetting when transferring the learned skills between different problem sizes. To address this issue, we introduce a novel Adversarial Curriculum Learning (ACL) strategy, which dynamically adjusts the difficulty level during the learning process to revisit the worst-performing instances. This work also presents a deep learning model to solve the JSP, which is equivariant w.r.t. the job definition and size-agnostic. Experiments on Taillard's and Demirkol's instances show that the presented approach significantly improves the current state-of-the-art models on the JSP. It reduces the average optimality gap from 19.35% to 10.46% on Taillard's instances and from 38.43% to 18.85% on Demirkol's instances. Our implementation is available online.
    Multi-modal Attention Network for Stock Movements Prediction. (arXiv:2112.13593v3 [cs.LG] UPDATED)
    Stock prices move as piece-wise trending fluctuation rather than a purely random walk. Traditionally, the prediction of future stock movements is based on the historical trading record. Nowadays, with the development of social media, many active participants in the market choose to publicize their strategies, which provides a window to glimpse over the whole market's attitude towards future movements by extracting the semantics behind social media. However, social media contains conflicting information and cannot replace historical records completely. In this work, we propose a multi-modality attention network to reduce conflicts and integrate semantic and numeric features to predict future stock movements comprehensively. Specifically, we first extract semantic information from social media and estimate their credibility based on posters' identity and public reputation. Then we incorporate the semantic from online posts and numeric features from historical records to make the trading strategy. Experimental results show that our approach outperforms previous methods by a significant margin in both prediction accuracy (61.20%) and trading profits (9.13%). It demonstrates that our method improves the performance of stock movements prediction and informs future research on multi-modality fusion towards stock prediction.
    AttX: Attentive Cross-Connections for Fusion of Wearable Signals in Emotion Recognition. (arXiv:2206.04625v1 [cs.LG])
    We propose cross-modal attentive connections, a new dynamic and effective technique for multimodal representation learning from wearable data. Our solution can be integrated into any stage of the pipeline, i.e., after any convolutional layer or block, to create intermediate connections between individual streams responsible for processing each modality. Additionally, our method benefits from two properties. First, it can share information uni-directionally (from one modality to the other) or bi-directionally. Second, it can be integrated into multiple stages at the same time to further allow network gradients to be exchanged in several touch-points. We perform extensive experiments on three public multimodal wearable datasets, WESAD, SWELL-KW, and CASE, and demonstrate that our method can effectively regulate and share information between different modalities to learn better representations. Our experiments further demonstrate that once integrated into simple CNN-based multimodal solutions (2, 3, or 4 modalities), our method can result in superior or competitive performance to state-of-the-art and outperform a variety of baseline uni-modal and classical multimodal methods.
    Neo-GNNs: Neighborhood Overlap-aware Graph Neural Networks for Link Prediction. (arXiv:2206.04216v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely applied to various fields for learning over graph-structured data. They have shown significant improvements over traditional heuristic methods in various tasks such as node classification and graph classification. However, since GNNs rely heavily on smoothed node features rather than graph structure, they often perform worse than simple heuristic methods in link prediction, where structural information, e.g., overlapped neighborhoods, degrees, and shortest paths, is crucial. To address this limitation, we propose Neighborhood Overlap-aware Graph Neural Networks (Neo-GNNs) that learn useful structural features from an adjacency matrix and estimate overlapped neighborhoods for link prediction. Our Neo-GNNs generalize neighborhood overlap-based heuristic methods and handle overlapped multi-hop neighborhoods. Our extensive experiments on Open Graph Benchmark datasets (OGB) demonstrate that Neo-GNNs consistently achieve state-of-the-art performance in link prediction. Our code is publicly available at https://github.com/seongjunyun/Neo_GNNs.
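For readers unfamiliar with the neighborhood overlap heuristics that the abstract above says Neo-GNNs generalize, here is a small sketch of two classical ones, common neighbors and Adamic-Adar. It illustrates the baseline heuristics only, not the Neo-GNN model; the toy graph and the function names are assumptions made for this example.

```python
import math


def common_neighbors(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """Sum of 1/log(degree) over shared neighbors of u and v.

    Low-degree shared neighbors contribute more to the score.
    """
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v]
               if len(adj[w]) > 1)


# Toy undirected graph stored as node -> set of neighbors.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}

cn_score = common_neighbors(adj, "a", "b")  # shared neighbors: c and d
aa_score = adamic_adar(adj, "a", "b")       # 1/log(2) + 1/log(3)
```

Both scores depend only on the adjacency structure, which is the kind of information the abstract argues feature-smoothing GNNs tend to underuse in link prediction.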
    Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation. (arXiv:2203.15041v2 [cs.RO] UPDATED)
    Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human-populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially compliant navigation behaviors on these robots becomes critical to ensuring safe and comfortable human-robot coexistence. To address this challenge, imitation learning is a promising framework, since it is easier for humans to demonstrate the task of social navigation than to formulate reward functions that accurately capture the complex multi-objective setting of social navigation. The application of imitation learning and inverse reinforcement learning to social navigation for mobile robots, however, is currently hindered by a lack of large-scale datasets that capture socially compliant robot navigation demonstrations in the wild. To fill this gap, we introduce the Socially CompliAnt Navigation Dataset (SCAND), a large-scale, first-person-view dataset of socially compliant navigation demonstrations. Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations comprising multi-modal data streams including 3D lidar, joystick commands, odometry, and visual and inertial information, collected on two morphologically different mobile robots, a Boston Dynamics Spot and a Clearpath Jackal, by four different human demonstrators in both indoor and outdoor environments. We additionally perform preliminary analysis and validation through real-world robot experiments and show that navigation policies learned by imitation learning on SCAND generate socially compliant behaviors
    Learning in Distributed Contextual Linear Bandits Without Sharing the Context. (arXiv:2206.04180v1 [cs.LG])
    Contextual linear bandits is a rich and theoretically important model that has many practical applications. Recently, this setup gained a lot of interest in applications over wireless where communication constraints can be a performance bottleneck, especially when the contexts come from a large $d$-dimensional space. In this paper, we consider a distributed memoryless contextual linear bandit learning problem, where the agents who observe the contexts and take actions are geographically separated from the learner who performs the learning while not seeing the contexts. We assume that contexts are generated from a distribution and propose a method that uses $\approx 5d$ bits per context for the case of unknown context distribution and $0$ bits per context if the context distribution is known, while achieving nearly the same regret bound as if the contexts were directly observable. The former bound improves upon existing bounds by a $\log(T)$ factor, where $T$ is the length of the horizon, while the latter achieves information theoretical tightness.
    Generative Flow Networks for Discrete Probabilistic Modeling. (arXiv:2202.01361v2 [cs.LG] UPDATED)
    We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks. Code is publicly available at https://github.com/zdhNarsil/EB_GFN.
    What Makes Transfer Learning Work For Medical Images: Feature Reuse & Other Factors. (arXiv:2203.01825v2 [cs.LG] UPDATED)
    Transfer learning is a standard technique to transfer knowledge from one domain to another. For applications in medical imaging, transfer from ImageNet has become the de-facto approach, despite differences in the tasks and image characteristics between the domains. However, it is unclear what factors determine whether - and to what extent - transfer learning to the medical domain is useful. The long-standing assumption that features from the source domain get reused has recently been called into question. Through a series of experiments on several medical image benchmark datasets, we explore the relationship between transfer learning, data size, the capacity and inductive bias of the model, as well as the distance between the source and target domain. Our findings suggest that transfer learning is beneficial in most cases, and we characterize the important role feature reuse plays in its success.
    Quick survey of graph-based fraud detection methods. (arXiv:1910.11299v3 [cs.LG] CROSS LISTED)
    In general, anomaly detection is the problem of distinguishing between normal data samples with well-defined patterns or signatures and those that do not conform to the expected profiles. Financial transactions, customer reviews, and social media posts are all characterized by relational information. In these networks, fraudulent behaviour may appear as a distinctive graph edge, such as a spam message, a node, or a larger subgraph structure, such as when a group of clients engage in money laundering schemes. Most commonly, these networks are represented as attributed graphs, with numerical features complementing relational information. We present a survey on anomaly detection techniques used for fraud detection that exploit both the graph structure underlying the data and the contextual information contained in the attributes.
    Robust Matrix Completion with Heavy-tailed Noise. (arXiv:2206.04276v1 [math.ST])
    This paper studies low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise, where we aim to estimate an underlying low-rank matrix given a set of highly incomplete noisy entries. Though the matrix completion problem has attracted much attention in the past decade, there is still a lack of theoretical understanding when the observations are contaminated by heavy-tailed noise. Prior theory falls short of explaining the empirical results and is unable to capture the optimal dependence of the estimation error on the noise level. In this paper, we adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors when the parameter in the loss function is carefully designed to balance the Huberization biases and robustness to outliers. Then, we propose an efficient nonconvex algorithm via a balanced low-rank Burer-Monteiro matrix factorization and gradient descent with robust spectral initialization. We prove that under merely a bounded second moment condition on the error distributions, rather than the sub-Gaussian assumption, the Euclidean error of the iterates generated by the proposed algorithm decreases geometrically fast until achieving a minimax-optimal statistical estimation error, which has the same order as that in the sub-Gaussian case. The key technique behind this significant advancement is a powerful leave-one-out analysis framework. The theoretical results are corroborated by our simulation studies.
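As a reminder of what the Huber loss mentioned above looks like, the sketch below implements the standard definition and its derivative on a single residual. The threshold `tau` is passed in as an ordinary parameter here; how the paper adapts it to the noise level and sample size is not stated in the abstract, so no adaptive rule is attempted.

```python
def huber_loss(residual, tau):
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= tau:
        return 0.5 * residual * residual
    return tau * a - 0.5 * tau * tau


def huber_grad(residual, tau):
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, residual))


small = huber_loss(0.5, tau=1.0)    # quadratic regime: 0.125
large = huber_loss(-3.0, tau=1.0)   # linear regime: 2.5
clipped = huber_grad(-3.0, tau=1.0) # gradient saturates at -1.0
```

Because the gradient is clipped, a single heavy-tailed observation can move a gradient step by at most `tau`, which is the source of the robustness; a small `tau` increases robustness but also the bias relative to squared loss.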
    Hilbert Curve Projection Distance for Distribution Comparison. (arXiv:2205.15059v2 [cs.LG] UPDATED)
    Distribution comparison plays a central role in many machine learning tasks like data classification and generative modeling. In this study, we propose a novel metric, called Hilbert curve projection (HCP) distance, to measure the distance between two probability distributions with high robustness and low complexity. In particular, we first project two high-dimensional probability densities using Hilbert curve to obtain a coupling between them, and then calculate the transport distance between these two densities in the original space, according to the coupling. We show that HCP distance is a proper metric and is well-defined for absolutely continuous probability measures. Furthermore, we demonstrate that the empirical HCP distance converges to its population counterpart at a rate of no more than $O(n^{-1/2d})$ under regularity conditions. To suppress the curse-of-dimensionality, we also develop two variants of the HCP distance using (learnable) subspace projections. Experiments on both synthetic and real-world data show that our HCP distance works as an effective surrogate of the Wasserstein distance with low complexity and overcomes the drawbacks of the sliced Wasserstein distance.
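The two-step construction described above (project with a Hilbert curve to obtain a coupling, then measure transport cost in the original space) can be illustrated with a heavily simplified sketch. It assumes two equal-size, equally weighted point sets in the unit square, a discrete 2D Hilbert curve on a `2**order` grid, and rank-based matching; the names `hilbert_index` and `hcp_distance` are hypothetical, and this is not the authors' estimator.

```python
import math


def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on an n-by-n grid.

    n must be a power of two; this is the classical xy-to-d conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hcp_distance(xs, ys, order=8, p=2):
    """Sort both samples along the curve, pair by rank, average the cost."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    n = 1 << order

    def key(pt):
        # Quantize a point of the unit square to a grid cell.
        gx = min(n - 1, max(0, int(pt[0] * n)))
        gy = min(n - 1, max(0, int(pt[1] * n)))
        return hilbert_index(n, gx, gy)

    a = sorted(xs, key=key)
    b = sorted(ys, key=key)
    # The cost is evaluated in the original space, not along the curve.
    cost = sum(math.dist(u, v) ** p for u, v in zip(a, b)) / len(a)
    return cost ** (1.0 / p)
```

Sorting dominates the running time, so the sketch runs in O(n log n) for n points per sample, in line with the low-complexity motivation stated in the abstract.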
    Contextual Information-Directed Sampling. (arXiv:2205.10895v2 [cs.LG] UPDATED)
    Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm. However, it is still unclear what is the right form of information ratio to optimize when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more on the actions that are beneficial for the future unseen contexts while the conditional IDS can be myopic. We further propose a computationally-efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.
    Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams. (arXiv:2202.08312v2 [cs.LG] UPDATED)
    Motivated by recent applications requiring differential privacy over adaptive streams, we investigate the question of optimal instantiations of the matrix mechanism in this setting. We prove fundamental theoretical results on the applicability of matrix factorizations to adaptive streams, and provide a parameter-free fixed-point algorithm for computing optimal factorizations. We instantiate this framework with respect to concrete matrices which arise naturally in machine learning, and train user-level differentially private models with the resulting optimal mechanisms, yielding significant improvements in a notable problem in federated learning with user-level differential privacy.
    Privacy-Aware Compression for Federated Data Analysis. (arXiv:2203.08134v2 [cs.LG] UPDATED)
    Federated data analytics is a framework for distributed data analysis where a server compiles noisy responses from a group of distributed low-bandwidth user devices to estimate aggregate statistics. Two major challenges in this framework are privacy, since user data is often sensitive, and compression, since the user devices have low network bandwidth. Prior work has addressed these challenges separately by combining standard compression algorithms with known privacy mechanisms. In this work, we take a holistic look at the problem and design a family of privacy-aware compression mechanisms that work for any given communication budget. We first propose a mechanism for transmitting a single real number that has optimal variance under certain conditions. We then show how to extend it to metric differential privacy for location privacy use-cases, as well as vectors, for application to federated learning. Our experiments illustrate that our mechanism can lead to better utility vs. compression trade-offs for the same privacy loss in a number of settings.
    Evaluating State of the Art, Forecasting Ensembles- and Meta-learning Strategies for Model Fusion. (arXiv:2203.03279v2 [cs.LG] UPDATED)
    Techniques of hybridisation and ensemble learning are popular model fusion techniques for improving the predictive power of forecasting methods. With limited research that instigates combining these two promising approaches, this paper focuses on the utility of the Exponential-Smoothing-Recurrent Neural Network (ES-RNN) in the pool of base models for different ensembles. We compare against some state of the art ensembling techniques and arithmetic model averaging as a benchmark. We experiment with the M4 forecasting data set of 100,000 time-series, and the results show that the Feature-based Forecast Model Averaging (FFORMA), on average, is the best technique for late data fusion with the ES-RNN. However, considering the M4's Daily subset of data, stacking was the only successful ensemble at dealing with the case where all base model performances are similar. Our experimental results indicate that we attain state of the art forecasting results compared to N-BEATS as a benchmark. We conclude that model averaging is a more robust ensemble than model selection and stacking strategies. Further, the results show that gradient boosting is superior for implementing ensemble learning strategies.
    Markovian Interference in Experiments. (arXiv:2206.02371v2 [cs.LG] UPDATED)
    We consider experiments in dynamical systems where interventions on some experimental units impact other units through a limiting constraint (such as a limited inventory). Despite outsize practical importance, the best estimators for this `Markovian' interference problem are largely heuristic in nature, and their bias is not well understood. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, apparently incur a large penalty in variance relative to state-of-the-art heuristics. We introduce an on-policy estimator: the Differences-In-Q's (DQ) estimator. We show that the DQ estimator can in general have exponentially smaller variance than off-policy evaluation. At the same time, its bias is second order in the impact of the intervention. This yields a striking bias-variance tradeoff so that the DQ estimator effectively dominates state-of-the-art alternatives. From a theoretical perspective, we introduce three separate novel techniques that are of independent interest in the theory of Reinforcement Learning (RL). Our empirical evaluation includes a set of experiments on a city-scale ride-hailing simulator.
    Time Delay Estimation of Traffic Congestion Propagation based on Transfer Entropy. (arXiv:2108.06717v2 [stat.ML] UPDATED)
    Considering how congestion will propagate in the near future, understanding traffic congestion propagation has become crucial in GPS navigation systems for providing users with a more accurate estimated time of arrival (ETA). However, providing the exact ETA during congestion is a challenge owing to the complex propagation process between roads and high uncertainty regarding the future behavior of the process. Recent studies have focused on finding frequent congestion propagation patterns and determining the propagation probabilities. By contrast, this study proposes a novel time delay estimation method for traffic congestion propagation between roads using lag-specific transfer entropy (TE). Nonlinear normalization with a sliding window is used to effectively reveal the causal relationship between the source and target time series in calculating the TE. Moreover, Markov bootstrap techniques were adopted to quantify the uncertainty in the time delay estimator. To the best of our knowledge, the time delay estimation method presented in this article is the first to determine the time delay between roads for any congestion propagation pattern. The proposed method was validated using simulated data as well as real user trajectory data obtained from a major GPS navigation system applied in South Korea.
    Vector Optimization with Stochastic Bandit Feedback. (arXiv:2110.12311v3 [cs.LG] UPDATED)
    We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study an ($\epsilon,\delta$)-PAC Pareto set identification problem where an evaluation of each design yields a noisy observation of the mean reward vector. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We provide gap-dependent and worst-case lower bounds on the sample complexity and show that in the worst case the sample complexity scales with the square of the ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and the sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set, and the success of identification.
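To make the role of the ordering cone concrete, the sketch below computes the exact Pareto set of known mean vectors under a polyhedral cone written as $C = \{z : Wz \ge 0\}$. This is a noise-free illustration only: the representation by a matrix `W`, the function names, and the example vectors are assumptions for this example, and the paper's PAC identification procedure with noisy observations is not reproduced.

```python
def in_cone(W, z):
    """True if z lies in the polyhedral cone {z : W z >= 0}."""
    return all(sum(w * zi for w, zi in zip(row, z)) >= 0 for row in W)


def dominates(W, y, x):
    """y dominates x under the cone order if y != x and y - x is in the cone."""
    if tuple(y) == tuple(x):
        return False
    return in_cone(W, [yi - xi for yi, xi in zip(y, x)])


def pareto_set(W, means):
    """Indices of designs not dominated by any other design."""
    return [i for i, x in enumerate(means)
            if not any(dominates(W, y, x)
                       for j, y in enumerate(means) if j != i)]


means = [(1, 3), (3, 1), (2, 2), (1, 1)]

# Positive orthant: the usual componentwise Pareto order.
orthant = [[1, 0], [0, 1]]
# A wider cone that tolerates losses in objective 2 when objective 1 gains.
wider = [[1, 0], [1, 1]]

standard_front = pareto_set(orthant, means)  # [0, 1, 2]
cone_front = pareto_set(wider, means)        # [1]
```

Enlarging the cone makes more pairs comparable, so the Pareto set shrinks from three designs to one; this is how different decision-maker preferences are encoded by the choice of $C$.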
    TubeDETR: Spatio-Temporal Video Grounding with Transformers. (arXiv:2203.16434v2 [cs.CV] UPDATED)
    We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.
    Multi-task Self-distillation for Graph-based Semi-Supervised Learning. (arXiv:2112.01174v2 [cs.LG] UPDATED)
    Graph convolutional networks have made great progress in graph-based semi-supervised learning. Existing methods mainly assume that nodes connected by graph edges are prone to have similar attributes and labels, so that the features smoothed by local graph structures can reveal the class similarities. However, there often exist mismatches between graph structures and labels in many real-world scenarios, where the structures may propagate misleading features or labels that eventually affect the model performance. In this paper, we propose a multi-task self-distillation framework that injects self-supervised learning and self-distillation into graph convolutional networks to separately address the mismatch problem from the structure side and the label side. First, we formulate a self-supervision pipeline based on pre-text tasks to capture different levels of similarities in graphs. The feature extraction process is encouraged to capture more complex proximity by jointly optimizing the pre-text task and the target task. Consequently, the local feature aggregations are improved from the structure side. Second, self-distillation uses soft labels of the model itself as additional supervision, which has similar effects as label smoothing. The knowledge from the classification pipeline and the self-supervision pipeline is collectively distilled to improve the generalization ability of the model from the label side. Experiment results show that the proposed method obtains remarkable performance gains under several classic graph convolutional architectures.
    Objective-Based Hierarchical Clustering of Deep Embedding Vectors. (arXiv:2012.08466v2 [cs.LG] UPDATED)
    We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).
    Globally Optimal Algorithms for Fixed-Budget Best Arm Identification. (arXiv:2206.04646v1 [stat.ML])
    We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, the rate (exponent) of this probability has been characterized only to a limited extent. In this paper, we characterize the optimal rate as a result of global optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the misidentification probability, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To deal with this issue, we introduce the second rate $R^{\mathrm{go}}_{\infty}$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
    A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning. (arXiv:2206.04621v1 [cs.CR])
    We review the use of differential privacy (DP) for privacy protection in machine learning (ML). We show that, driven by the aim of preserving the accuracy of the learned models, DP-based ML implementations are so loose that they do not offer the ex ante privacy guarantees of DP. Instead, what they deliver is basically noise addition similar to the traditional (and often criticized) statistical disclosure control approach. Due to the lack of formal privacy guarantees, the actual level of privacy offered must be experimentally assessed ex post, which is done very seldom. In this respect, we present empirical results showing that standard anti-overfitting techniques in ML can achieve a better utility/privacy/efficiency trade-off than DP.
    Fast Hierarchical Games for Image Explanations. (arXiv:2104.06164v2 [cs.CV] UPDATED)
    As modern complex neural networks keep breaking records and solving harder problems, their predictions also become less and less intelligible. The current lack of interpretability often undermines the deployment of accurate machine learning tools in sensitive settings. In this work, we present a model-agnostic explanation method for image classification based on a hierarchical extension of Shapley coefficients--Hierarchical Shap (h-Shap)--that resolves some of the limitations of current approaches. Unlike other Shapley-based explanation methods, h-Shap is scalable and can be computed without the need of approximation. Under certain distributional assumptions, such as those common in multiple instance learning, h-Shap retrieves the exact Shapley coefficients with an exponential improvement in computational complexity. We compare our hierarchical approach with popular Shapley-based and non-Shapley-based methods on a synthetic dataset, a medical imaging scenario, and a general computer vision problem, showing that h-Shap outperforms the state of the art in both accuracy and runtime. Code and experiments are made publicly available.
    On the Parameter Combinations That Matter and on Those That do Not. (arXiv:2110.06717v2 [cs.LG] UPDATED)
    We present a data-driven approach to characterizing nonidentifiability of a model's parameters and illustrate it through dynamic as well as steady kinetic models. By employing Diffusion Maps and their extensions, we discover the minimal combinations of parameters required to characterize the output behavior of a chemical system: a set of effective parameters for the model. Furthermore, we introduce and use a Conformal Autoencoder Neural Network technique, as well as a kernel-based Jointly Smooth Function technique, to disentangle the redundant parameter combinations that do not affect the output behavior from the ones that do. We discuss the interpretability of our data-driven effective parameters, and demonstrate the utility of the approach both for behavior prediction and parameter estimation. In the latter task, it becomes important to describe level sets in parameter space that are consistent with a particular output behavior. We validate our approach on a model of multisite phosphorylation, where a reduced set of effective parameters (nonlinear combinations of the physical ones) has previously been established analytically.
    Multivariate feature ranking of gene expression data. (arXiv:2111.02357v4 [cs.LG] UPDATED)
    Gene expression datasets are usually of high dimensionality and therefore require efficient and effective methods for identifying the relative importance of their attributes. Due to the huge size of the search space of the possible solutions, the attribute subset evaluation feature selection methods tend to be not applicable, so in these scenarios feature ranking methods are used. Most of the feature ranking methods described in the literature are univariate methods, so they do not detect interactions between factors. In this paper we propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency, which we have applied in three gene expression classification problems. We statistically prove that the proposed methods outperform the state of the art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as feature selection methods of attribute subset evaluation based on correlation and consistency with multi-objective evolutionary search strategy.
    The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training. (arXiv:2007.12826v3 [stat.ML] UPDATED)
    Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by the one of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).
    On Margins and Generalisation for Voting Classifiers. (arXiv:2206.04607v1 [cs.LG])
    We study the generalisation properties of majority voting on finite ensembles of classifiers, proving margin-based generalisation bounds via the PAC-Bayes theory. These provide state-of-the-art guarantees on a number of classification tasks. Our central results leverage the Dirichlet posteriors studied recently by Zantedeschi et al. [2021] for training voting classifiers; in contrast to that work our bounds apply to non-randomised votes via the use of margins. Our contributions add perspective to the debate on the "margins theory" proposed by Schapire et al. [1998] for the generalisation of ensemble classifiers.
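As a rough illustration of the margin of a non-randomised weighted vote, here is a minimal sketch. It assumes binary labels in {-1, +1} and a finite set of voters; the function name `vote_margins` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def vote_margins(votes, labels, weights):
    """Margins of a weighted majority vote (binary +/-1 setting, assumed here).

    votes:   array of shape (n_voters, n_examples) with entries in {-1, +1}
    labels:  array of shape (n_examples,) with entries in {-1, +1}
    weights: non-negative voter weights, normalised to sum to 1 below
    """
    votes = np.asarray(votes, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted vote per example, signed by the true label:
    # a positive margin means the weighted majority is correct.
    return labels * (weights @ votes)

# Toy usage: three voters, two examples.
margins = vote_margins([[1, 1], [1, -1], [-1, -1]], [1, -1], [1, 1, 2])
```

A margin near zero means the vote is close to a tie; margin-based bounds reward ensembles whose margins are large on most training examples.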
    Conformal Off-Policy Prediction in Contextual Bandits. (arXiv:2206.04405v1 [stat.ML])
    Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose conformal off-policy prediction (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.
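For readers unfamiliar with conformal prediction, the following is a minimal sketch of plain split conformal prediction for regression. It is not COPP: it assumes calibration and test data are exchangeable and does not account for the shift between behavioral and target policies. The function name and the toy data are hypothetical.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Predictive intervals from held-out calibration residuals.

    Under exchangeability, the interval covers the true outcome with
    probability at least 1 - alpha (finite-sample, marginal coverage).
    """
    cal_pred = np.asarray(cal_pred, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)
    # Nonconformity score: absolute residual on the calibration set.
    scores = np.abs(cal_true - cal_pred)
    n = scores.size
    # Finite-sample corrected rank of the calibration quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # Too few calibration points for the requested level: infinite interval.
    q = np.inf if k > n else np.sort(scores)[k - 1]
    return test_pred - q, test_pred + q

# Toy usage: nine calibration residuals 1..9, one test prediction at 0.
lower, upper = split_conformal_interval(np.zeros(9), np.arange(1, 10), [0.0], alpha=0.5)
```

The interval width is set entirely by the calibration residuals, so the guarantee holds regardless of how the underlying point predictor was trained.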
    FogAdapt: Self-Supervised Domain Adaptation for Semantic Segmentation of Foggy Images. (arXiv:2201.02588v3 [cs.CV] UPDATED)
    This paper presents FogAdapt, a novel approach for domain adaptation of semantic segmentation for dense foggy scenes. Although significant research has been directed to reduce the domain shift in semantic segmentation, adaptation to scenes with adverse weather conditions remains an open question. Large variations in the visibility of the scene due to weather conditions, such as fog, smog, and haze, exacerbate the domain shift, thus making unsupervised adaptation in such scenarios challenging. We propose a self-entropy and multi-scale information augmented self-supervised domain adaptation method (FogAdapt) to minimize the domain shift in foggy scenes segmentation. Supported by the empirical evidence that an increase in fog density results in high self-entropy for segmentation probabilities, we introduce a self-entropy based loss function to guide the adaptation method. Furthermore, inferences obtained at different image scales are combined and weighted by the uncertainty to generate scale-invariant pseudo-labels for the target domain. These scale-invariant pseudo-labels are robust to visibility and scale variations. We evaluate the proposed model on real clear-weather scenes to real foggy scenes adaptation and synthetic non-foggy images to real foggy scenes adaptation scenarios. Our experiments demonstrate that FogAdapt significantly outperforms the current state-of-the-art in semantic segmentation of foggy images. Specifically, by considering the standard settings compared to state-of-the-art (SOTA) methods, FogAdapt gains 3.8% on Foggy Zurich, 6.0% on Foggy Driving-dense, and 3.6% on Foggy Driving in mIoU when adapted from Cityscapes to Foggy Zurich.
    Accurate Node Feature Estimation with Structured Variational Graph Autoencoder. (arXiv:2206.04516v1 [cs.LG])
    Given a graph with partial observations of node features, how can we estimate the missing features accurately? Feature estimation is a crucial problem for analyzing real-world graphs whose features are commonly missing during the data collection process. Accurate estimation not only provides diverse information of nodes but also supports the inference of graph neural networks that require the full observation of node features. However, designing an effective approach for estimating high-dimensional features is challenging, since it requires an estimator to have large representation power, increasing the risk of overfitting. In this work, we propose SVGA (Structured Variational Graph Autoencoder), an accurate method for feature estimation. SVGA applies strong regularization to the distribution of latent variables by structured variational inference, which models the prior of variables as Gaussian Markov random field based on the graph structure. As a result, SVGA combines the advantages of probabilistic inference and graph neural networks, achieving state-of-the-art performance in real datasets.
    Clustering with Queries under Semi-Random Noise. (arXiv:2206.04583v1 [cs.LG])
    The seminal paper by Mazumdar and Saha \cite{MS17a} introduced an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact probabilities of errors of the underlying fully-random oracle. In this work, we develop robust learning methods that tolerate general semi-random noise obtaining qualitatively the same guarantees as the best possible methods in the fully-random model. More specifically, given a set of $n$ points with an unknown underlying partition, we are allowed to query pairs of points $u,v$ to check if they are in the same cluster, but with probability $p$, the answer may be adversarially chosen. We show that information theoretically $O\left(\frac{nk \log n} {(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large size. Our main result is a computationally efficient algorithm that can identify large clusters with $O\left(\frac{nk \log n} {(1-2p)^2}\right) + \text{poly}\left(\log n, k, \frac{1}{1-2p} \right)$ queries, matching the guarantees of the best known algorithms in the fully-random model. As a corollary of our approach, we develop the first parameter-free algorithm for the fully-random model, answering an open question by \cite{MS17a}.
    Contrastive Regularization for Semi-Supervised Learning. (arXiv:2201.06247v2 [cs.LG] UPDATED)
    Consistency regularization on label predictions has become a fundamental technique in semi-supervised learning, but it still requires a large number of training iterations for high performance. In this study, our analysis shows that consistency regularization restricts the propagation of labeling information because samples with unconfident pseudo-labels are excluded from the model updates. We then propose contrastive regularization to improve both the efficiency and accuracy of consistency regularization through well-clustered features of unlabeled data. Specifically, after strongly augmented samples are assigned to clusters by their pseudo-labels, our contrastive regularization updates the model so that the features with confident pseudo-labels aggregate the features in the same cluster, while pushing away features in different clusters. As a result, the information of confident pseudo-labels can be effectively propagated into more unlabeled samples during training by the well-clustered features. On benchmarks of semi-supervised learning tasks, our contrastive regularization improves the previous consistency-based methods and achieves state-of-the-art results, especially with fewer training iterations. Our method also shows robust performance on open-set semi-supervised learning where unlabeled data includes out-of-distribution samples.
    Variational Physics Informed Neural Networks: the role of quadratures and test functions. (arXiv:2109.02035v2 [math.NA] UPDATED)
    In this work we analyze how quadrature rules of different precisions and piecewise polynomial test functions of different degrees affect the convergence rate of Variational Physics Informed Neural Networks (VPINN) with respect to mesh refinement, while solving elliptic boundary-value problems. Using a Petrov-Galerkin framework relying on an inf-sup condition, we derive an a priori error estimate in the energy norm between the exact solution and a suitable high-order piecewise interpolant of a computed neural network. Numerical experiments confirm the theoretical predictions and highlight the importance of the inf-sup condition. Our results suggest, somewhat counterintuitively, that for smooth solutions the best strategy to achieve a high decay rate of the error consists in choosing test functions of the lowest polynomial degree, while using quadrature formulas of suitably high precision.
    ECLAD: Extracting Concepts with Local Aggregated Descriptors. (arXiv:2206.04531v1 [cs.CV])
    Convolutional neural networks are being increasingly used in critical systems, where ensuring their robustness and alignment is crucial. In this context, the field of explainable artificial intelligence has proposed the generation of high-level explanations through concept extraction. These methods detect whether a concept is present in an image, but are incapable of locating where. What is more, a fair comparison of approaches is difficult, as proper validation procedures are missing. To fill these gaps, we propose a novel method for automatic concept extraction and localization based on representations obtained through the pixel-wise aggregations of activation maps of CNNs. Further, we introduce a process for the validation of concept-extraction techniques based on synthetic datasets with pixel-wise annotations of their main components, reducing human intervention. Through extensive experimentation on both synthetic and real-world datasets, our method achieves better performance in comparison to state-of-the-art alternatives.
    Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval. (arXiv:2201.12431v2 [cs.CL] UPDATED)
    Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over $k$NN-LM (Khandelwal et al., 2020) without hurting perplexity. Our code and trained models are available at https://github.com/neulab/retomaton .
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v1 [cs.LG])
    Injecting noise within gradient descent has several desirable features. In this paper, we explore noise injection before computing a gradient step, which is known to have smoothing and regularizing properties. We show that small perturbations induce explicit regularization for simple finite-dimensional models based on the l1-norm, group l1-norms, or nuclear norms. When applied to overparametrized neural networks with large widths, we show that the same perturbations do not work due to variance explosion resulting from overparametrization. However, we also show that independent layer-wise perturbations make it possible to avoid the exploding variance term, and explicit regularizers can then be obtained. We empirically show that the small perturbations lead to better generalization performance than vanilla (stochastic) gradient descent training, with minor adjustments to the training procedure.
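The mechanics of injecting noise before the gradient step can be sketched as follows. This is a toy least-squares example with isotropic Gaussian perturbations; the model, the function name, and all hyperparameters are assumptions made for illustration, and the sketch does not reproduce the regularizers or the layer-wise scheme studied in the paper.

```python
import numpy as np

def perturbed_gradient_descent(X, y, sigma=0.1, lr=0.1, steps=500, seed=0):
    """Gradient descent where the gradient is evaluated at a perturbed point.

    Each iteration draws fresh noise eps ~ N(0, sigma^2 I), computes the
    least-squares gradient at w + eps, and applies it to the clean iterate w.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        eps = sigma * rng.standard_normal(d)      # perturbation drawn before the step
        w_noisy = w + eps
        grad = X.T @ (X @ w_noisy - y) / n        # gradient at the perturbed weights
        w = w - lr * grad                         # update the unperturbed weights
    return w

# With sigma=0 this reduces to plain gradient descent.
w_plain = perturbed_gradient_descent(np.eye(2), np.array([1.0, 2.0]), sigma=0.0)
```

Setting `sigma=0.0` recovers vanilla gradient descent, which makes it easy to compare the two trajectories on the same data.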
    Probability flow solution of the Fokker-Planck equation. (arXiv:2206.04642v1 [cs.LG])
    The method of choice for integrating the time-dependent Fokker-Planck equation in high-dimension is to generate samples from the solution via integration of the associated stochastic differential equation. Here, we introduce an alternative scheme based on integrating an ordinary differential equation that describes the flow of probability. Unlike the stochastic dynamics, this equation deterministically pushes samples from the initial density onto samples from the solution at any later time. The method has the advantage of giving direct access to quantities that are challenging to estimate only given samples from the solution, such as the probability current, the density itself, and its entropy. The probability flow equation depends on the gradient of the logarithm of the solution (its "score"), and so is a-priori unknown. To resolve this dependence, we model the score with a deep neural network that is learned on-the-fly by propagating a set of particles according to the instantaneous probability current. Our approach is based on recent advances in score-based diffusion for generative modeling, with the important difference that the training procedure is self-contained and does not require samples from the target density to be available beforehand. To demonstrate the validity of the approach, we consider several examples from the physics of interacting particle systems; we find that the method scales well to high-dimensional systems, and accurately matches available analytical solutions and moments computed via Monte-Carlo.
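The relation between the Fokker-Planck equation and the probability flow can be written out in a simplified setting. The derivation below assumes a drift $b(x,t)$ and a constant scalar diffusion coefficient $D$; these symbols are chosen here for illustration and the paper's setting may be more general.

```latex
% Fokker-Planck equation with drift b and constant diffusion D (assumed):
\partial_t \rho = -\nabla \cdot (b\,\rho) + D\,\Delta \rho .
% Since \Delta\rho = \nabla\cdot(\rho\,\nabla\log\rho), this is a continuity equation
\partial_t \rho + \nabla \cdot (v\,\rho) = 0,
\qquad v(x,t) = b(x,t) - D\,\nabla \log \rho(x,t),
% so samples can be transported deterministically by the probability flow ODE
\frac{dx}{dt} = b(x,t) - D\,\nabla \log \rho(x,t).
```

The score $\nabla \log \rho$ is the unknown term in this velocity field, which is why it has to be modeled.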
    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. (arXiv:2206.04615v1 [cs.CL])
    Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
    Physics-aware Reduced-order Modeling of Transonic Flow via $\beta$-Variational Autoencoder. (arXiv:2205.00608v2 [physics.flu-dyn] UPDATED)
    Autoencoder-based reduced-order modeling (ROM) has recently attracted significant attention, owing to its ability to capture underlying nonlinear features. However, two critical drawbacks severely undermine its scalability to various physical applications: entangled and therefore uninterpretable latent variables (LVs) and the blindfold determination of latent space dimension. In this regard, this study proposes the physics-aware ROM using only interpretable and information-intensive LVs extracted by $\beta$-variational autoencoder, which are referred to as physics-aware LVs throughout this paper. To extract these LVs, their independence and information intensity are quantitatively scrutinized in a two-dimensional transonic flow benchmark problem. Then, the physical meanings of the physics-aware LVs are thoroughly investigated and we confirmed that with appropriate hyperparameter $\beta$, they actually correspond to the generating factors of the training dataset, Mach number and angle of attack. To the best of the authors' knowledge, our work is the first to practically confirm that $\beta$-variational autoencoder can automatically extract the physical generating factors in the field of applied physics. Finally, physics-aware ROM, which utilizes only physics-aware LVs, is compared with conventional ROMs, and its validity and efficiency are successfully verified.
    Transformer based Urdu Handwritten Text Optical Character Reader. (arXiv:2206.04575v1 [cs.CV])
    Extracting handwritten text is one of the most important components of digitizing information and making it available at large scale. Handwriting Optical Character Reader (OCR) is a research problem in computer vision and natural language processing. A lot of work has been done for English, but very little has been done for low-resource languages such as Urdu. Urdu script is very difficult because of its cursive nature and because characters change shape based on their relative position; a model is therefore needed that can understand complex features and generalize to every kind of handwriting style. In this work, we propose a transformer-based Urdu handwritten text extraction model. As transformers have been very successful in natural language understanding tasks, we explore them further to understand complex Urdu handwriting.
    RoMA: a Method for Neural Network Robustness Measurement and Assessment. (arXiv:2110.11088v4 [cs.LG] UPDATED)
    Neural network models have become the leading solution for a large variety of tasks, such as classification, language processing, protein folding, and others. However, their reliability is heavily plagued by adversarial inputs: small input perturbations that cause the model to produce erroneous outputs. Adversarial inputs can occur naturally when the system's environment behaves randomly, even in the absence of a malicious adversary, and are a severe cause for concern when attempting to deploy neural networks within critical systems. In this paper, we present a new statistical method, called Robustness Measurement and Assessment (RoMA), which can measure the expected robustness of a neural network model. Specifically, RoMA determines the probability that a random input perturbation might cause misclassification. The method allows us to provide formal guarantees regarding the expected frequency of errors that a trained model will encounter after deployment. Our approach can be applied to large-scale, black-box neural networks, which is a significant advantage compared to recently proposed verification methods. We apply our approach in two ways: comparing the robustness of different models, and measuring how a model's robustness is affected by the magnitude of input perturbation. One interesting insight obtained through this work is that, in a classification network, different output labels can exhibit very different robustness levels. We term this phenomenon categorial robustness. Our ability to perform risk and robustness assessments on a categorial basis opens the door to risk mitigation, which may prove to be a significant step towards neural network certification in safety-critical applications.
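The quantity described above, the probability that a random input perturbation causes misclassification, can be illustrated with a naive Monte Carlo estimate. This sketch is not the RoMA procedure and provides none of its statistical guarantees; the uniform perturbation model, the function names, and the toy classifier are assumptions.

```python
import numpy as np

def estimate_misclassification_rate(predict, x, label, epsilon, n_samples=10_000, seed=0):
    """Fraction of random perturbations of x that change the predicted label.

    predict: black-box function mapping a 1-D input array to a class label
    epsilon: half-width of the uniform perturbation applied to each coordinate
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Draw perturbations uniformly from the cube [-epsilon, epsilon]^d.
    noise = rng.uniform(-epsilon, epsilon, size=(n_samples, x.size))
    preds = np.array([predict(x + delta) for delta in noise])
    return float(np.mean(preds != label))

def toy_classifier(x):
    """Hypothetical black box: class 1 if the first coordinate is positive."""
    return int(x[0] > 0.0)

# An input far from the decision boundary versus one sitting on it.
rate_far = estimate_misclassification_rate(toy_classifier, [1.0, 0.0], 1, epsilon=0.5)
rate_near = estimate_misclassification_rate(toy_classifier, [0.0, 0.0], 1, epsilon=1.0)
```

Because the estimator only calls `predict`, it treats the model as a black box, which matches the setting the abstract describes.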
    A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data. (arXiv:2206.04356v1 [stat.ML])
    Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.
    BigVGAN: A Universal Neural Vocoder with Large-Scale Training. (arXiv:2206.04658v1 [cs.SD])
    Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on a mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting. We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality. Based on our improved generator and the state-of-the-art discriminators, we train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to such scale, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves the state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music and instrumental audio in unseen (even noisy) recording environments. We will release our code and model at: https://github.com/NVIDIA/BigVGAN
    Pragmatically Learning from Pedagogical Demonstrations in Multi-Goal Environments. (arXiv:2206.04546v1 [cs.LG])
    Learning from demonstration methods usually leverage close to optimal demonstrations to accelerate training. By contrast, when demonstrating a task, human teachers deviate from optimal demonstrations and pedagogically modify their behavior by giving demonstrations that best disambiguate the goal they want to demonstrate. Analogously, human learners excel at pragmatically inferring the intent of the teacher, facilitating communication between the two agents. These mechanisms are critical in the few demonstrations regime, where inferring the goal is more difficult. In this paper, we implement pedagogy and pragmatism mechanisms by leveraging a Bayesian model of goal inference from demonstrations. We highlight the benefits of this model in multi-goal teacher-learner setups with two artificial agents that learn with goal-conditioned Reinforcement Learning. We show that combining a pedagogical teacher and a pragmatic learner results in faster learning and reduced goal ambiguity over standard learning from demonstrations, especially in the few demonstrations regime.
    Simple lessons from complex learning: what a neural network model learns about cosmic structure formation. (arXiv:2206.04573v1 [astro-ph.CO])
    We train a neural network model to predict the full phase space evolution of cosmological N-body simulations. Its success implies that the neural network model is accurately approximating the Green's function expansion that relates the initial conditions of the simulations to its outcome at later times in the deeply nonlinear regime. We test the accuracy of this approximation by assessing its performance on well understood simple cases that have either known exact solutions or well understood expansions. These scenarios include spherical configurations, isolated plane waves, and two interacting plane waves: initial conditions that are very different from the Gaussian random fields used for training. We find our model generalizes well to these well understood scenarios, demonstrating that the networks have inferred general physical principles and learned the nonlinear mode couplings from the complex, random Gaussian training data. These tests also provide a useful diagnostic for finding the model's strengths and weaknesses, and identifying strategies for model improvement. We also test the model on initial conditions that contain only transverse modes, a family of modes that differ not only in their phases but also in their evolution from the longitudinal growing modes used in the training set. When the network encounters these initial conditions that are orthogonal to the training set, the model fails completely. In addition to these simple configurations, we evaluate the model's predictions for the density, displacement, and momentum power spectra with standard initial conditions for N-body simulations. We compare these summary statistics against N-body results and an approximate, fast simulation method called COLA. Our model achieves percent level accuracy at nonlinear scales of $k\sim 1\ \mathrm{Mpc}^{-1}\, h$, representing a significant improvement over COLA.
    Bounding Training Data Reconstruction in Private (Deep) Learning. (arXiv:2201.12383v3 [cs.LG] UPDATED)
    Differential privacy is widely accepted as the de facto method for preventing data leakage in ML, and conventional wisdom suggests that it offers strong protection against privacy attacks. However, existing semantic guarantees for DP focus on membership inference, which may overestimate the adversary's capabilities and is not applicable when membership status itself is non-sensitive. In this paper, we derive the first semantic guarantees for DP mechanisms against training data reconstruction attacks under a formal threat model. We show that two distinct privacy accounting methods -- Renyi differential privacy and Fisher information leakage -- both offer strong semantic protection against data reconstruction attacks.
    Network insensitivity to parameter noise via adversarial regularization. (arXiv:2106.05009v3 [cs.LG] UPDATED)
    Neuromorphic neural network processors, in the form of compute-in-memory crossbar arrays of memristors, or in the form of subthreshold analog and mixed-signal ASICs, promise enormous advantages in compute density and energy efficiency for NN-based ML tasks. However, these technologies are prone to computational non-idealities, due to process variation and intrinsic device physics. This degrades the task performance of networks deployed to the processor, by introducing parameter noise into the deployed model. While it is possible to calibrate each device, or train networks individually for each processor, these approaches are expensive and impractical for commercial deployment. Alternative methods are therefore needed to train networks that are inherently robust against parameter variation, as a consequence of network architecture and parameters. We present a new adversarial network optimisation algorithm that attacks network parameters during training, and promotes robust performance during inference in the face of parameter variation. Our approach introduces a regularization term penalising the susceptibility of a network to weight perturbation. We compare against previous approaches for producing parameter insensitivity such as dropout, weight smoothing and introducing parameter noise during training. We show that our approach produces models that are more robust to targeted parameter variation, and equally robust to random parameter variation. Our approach finds minima in flatter locations in the weight-loss landscape compared with other approaches, highlighting that the networks found by our technique are less sensitive to parameter perturbation. Our work provides an approach to deploy neural network architectures to inference devices that suffer from computational non-idealities, with minimal loss of performance. ...
    The CLEAR Benchmark: Continual LEArning on Real-World Imagery. (arXiv:2201.06289v3 [cs.CV] UPDATED)
    Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI. However, existing CL benchmarks, e.g. Permuted-MNIST and Split-CIFAR, make use of artificial temporal variation and do not align with or generalize to the real world. In this paper, we introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2004-2014). We build CLEAR from existing large-scale image collections (YFCC100M) through a novel and scalable low-cost approach to visio-linguistic dataset curation. Our pipeline makes use of pretrained vision-language models (e.g. CLIP) to interactively build labeled datasets, which are further validated with crowd-sourcing to remove errors and even inappropriate images (hidden in original YFCC100M). The major strength of CLEAR over prior CL benchmarks is the smooth temporal evolution of visual concepts with real-world imagery, including both high-quality labeled data along with abundant unlabeled samples per time period for continual semi-supervised learning. We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms that only utilize fully-supervised data. Our analysis also reveals that mainstream CL evaluation protocols that train and test on iid data artificially inflate the performance of CL systems. To address this, we propose novel "streaming" protocols for CL that always test on the (near) future. Interestingly, streaming protocols (a) can simplify dataset curation since today's testset can be repurposed for tomorrow's trainset and (b) can produce more generalizable models with more accurate estimates of performance since all labeled data from each time-period is used for both training and testing (unlike classic iid train-test splits).
    An FPGA-based Solution for Convolution Operation Acceleration. (arXiv:2206.04520v1 [cs.AR])
    Hardware-based acceleration is an extensive attempt to facilitate many computationally-intensive mathematics operations. This paper proposes an FPGA-based architecture to accelerate the convolution operation - a complex and expensive computing step that appears in many Convolutional Neural Network models. We target the design to the standard convolution operation, intending to launch the product as an edge-AI solution. The project's purpose is to produce an FPGA IP core that can process a convolutional layer at a time. System developers can deploy the IP core with various FPGA families by using Verilog HDL as the primary design language for the architecture. The experimental results show that our single computing core synthesized on a simple edge computing FPGA board can offer 0.224 GOPS. When the board is fully utilized, 4.48 GOPS can be achieved.
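    For reference, the operation being accelerated can be written as a plain software loop. The sketch below is a minimal single-channel version with no padding and stride 1; it is not the paper's Verilog design, and the names `conv2d_valid` and `mac_count` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Sliding-window multiply-accumulate with no padding and stride 1.

    As in most CNN frameworks, the kernel is not flipped, so strictly
    speaking this computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            out[i][j] = acc
    return out


def mac_count(image_h: int, image_w: int, kh: int, kw: int) -> int:
    """Number of multiply-accumulate operations performed by conv2d_valid."""
    return (image_h - kh + 1) * (image_w - kw + 1) * kh * kw


if __name__ == "__main__":
    img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    ker = [[1.0, 0.0], [0.0, -1.0]]
    print(conv2d_valid(img, ker))
    print(mac_count(3, 3, 2, 2))
```

    The four nested loops show why the operation is costly: the work grows with the product of the output size and the kernel size, and that inner multiply-accumulate is what a hardware core parallelizes.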
    Field Level Neural Network Emulator for Cosmological N-body Simulations. (arXiv:2206.04594v1 [astro-ph.CO])
    We build a field level emulator for cosmic structure formation that is accurate in the nonlinear regime. Our emulator consists of two convolutional neural networks trained to output the nonlinear displacements and velocities of N-body simulation particles based on their linear inputs. Cosmology dependence is encoded in the form of style parameters at each layer of the neural network, enabling the emulator to effectively interpolate the outcomes of structure formation between different flat $\Lambda$CDM cosmologies over a wide range of background matter densities. The neural network architecture makes the model differentiable by construction, providing a powerful tool for fast field level inference. We test the accuracy of our method by considering several summary statistics, including the density power spectrum with and without redshift space distortions, the displacement power spectrum, the momentum power spectrum, the density bispectrum, halo abundances, and halo profiles with and without redshift space distortions. We compare these statistics from our emulator with the full N-body results, the COLA method, and a fiducial neural network with no cosmological dependence. We find our emulator gives accurate results down to scales of $k \sim 1\ \mathrm{Mpc}^{-1}\, h$, representing a considerable improvement over both COLA and the fiducial neural network. We also demonstrate that our emulator generalizes well to initial conditions containing primordial non-Gaussianity, without the need for any additional style parameters or retraining.
    Spatial Entropy Regularization for Vision Transformers. (arXiv:2206.04636v1 [cs.CV])
    Recent work has shown that the attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, this way including a self-supervised pretext task into the standard supervised learning. In more detail, we propose a VT regularization method based on a spatial formulation of the information entropy. By minimizing the proposed spatial entropy, we explicitly ask the VT to produce spatially ordered attention maps, this way including an object-based prior during training. Using extensive experiments, we show that the proposed regularization approach is beneficial with different training scenarios, datasets, downstream tasks and VT architectures. The code will be available upon acceptance.
    Push--Pull with Device Sampling. (arXiv:2206.04113v1 [math.OC])
    We consider decentralized optimization problems in which a number of agents collaborate to minimize the average of their local functions by exchanging over an underlying communication graph. Specifically, we place ourselves in an asynchronous model where only a random portion of nodes perform computation at each iteration, while the information exchange can be conducted between all the nodes and in an asymmetric fashion. For this setting, we propose an algorithm that combines gradient tracking and variance reduction over the entire network. This enables each node to track the average of the gradients of the objective functions. Our theoretical analysis shows that the algorithm converges linearly, when the local objective functions are strongly convex, under mild connectivity conditions on the expected mixing matrices. In particular, our result does not require the mixing matrices to be doubly stochastic. In the experiments, we investigate a broadcast mechanism that transmits information from computing nodes to their neighbors, and confirm the linear convergence of our method on both synthetic and real-world datasets.
    Depression Recognition using Remote Photoplethysmography from Facial Videos. (arXiv:2206.04399v1 [cs.CV])
    Depression is a mental illness that may be harmful to an individual's health. The detection of mental health disorders in the early stages and a precise diagnosis are critical to avoid social, physiological, or psychological side effects. This work analyzes physiological signals to observe if different depressive states have a noticeable impact on the blood volume pulse (BVP) and the heart rate variability (HRV) response. Although typically, HRV features are calculated from biosignals obtained with contact-based sensors such as wearables, we propose instead a novel scheme that directly extracts them from facial videos, just based on visual information, removing the need for any contact-based device. Our solution is based on a pipeline that is able to extract complete remote photoplethysmography signals (rPPG) in a fully unsupervised manner. We use these rPPG signals to calculate over 60 statistical, geometrical, and physiological features that are further used to train several machine learning regressors to recognize different levels of depression. Experiments on two benchmark datasets indicate that this approach offers comparable results to other audiovisual modalities based on voice or facial expression, potentially complementing them. In addition, the results achieved for the proposed method show promising and solid performance that outperforms hand-engineered methods and is comparable to deep learning-based approaches.
    Explaining Clinical Decision Support Systems in Medical Imaging using Cycle-Consistent Activation Maximization. (arXiv:2010.05759v3 [eess.IV] UPDATED)
    Clinical decision support using deep neural networks has become a topic of steadily growing interest. While recent work has repeatedly demonstrated that deep learning offers major advantages for medical image classification over traditional methods, clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered to be opaque and difficult to comprehend. In recent years, this has been addressed by a variety of approaches that have successfully contributed to providing deeper insight. Most notably, additive feature attribution methods are able to propagate decisions back into the input space by creating a saliency map which allows the practitioner to "see what the network sees." However, the quality of the generated maps can become poor and the images noisy if only limited data is available - a typical scenario in clinical contexts. We propose a novel decision explanation scheme based on CycleGAN activation maximization which generates high-quality visualizations of classifier decisions even in smaller data sets. We conducted a user study in which we evaluated our method on the LIDC dataset for lung lesion malignancy classification, the BreastMNIST dataset for ultrasound image breast cancer detection, as well as two subsets of the CIFAR-10 dataset for RGB image object recognition. Within this user study, our method clearly outperformed existing approaches on the medical imaging datasets and ranked second in the natural image setting. With our approach we make a significant contribution towards a better understanding of clinical decision support systems based on deep neural networks and thus aim to foster overall clinical acceptance.
    Alternating Mirror Descent for Constrained Min-Max Games. (arXiv:2206.04160v1 [cs.GT])
    In this paper we study two-player bilinear zero-sum games with constrained strategy spaces. An instance of natural occurrences of such constraints is when mixed strategies are used, which correspond to a probability simplex constraint. We propose and analyze the alternating mirror descent algorithm, in which each player takes turns to take action following the mirror descent algorithm for constrained optimization. We interpret alternating mirror descent as an alternating discretization of a skew-gradient flow in the dual space, and use tools from convex optimization and modified energy function to establish an $O(K^{-2/3})$ bound on its average regret after $K$ iterations. This quantitatively verifies the algorithm's better behavior than the simultaneous version of mirror descent algorithm, which is known to diverge and yields an $O(K^{-1/2})$ average regret bound. In the special case of an unconstrained setting, our results recover the behavior of alternating gradient descent algorithm for zero-sum games which was studied in (Bailey et al., COLT 2020).
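    A minimal sketch of the alternating scheme on probability simplices is given below. It assumes the negative-entropy mirror map, which turns each step into a multiplicative-weights update; the abstract does not fix a mirror map. The function names, step size, and the matching-pennies example are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def entropic_step(p: Vector, grad: Vector, eta: float) -> Vector:
    """One mirror descent step on the simplex with the negative-entropy map."""
    weights = [pi * math.exp(-eta * gi) for pi, gi in zip(p, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def alternating_mirror_descent(
    A: Matrix, x0: Vector, y0: Vector, eta: float, num_iters: int
) -> Tuple[Vector, Vector]:
    """Return time-averaged strategies for min_x max_y x^T A y.

    The players alternate: x moves against the current y, then y moves
    against the already updated x.
    """
    x, y = list(x0), list(y0)
    x_sum = [0.0] * len(x)
    y_sum = [0.0] * len(y)
    for _ in range(num_iters):
        # Minimizing player: gradient of x^T A y with respect to x is A y.
        grad_x = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
        x = entropic_step(x, grad_x, eta)
        # Maximizing player: gradient with respect to y is A^T x, using the new x.
        grad_y = [sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
        y = entropic_step(y, [-g for g in grad_y], eta)  # ascent = descent on -grad
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
        y_sum = [s + yi for s, yi in zip(y_sum, y)]
    return [s / num_iters for s in x_sum], [s / num_iters for s in y_sum]


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
    x_avg, y_avg = alternating_mirror_descent(
        matching_pennies, [0.8, 0.2], [0.3, 0.7], eta=0.1, num_iters=2000
    )
    print(x_avg, y_avg)
```

    The only difference from the simultaneous variant is that `grad_y` is computed from the freshly updated `x` rather than from the previous iterate. On matching pennies the averaged strategies should approach the uniform equilibrium.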
    Unlearning Protected User Attributes in Recommendations with Adversarial Training. (arXiv:2206.04500v1 [cs.IR])
    Collaborative filtering algorithms capture underlying consumption patterns, including the ones specific to particular demographics or protected information of users, e.g. gender, race, and location. These encoded biases can influence the decision of a recommendation system (RS) towards further separation of the contents provided to various demographic subgroups, and raise privacy concerns regarding the disclosure of users' protected attributes. In this work, we investigate the possibility and challenges of removing specific protected information of users from the learned interaction representations of a RS algorithm, while maintaining its effectiveness. Specifically, we incorporate adversarial training into the state-of-the-art MultVAE architecture, resulting in a novel model, Adversarial Variational Auto-Encoder with Multinomial Likelihood (Adv-MultVAE), which aims at removing the implicit information of protected attributes while preserving recommendation performance. We conduct experiments on the MovieLens-1M and LFM-2b-DemoBias datasets, and evaluate the effectiveness of the bias mitigation method based on the inability of external attackers in revealing the users' gender information from the model. Comparing with baseline MultVAE, the results show that Adv-MultVAE, with marginal deterioration in performance (w.r.t. NDCG and recall), largely mitigates inherent biases in the model on both datasets.
    ScatterSample: Diversified Label Sampling for Data Efficient Graph Neural Network Learning. (arXiv:2206.04255v1 [cs.LG])
    What target labels are most effective for graph neural network (GNN) training? In some applications where GNNs excel, such as drug design or fraud detection, labeling new instances is expensive. We develop a data-efficient active sampling framework, ScatterSample, to train GNNs under an active learning setting. ScatterSample employs a sampling module termed DiverseUncertainty to collect instances with large uncertainty from different regions of the sample space for labeling. To ensure diversification of the selected nodes, DiverseUncertainty clusters the high uncertainty nodes and selects the representative nodes from each cluster. Our ScatterSample algorithm is further supported by rigorous theoretical analysis demonstrating its advantage compared to standard active sampling methods that aim to simply maximize the uncertainty and not diversify the samples. In particular, we show that ScatterSample is able to efficiently reduce the model uncertainty over the whole sample space. Our experiments on five datasets show that ScatterSample significantly outperforms the other GNN active learning baselines; specifically, it reduces the sampling cost by up to 50% while achieving the same test accuracy.
    Data-Efficient Brain Connectome Analysis via Multi-Task Meta-Learning. (arXiv:2206.04486v1 [cs.LG])
    Brain networks characterize complex connectivities among brain regions as graph structures, which provide a powerful means to study brain connectomes. In recent years, graph neural networks have emerged as a prevalent paradigm of learning with structured data. However, most brain network datasets are limited in sample sizes due to the relatively high cost of data acquisition, which prevents deep learning models from being trained sufficiently. Inspired by meta-learning that learns new concepts fast with limited training examples, this paper studies data-efficient training strategies for analyzing brain connectomes in a cross-dataset setting. Specifically, we propose to meta-train the model on datasets of large sample sizes and transfer the knowledge to small datasets. In addition, we also explore two brain-network-oriented designs, including atlas transformation and adaptive task reweighing. Compared to other pre-training strategies, our meta-learning-based approach achieves higher and more stable performance, which demonstrates the effectiveness of our proposed solutions. The framework is also able to derive new insights regarding the similarities among datasets and diseases in a data-driven fashion.
    Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis. (arXiv:2206.04281v1 [cs.CV])
    Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. image acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. Further, existing self-supervised methods for medically-relevant image-to-image architectures exploit only spatial or temporal self-similarity and only do so via a loss applied at a single image-scale, with naive multi-scale spatiotemporal extensions collapsing to degenerate solutions. To these ends, this paper makes two contributions: (1) It presents a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images. It exploits the spatiotemporal self-similarity of learned multi-scale intra-subject features for pretraining and develops several feature-wise regularizations that avoid collapsed identity representations; (2) During finetuning, it proposes a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one-shot segmentation setting, the proposed framework outperforms both well-tuned randomly-initialized baselines and current self-supervised techniques designed for both i.i.d. and longitudinal datasets. These improvements are demonstrated across both longitudinal neurodegenerative adult MRI and developing infant brain MRI and yield both higher performance and longitudinal consistency.
    GSmooth: Certified Robustness against Semantic Transformations via Generalized Randomized Smoothing. (arXiv:2206.04310v1 [cs.LG])
    Certified defenses such as randomized smoothing have shown promise towards building reliable machine learning systems against $\ell_p$-norm bounded attacks. However, existing methods are insufficient or unable to provably defend against semantic transformations, especially those without closed-form expressions (such as defocus blur and pixelate), which are more common in practice and often unrestricted. To fill up this gap, we propose generalized randomized smoothing (GSmooth), a unified theoretical framework for certifying robustness against general semantic transformations via a novel dimension augmentation strategy. Under the GSmooth framework, we present a scalable algorithm that uses a surrogate image-to-image network to approximate the complex transformation. The surrogate model provides a powerful tool for studying the properties of semantic transformations and certifying robustness. Experimental results on several datasets demonstrate the effectiveness of our approach for robustness certification against multiple kinds of semantic transformations and corruptions, which is not achievable by the alternative baselines.
    Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. (arXiv:2206.04119v1 [q-bio.BM])
    Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.
    Unsupervised Dictionary Learning for Anomaly Detection. (arXiv:2003.00293v2 [cs.LG] CROSS LISTED)
    We investigate the possibilities of employing dictionary learning to address the requirements of most anomaly detection applications, such as absence of supervision, online formulations, and low false positive rates. We present new results of our recent semi-supervised online algorithm, TODDLeR, on an anti-money laundering application. We also introduce a novel unsupervised method of using the performance of the learning algorithm as an indication of the nature of the samples.
    Wireless for Machine Learning. (arXiv:2008.13492v3 [eess.SP] UPDATED)
    As data generation increasingly takes place on devices without a wired connection, machine learning (ML) related traffic will be ubiquitous in wireless networks. Many studies have shown that traditional wireless protocols are highly inefficient or unsustainable to support ML, which creates the need for new wireless communication methods. In this survey, we give an exhaustive review of the state-of-the-art wireless methods that are specifically designed to support ML services over distributed datasets. Currently, there are two clear themes within the literature, analog over-the-air computation and digital radio resource management optimized for ML. This survey gives a comprehensive introduction to these methods, reviews the most important works, highlights open problems, and discusses application scenarios.
    What is a Good Metric to Study Generalization of Minimax Learners? (arXiv:2206.04502v1 [stat.ML])
    Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in minimax settings, their generalization guarantees in the stochastic setting, i.e., how the solution trained on empirical data performs on the unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization, fails in simple examples of minimax problems. Furthermore, another popular metric, the primal-dual risk, also fails to characterize the generalization behavior for minimax problems with nonconvexity, due to non-existence of saddle points. We thus propose a new metric to study generalization of minimax learners: the primal gap, to circumvent these issues. Next, we derive generalization bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization bounds for primal risk and primal-dual risk in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms -- gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization.
    Uncovering bias in the PlantVillage dataset. (arXiv:2206.04374v1 [cs.CV])
    We report our investigation on the use of the popular PlantVillage dataset for training deep learning based plant disease detection models. We trained a machine learning model using only 8 pixels from the PlantVillage image backgrounds. The model achieved 49.0% accuracy on the held-out test set, well above the random guessing accuracy of 2.6%. This result indicates that the PlantVillage dataset contains noise correlated with the labels and deep learning models can easily exploit this bias to make predictions. Possible approaches to alleviate this problem are discussed.
    Boosting Fast Adversarial Training with Learnable Adversarial Initialization. (arXiv:2110.05007v2 [cs.CV] UPDATED)
    Adversarial training (AT) has been demonstrated to be effective in improving model robustness by leveraging adversarial examples for training. However, most AT methods face expensive time and computational costs for calculating gradients at multiple steps in generating adversarial examples. To boost training efficiency, the fast gradient sign method (FGSM) is adopted in fast AT methods by calculating the gradient only once. Unfortunately, the robustness is far from satisfactory. One reason may arise from the initialization fashion. Existing fast AT generally uses a random sample-agnostic initialization, which facilitates the efficiency yet hinders a further robustness improvement. Up to now, the initialization in fast AT is still not extensively explored. In this paper, we boost fast AT with a sample-dependent adversarial initialization, i.e., an output from a generative network conditioned on a benign image and its gradient information from the target network. As the generative network and the target network are optimized jointly in the training phase, the former can adaptively generate an effective initialization with respect to the latter, which motivates gradually improved robustness. Experimental evaluations on four benchmark databases demonstrate the superiority of our proposed method over state-of-the-art fast AT methods, as well as comparable robustness to advanced multi-step AT methods. The code is released at https://github.com/jiaxiaojunQAQ/FGSM-SDI.
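    The single-gradient step that fast AT builds on can be sketched on a toy model. The example below applies plain FGSM to a logistic-regression classifier with a hand-derived input gradient; it illustrates only the baseline attack, not the learnable initialization proposed in the paper. The model, weights, and function names are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w: List[float], b: float, x: List[float], y: int) -> float:
    """Binary cross-entropy of a linear classifier on one example (y in {0, 1})."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def sign(v: float) -> float:
    return (v > 0) - (v < 0)


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One-step attack: x_adv = x + eps * sign(d loss / d x).

    For logistic regression the input gradient is (p - y) * w, so a single
    gradient evaluation suffices.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad_x)]


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.1
    x, y = [0.5, 0.3], 1
    x_adv = fgsm(w, b, x, y, eps=0.1)
    print("clean loss:", logistic_loss(w, b, x, y))
    print("adversarial loss:", logistic_loss(w, b, x_adv, y))
```

    In fast AT the perturbation usually starts from some initialization before this step is taken; the abstract's point is that this starting point is typically random and sample-agnostic.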
    Multi-Mask Self-Supervised Learning for Physics-Guided Neural Networks in Highly Accelerated MRI. (arXiv:2008.06029v2 [eess.IV] UPDATED)
    Self-supervised learning has shown great promise due to its capability to train deep learning MRI reconstruction methods without fully-sampled data. Current self-supervised learning methods for physics-guided reconstruction networks split acquired undersampled data into two disjoint sets, where one is used for data consistency (DC) in the unrolled network and the other to define the training loss. In this study, we propose an improved self-supervised learning strategy that more efficiently uses the acquired data to train a physics-guided reconstruction network without a database of fully-sampled data. The proposed multi-mask self-supervised learning via data undersampling (SSDU) applies a hold-out masking operation on acquired measurements to split it into multiple pairs of disjoint sets for each training sample, while using one of these pairs for DC units and the other for defining loss, thereby more efficiently using the undersampled data. Multi-mask SSDU is applied on fully-sampled 3D knee and prospectively undersampled 3D brain MRI datasets, for various acceleration rates and patterns, and compared to CG-SENSE and single-mask SSDU DL-MRI, as well as supervised DL-MRI when fully-sampled data is available. Results on knee MRI show that the proposed multi-mask SSDU outperforms SSDU and performs closely with supervised DL-MRI. A clinical reader study further ranks the multi-mask SSDU higher than supervised DL-MRI in terms of SNR and aliasing artifacts. Results on brain MRI show that multi-mask SSDU achieves better reconstruction quality compared to SSDU. Reader study demonstrates that multi-mask SSDU at R=8 significantly improves reconstruction compared to single-mask SSDU at R=8, as well as CG-SENSE at R=2.
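    The hold-out masking step can be sketched as repeatedly partitioning the acquired measurement locations into a data-consistency set and a loss set. The sketch assumes uniform random selection and a fixed loss fraction; the abstract specifies neither, and the function name `multi_mask_split` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Set, Tuple


def multi_mask_split(
    acquired: Iterable[int], num_masks: int, loss_fraction: float, seed: int = 0
) -> List[Tuple[Set[int], Set[int]]]:
    """Split acquired sample indices into several (dc_set, loss_set) pairs.

    Within each pair the two sets are disjoint and together cover all acquired
    indices; different pairs use different random hold-out masks.
    """
    rng = random.Random(seed)
    indices = list(acquired)
    n_loss = max(1, int(round(loss_fraction * len(indices))))
    pairs = []
    for _ in range(num_masks):
        loss_set = set(rng.sample(indices, n_loss))  # held out to define the loss
        dc_set = set(indices) - loss_set             # used for data consistency
        pairs.append((dc_set, loss_set))
    return pairs


if __name__ == "__main__":
    for dc_set, loss_set in multi_mask_split(range(20), num_masks=3, loss_fraction=0.4):
        print(sorted(dc_set), sorted(loss_set))
```

    With a single mask this reduces to the two-set split described for earlier self-supervised methods; using several masks per training sample lets every acquired point contribute to both roles across pairs.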
    Community-Level Anomaly Detection for Anti-Money Laundering. (arXiv:1910.11313v1 [cs.LG] CROSS LISTED)
    Anomaly detection in networks often boils down to identifying an underlying graph structure on which the abnormal occurrence rests. Financial fraud schemes are one such example, where more or less intricate schemes are employed in order to elude transaction security protocols. We investigate the problem of learning graph structure representations using adaptations of dictionary learning aimed at encoding connectivity patterns. In particular, we adapt dictionary learning strategies to the specificity of network topologies and propose new methods that impose Laplacian structure on the dictionaries themselves. In one adaptation we focus on classifying topologies by working directly on the graph Laplacian and cast the learning problem to accommodate its 2D structure. We tackle the same problem by learning dictionaries which consist of vectorized atomic Laplacians, and provide a block coordinate descent scheme to solve the new dictionary learning formulation. Imposing Laplacian structure on the dictionaries is also proposed in an adaptation of the Single Block Orthogonal learning method. Results on synthetic graph datasets comprising different graph topologies confirm the potential of dictionaries to directly represent graph structure information.
    TAG: Toward Accurate Social Media Content Tagging with a Concept Graph. (arXiv:2110.06892v3 [cs.LG] UPDATED)
    Although conceptualization has been widely studied in semantics and knowledge representation, it is still challenging to find the most accurate concept phrases to characterize the main idea of a text snippet on the fast-growing social media. This is partly attributed to the fact that most knowledge bases contain general terms of the world, such as trees and cars, which do not have the defining power or are not interesting enough to social media app users. Another reason is that the intricacy of natural language allows the use of tense, negation and grammar to change the logic or emphasis of language, thus conveying completely different meanings. In this paper, we present TAG, a high-quality concept matching dataset consisting of 10,000 labeled pairs of fine-grained concepts and web-styled natural language sentences, mined from the open-domain social media. The concepts we consider represent the trending interests of online users. Associated with TAG is a concept graph of these fine-grained concepts and entities to provide the structural context information. We evaluate a wide range of popular neural text matching models as well as pre-trained language models on TAG, and point out their insufficiency to tag social media content with the most appropriate concept. We further propose a novel graph-graph matching method that demonstrates superior abstraction and generalization performance by better utilizing both the structural context in the concept graph and logic interactions between semantic units in the sentence via syntactic dependency parsing. We open-source both the TAG dataset and the proposed methods to facilitate further research.
    Privacy Leakage in Text Classification: A Data Extraction Approach. (arXiv:2206.04591v1 [cs.CL])
    Recent work has demonstrated the successful extraction of training data from generative language models. However, it is not evident whether such extraction is feasible in text classification models since the training objective is to predict the class label as opposed to next-word prediction. This poses an interesting challenge and raises an important question regarding the privacy of training data in text classification settings. Therefore, we study the potential privacy leakage in the text classification domain by investigating the problem of unintended memorization of training data that is not pertinent to the learning task. We propose an algorithm to extract missing tokens of a partial text by exploiting the likelihood of the class label provided by the model. We test the effectiveness of our algorithm by inserting canaries into the training set and attempting to extract tokens in these canaries post-training. In our experiments, we demonstrate that successful extraction is possible to some extent. This can also be used as an auditing strategy to assess any potential unauthorized use of personal data without consent.
    ADG-Pose: Automated Dataset Generation for Real-World Human Pose Estimation. (arXiv:2202.00753v2 [cs.CV] UPDATED)
    Recent advancements in computer vision have seen a rise in the prominence of applications using neural networks to understand human poses. However, while accuracy has been steadily increasing on State-of-the-Art datasets, these datasets often do not address the challenges seen in real-world applications. These challenges are dealing with people distant from the camera, people in crowds, and heavily occluded people. As a result, many real-world applications have trained on data that does not reflect the data present in deployment, leading to significant underperformance. This article presents ADG-Pose, a method for automatically generating datasets for real-world human pose estimation. These datasets can be customized to determine person distances, crowdedness, and occlusion distributions. Models trained with our method are able to perform in the presence of these challenges where those trained on other datasets fail. Using ADG-Pose, end-to-end accuracy for real-world skeleton-based action recognition sees a 20% increase on scenes with moderate distance and occlusion levels, and a 4X increase on distant scenes where other models failed to perform better than random.
    SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization. (arXiv:2205.07547v2 [cs.LG] UPDATED)
    One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage of the training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.
    Graph Attention Multi-Layer Perceptron. (arXiv:2206.04355v1 [cs.LG])
    Graph neural networks (GNNs) have achieved great success in many graph-based applications. However, the enormous size and high sparsity level of graphs hinder their applications under industrial scenarios. Although some scalable GNNs are proposed for large-scale graphs, they adopt a fixed $K$-hop neighborhood for each node, thus facing the over-smoothing issue when adopting large propagation depths for nodes within sparse regions. To tackle the above issue, we propose a new GNN architecture -- Graph Attention Multi-Layer Perceptron (GAMLP), which can capture the underlying correlations between different scales of graph knowledge. We have deployed GAMLP in Tencent with the Angel platform, and we further evaluate GAMLP on both real-world datasets and large-scale industrial datasets. Extensive experiments on these 14 graph datasets demonstrate that GAMLP achieves state-of-the-art performance while enjoying high scalability and efficiency. Specifically, it outperforms GAT by 1.3\% regarding predictive accuracy on our large-scale Tencent Video dataset while achieving up to $50\times$ training speedup. Besides, it ranks top-1 on both the leaderboards of the largest homogeneous and heterogeneous graph (i.e., ogbn-papers100M and ogbn-mag) of Open Graph Benchmark.
    Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent. (arXiv:2002.04861v3 [stat.ML] UPDATED)
    We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.
    Generalization and Robustness Implications in Object-Centric Learning. (arXiv:2107.00637v3 [cs.LG] UPDATED)
    The idea behind object-centric representation learning is that natural scenes can better be modeled as compositions of objects and their relations as opposed to distributed representations. This inductive bias can be injected into neural networks to potentially improve systematic generalization and performance of downstream tasks in scenes with multiple objects. In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets and evaluate segmentation metrics and downstream object property prediction. In addition, we study generalization and robustness by investigating the settings where either a single object is out of distribution -- e.g., having an unseen color, texture, or shape -- or global properties of the scene are altered -- e.g., by occlusions, cropping, or increasing the number of objects. From our experimental study, we find object-centric representations to be useful for downstream tasks and generally robust to most distribution shifts affecting objects. However, when the distribution shift affects the input in a less structured manner, robustness in terms of segmentation and downstream task performance may vary significantly across models and distribution shifts.
    It's a super deal -- train recurrent network on noisy data and get smooth prediction free. (arXiv:2206.04215v1 [cs.LG])
    Recent research demonstrates that time series prediction by predictive recurrent neural networks based on noisy input generates a smooth anticipated trajectory. We examine the influence of the noise component in both the training data sets and the input sequences on network prediction quality. We propose and discuss an explanation of the observed noise compression in the predictive process. We also discuss the importance of this property of recurrent networks in the neuroscience context for the evolution of living organisms.
    Gradient Obfuscation Gives a False Sense of Security in Federated Learning. (arXiv:2206.04055v1 [cs.CR])
    Federated learning has been proposed as a privacy-preserving machine learning framework that enables multiple clients to collaborate without sharing raw data. However, client privacy protection is not guaranteed by design in this framework. Prior work has shown that the gradient sharing strategies in federated learning can be vulnerable to data reconstruction attacks. In practice, though, clients may not transmit raw gradients considering the high communication cost or due to privacy enhancement requirements. Empirical studies have demonstrated that gradient obfuscation, including intentional obfuscation via gradient noise injection and unintentional obfuscation via gradient compression, can provide more privacy protection against reconstruction attacks. In this work, we present a new data reconstruction attack framework targeting the image classification task in federated learning. We show that commonly adopted gradient postprocessing procedures, such as gradient quantization, gradient sparsification, and gradient perturbation, may give a false sense of security in federated learning. Contrary to prior studies, we argue that privacy enhancement should not be treated as a byproduct of gradient compression. Additionally, we design a new method under the proposed framework to reconstruct the image at the semantic level. We quantify the semantic privacy leakage and compare it with conventional measures based on image similarity scores. Our comparisons challenge the image data leakage evaluation schemes in the literature. The results emphasize the importance of revisiting and redesigning the privacy protection mechanisms for client data in existing federated learning algorithms.
    Receding Horizon Inverse Reinforcement Learning. (arXiv:2206.04477v1 [cs.LG])
    Inverse reinforcement learning (IRL) seeks to infer a cost function that explains the underlying goals and preferences of expert demonstrations. This paper presents receding horizon inverse reinforcement learning (RHIRL), a new IRL algorithm for high-dimensional, noisy, continuous systems with black-box dynamic models. RHIRL addresses two key challenges of IRL: scalability and robustness. To handle high-dimensional continuous systems, RHIRL matches the induced optimal trajectories with expert demonstrations locally in a receding horizon manner and 'stitches' together the local solutions to learn the cost; it thereby avoids the 'curse of dimensionality'. This contrasts sharply with earlier algorithms that match with expert demonstrations globally over the entire high-dimensional state space. To be robust against imperfect expert demonstrations and system control noise, RHIRL learns a state-dependent cost function 'disentangled' from system dynamics under mild conditions. Experiments on benchmark tasks show that RHIRL outperforms several leading IRL algorithms in most instances. We also prove that the cumulative error of RHIRL grows linearly with the task duration.
    Enhancement of Healthcare Data Transmission using the Levenberg-Marquardt Algorithm. (arXiv:2206.04240v1 [cs.LG])
    In the healthcare system, patients are required to use wearable devices for remote data collection and real-time monitoring of health data and health conditions. This adoption of wearables significantly increases the volume of data that is collected and transmitted. Because the devices run on small batteries, their power can be quickly depleted by the high processing requirements of data collection and transmission. Given the importance attached to medical data, it is imperative that all transmitted data adhere to strict integrity and availability requirements. Reducing the volume of healthcare data and the frequency of transmission by using an inference algorithm will improve device battery life. The difficulty is that the two transmission metrics, accuracy and efficiency, trade off against each other: increasing accuracy reduces efficiency. This paper demonstrates that machine learning can be used to analyze complex health data metrics such as the accuracy and efficiency of data transmission, and that the Levenberg-Marquardt algorithm (LMA) can overcome the trade-off by transmitting fewer samples while maintaining accuracy. The algorithm is tested with a standard heart rate dataset to compare the metrics. The results show that the LMA performed best, with an efficiency of 3.33 times for reduced sample data size and an accuracy of 79.17%; accuracy was similar across the 7 sampling cases adopted for testing, while efficiency improved. The proposed methods significantly improved both metrics using machine learning, without sacrificing one metric for the other, compared to the existing methods with high efficiency.
    An Optimization Method-Assisted Ensemble Deep Reinforcement Learning Algorithm to Solve Unit Commitment Problems. (arXiv:2206.04249v1 [eess.SY])
    Unit commitment (UC) is a fundamental problem in the day-ahead electricity market, and it is critical to solve UC problems efficiently. Mathematical optimization techniques like dynamic programming, Lagrangian relaxation, and mixed-integer quadratic programming (MIQP) are commonly adopted for UC problems. However, the calculation time of these methods increases at an exponential rate with the number of generators and energy resources, which is still the main bottleneck in industry. Recent advances in artificial intelligence have demonstrated the capability of reinforcement learning (RL) to solve UC problems. Unfortunately, the existing research on solving UC problems with RL suffers from the curse of dimensionality when the size of UC problems grows. To deal with these problems, we propose an optimization method-assisted ensemble deep reinforcement learning algorithm, where UC problems are formulated as a Markov Decision Process (MDP) and solved by multi-step deep Q-learning in an ensemble framework. The proposed algorithm establishes a candidate action set by solving tailored optimization problems to ensure a relatively high performance and the satisfaction of operational constraints. Numerical studies on IEEE 118-bus and 300-bus systems show that our algorithm outperforms the baseline RL algorithm and MIQP. Furthermore, the proposed algorithm shows strong generalization capacity under unforeseen operational conditions.
    Pseudo-Poincaré: A Unification Framework for Euclidean and Hyperbolic Graph Neural Networks. (arXiv:2206.04285v1 [cs.LG])
    Hyperbolic neural networks have recently gained significant attention due to their promising results on several graph problems including node classification and link prediction. The primary reason for this success is the effectiveness of the hyperbolic space in capturing the inherent hierarchy of graph datasets. However, they are limited in terms of generalization and scalability, and have inferior performance on non-hierarchical datasets. In this paper, we take a completely orthogonal perspective for modeling hyperbolic networks. We use the Poincaré disk to model the hyperbolic geometry and also treat it as if the disk itself were a tangent space at the origin. This enables us to replace non-scalable Möbius gyrovector operations with a Euclidean approximation, thus simplifying the entire hyperbolic model to a Euclidean model cascaded with a hyperbolic normalization function. Our approach does not adhere to Möbius math, yet it still works in the Riemannian manifold, hence we call it the Pseudo-Poincaré framework. We applied our non-linear hyperbolic normalization to the current state-of-the-art homogeneous and multi-relational graph networks and demonstrate significant improvements in performance compared to both Euclidean and hyperbolic counterparts. The primary impact of this work lies in its ability to capture hierarchical features in the Euclidean space; it can thus replace hyperbolic networks without loss in performance metrics while simultaneously leveraging the power of Euclidean networks, such as interpretability and efficient execution of various model components.
    Unsupervised Knowledge Adaptation for Passenger Demand Forecasting. (arXiv:2206.04053v1 [cs.LG])
    Considering the multimodal nature of transport systems and potential cross-modal correlations, there is a growing trend of enhancing demand forecasting accuracy by learning from multimodal data. These multimodal forecasting models can improve accuracy but are less practical when different parts of multimodal datasets are owned by different institutions that cannot directly share data with one another. While institutions may not be able to share their data with each other directly, they may share forecasting models trained on their data, where such models cannot be used to identify the exact information in their datasets. This study proposes an Unsupervised Knowledge Adaptation Demand Forecasting framework to forecast the demand of the target mode by utilizing a pre-trained model based on data of another mode, which does not require direct data sharing of the source mode. The proposed framework utilizes the potential shared patterns among multiple transport modes to improve forecasting performance while avoiding the direct sharing of data among different institutions. Specifically, a pre-trained forecasting model is first learned based on the data of a source mode, which can capture and memorize the source travel patterns. Then, the demand data of the target dataset is encoded into an individual knowledge part and a sharing knowledge part, from which an individual extraction network and a sharing extraction network, respectively, extract travel patterns. The unsupervised knowledge adaptation strategy is utilized to form the sharing features for further forecasting by making the pre-trained network and the sharing extraction network analogous. Our findings illustrate that unsupervised knowledge adaptation by sharing the pre-trained model to the target mode can improve the forecasting performance without the dependence on direct data sharing.
    A General Framework For Proving The Equivariant Strong Lottery Ticket Hypothesis. (arXiv:2206.04270v1 [cs.LG])
    The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparameterized (dense) neural network that -- when initialized randomly and without any training -- achieves the accuracy of a fully trained target network. Recent work by da Cunha et al. (2022) demonstrates that the SLTH can also be extended to translation equivariant networks -- i.e. CNNs -- with the same level of overparametrization as needed for SLTs in dense networks. However, modern neural networks are capable of incorporating more than just translation symmetry, and developing general equivariant architectures such as rotation and permutation has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of the group $G$ -- i.e. $G$-equivariant networks -- and prove, with high probability, that one can prune a randomly initialized overparametrized $G$-equivariant network to a $G$-equivariant subnetwork that approximates another fully trained $G$-equivariant network of fixed width and depth. We further prove that our prescribed overparametrization scheme is also optimal as a function of the error tolerance. We develop our theory for a large range of groups, including important ones such as subgroups of the Euclidean group $\text{E}(n)$ and subgroups of the symmetric group $G \leq \mathcal{S}_n$ -- allowing us to find SLTs for MLPs, CNNs, $\text{E}(2)$-steerable CNNs, and permutation equivariant networks as specific instantiations of our unified framework which completely extends prior work. Empirically, we verify our theory by pruning overparametrized $\text{E}(2)$-steerable CNNs and message passing GNNs to match the performance of trained target networks within a given error tolerance.
    N-ACT: An Interpretable Deep Learning Model for Automatic Cell Type and Salient Gene Identification. (arXiv:2206.04047v1 [q-bio.GN])
    Single-cell RNA sequencing (scRNAseq) is rapidly advancing our understanding of cellular composition within complex tissues and organisms. A major limitation in most scRNAseq analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming, subjective, and require expertise. Given the surge in cell sequencing, supervised methods, especially deep learning models, have been developed for automatic cell type identification (ACTI), which achieve high accuracy and scalability. However, all existing deep learning frameworks for ACTI lack interpretability and are used as "black-box" models. We present N-ACT (Neural-Attention for Cell Type identification): the first-of-its-kind interpretable deep neural network for ACTI utilizing neural-attention to detect salient genes for use in cell-type identification. We compare N-ACT to conventional annotation methods on two previously manually annotated data sets, demonstrating that N-ACT accurately identifies marker genes and cell types in an unsupervised manner, while performing comparably on multiple data sets to the current state-of-the-art model in traditional supervised ACTI.
    What-Is and How-To for Fairness in Machine Learning: A Survey, Reflection, and Perspective. (arXiv:2206.04101v1 [cs.LG])
    Algorithmic fairness has attracted increasing attention in the machine learning community. Various definitions are proposed in the literature, but the differences and connections among them are not clearly addressed. In this paper, we review and reflect on various fairness notions previously proposed in machine learning literature, and make an attempt to draw connections to arguments in moral and political philosophy, especially theories of justice. We also consider fairness inquiries from a dynamic perspective, and further consider the long-term impact that is induced by current prediction and decision. In light of the differences in the characterized fairness, we present a flowchart that encompasses implicit assumptions and expected outcomes of different types of fairness inquiries on the data generating process, on the predicted outcome, and on the induced impact, respectively. This paper demonstrates the importance of matching the mission (which kind of fairness one would like to enforce) and the means (which spectrum of fairness analysis is of interest, what is the appropriate analyzing scheme) to fulfill the intended purpose.
    Hidden Markov Models with Momentum. (arXiv:2206.04057v1 [cs.LG])
    Momentum is a popular technique for improving convergence rates during gradient descent. In this research, we experiment with adding momentum to the Baum-Welch expectation-maximization algorithm for training Hidden Markov Models. We compare discrete Hidden Markov Models trained with and without momentum on English text and malware opcode data. The effectiveness of momentum is determined by measuring the changes in model score and classification accuracy due to momentum. Our extensive experiments indicate that adding momentum to Baum-Welch can reduce the number of iterations required for initial convergence during HMM training, particularly in cases where the model is slow to converge. However, momentum does not seem to improve the final model performance at a high number of iterations.
    Uplifting Bandits. (arXiv:2206.04091v1 [stat.ML])
    We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline. We study multiple variants of the problem, including when the baseline and affected variables are unknown, and prove sublinear regret bounds for all of these. We also provide lower bounds that justify the necessity of our modeling assumptions. Experiments on synthetic and real-world datasets show the benefit of methods that estimate the uplifts over policies that do not use this structure.
    On Transfer Learning in Functional Linear Regression. (arXiv:2206.04277v1 [stat.ML])
    This work studies the problem of transfer learning under the functional linear model framework, which aims to improve the fit of the target model by leveraging the knowledge from related source models. We measure the relatedness between target and source models using Reproducing Kernel Hilbert Spaces, allowing the type of knowledge being transferred to be interpreted by the structure of the spaces. Two algorithms are proposed: one transfers knowledge when the index of transferable sources is known, while the other one utilizes aggregation to achieve knowledge transfer without prior information about the sources. Furthermore, we establish the optimal convergence rates for excess risk, making the statistical gain via transfer learning mathematically provable. The effectiveness of the proposed algorithms is demonstrated on synthetic data as well as real financial data.
    Individually Fair Learning with One-Sided Feedback. (arXiv:2206.04475v1 [cs.LG])
    We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances. On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner, whose goal is to maximize accuracy while deploying individually fair policies. We first extend the framework of Bechavod et al. (2020), which relies on the existence of a human fairness auditor for detecting fairness violations, to instead incorporate feedback from dynamically-selected panels of multiple, possibly inconsistent, auditors. We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual combinatorial semi-bandit problem (Cesa-Bianchi & Lugosi, 2009; György et al., 2007). Finally, we show how to leverage the guarantees of two algorithms in the contextual combinatorial semi-bandit setting: Exp2 (Bubeck et al., 2012) and the oracle-efficient Context-Semi-Bandit-FTPL (Syrgkanis et al., 2016), to provide multi-criteria no-regret guarantees simultaneously for accuracy and fairness. Our results eliminate two potential sources of bias from prior work: the "hidden outcomes" that are not available to an algorithm operating in the full information setting, and human biases that might be present in any single human auditor, but can be mitigated by selecting a well-chosen panel.
    Ensembling Framework for Texture Extraction Techniques for Classification. (arXiv:2206.04158v1 [cs.CV])
    In the past few years, texture-based classification problems have proven their significance in many domains, from industrial inspection to health-related applications. New techniques and CNN-based architectures have been developed in recent years to solve texture-based classification problems. The limitation of these approaches is that none of them claims to be the best suited for all types of textures. Each technique has its advantage over a specific texture type. To address this issue, we propose a framework that combines existing techniques to extract texture features and achieves better results than any of them individually. The proposed framework works well on most texture types, and new techniques can also be added to it to achieve better results than existing ones. We also present SOTA results on the FMD and KTH datasets by combining three existing techniques using the proposed framework.
    Learning to Break Deep Perceptual Hashing: The Use Case NeuralHash. (arXiv:2111.06628v4 [cs.LG] UPDATED)
    Apple recently revealed its deep perceptual hashing system NeuralHash to detect child sexual abuse material (CSAM) on user devices before files are uploaded to its iCloud service. Public criticism quickly arose regarding the protection of user privacy and the system's reliability. In this paper, we present the first comprehensive empirical analysis of deep perceptual hashing based on NeuralHash. Specifically, we show that current deep perceptual hashing may not be robust. An adversary can manipulate the hash values by applying slight changes in images, either induced by gradient-based approaches or simply by performing standard image transformations, forcing or preventing hash collisions. Such attacks allow malicious actors to easily exploit the detection system: from hiding abusive material to framing innocent users, everything is possible. Moreover, using the hash values, inferences can still be made about the data stored on user devices. In our view, based on our results, deep perceptual hashing in its current form is generally not ready for robust client-side scanning and should not be used from a privacy perspective.
    Neural Prompt Search. (arXiv:2206.04673v1 [cs.CV])
    The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as "prompt modules" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has a good few-shot learning ability, and (iii) is domain-generalizable. The code and models are available at https://github.com/Davidzhangyuanhan/NOAH.
    CCP: Correlated Clustering and Projection for Dimensionality Reduction. (arXiv:2206.04189v1 [stat.ML])
    Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not need to solve any matrix. CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation based on sample correlations. Residue-Similarity (R-S) scores and indexes, the shape of data in Riemannian manifolds, and algebraic topology-based persistent Laplacian are introduced for visualization and analysis. Proposed methods are validated with benchmark datasets associated with various machine learning algorithms.
    Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance. (arXiv:2205.02293v3 [cs.CL] UPDATED)
    Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and that its conclusions are mostly correlational but not causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors, the train-test direction match (whether the human translation directions in the training and test sets are aligned), and data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on the MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/EdisonNi-hku/CausalMT
    Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint. (arXiv:2206.04569v1 [stat.ML])
    Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarial robust image classification are provided to support our theory.
    Beyond Time-Average Convergence: Near-Optimal Uncoupled Online Learning via Clairvoyant Multiplicative Weights Update. (arXiv:2111.14737v3 [cs.GT] UPDATED)
    In this paper, we provide a novel and simple algorithm, Clairvoyant Multiplicative Weights Updates (CMWU), for regret minimization in general games. CMWU effectively corresponds to the standard MWU algorithm but where all agents, when updating their mixed strategies, use the payoff profiles based on tomorrow's behavior, i.e. the agents are clairvoyant. CMWU achieves constant regret of $\ln(m)/\eta$ in all normal-form games with $m$ actions and fixed step-sizes $\eta$. Although CMWU encodes in its definition a fixed point computation, which in principle could result in dynamics that are neither computationally efficient nor uncoupled, we show that both of these issues can be largely circumvented. Specifically, as long as the step-size $\eta$ is upper bounded by $\frac{1}{(n-1)V}$, where $n$ is the number of agents and $[0,V]$ is the payoff range, then the CMWU updates can be computed linearly fast via a contraction map. This implementation results in an uncoupled online learning dynamic that admits a $o(\log T)$-sparse sub-sequence where each agent experiences at most $O(nV\log m)$ regret. This implies that the CMWU dynamics converge with rate $O(nV \log m \cdot W(T) / T)$ to a Coarse Correlated Equilibrium where $W(T)$ is the inverse of the function $g(t):=t\cdot 2^t$. The latter improves on the current state-of-the-art convergence rate of uncoupled online learning dynamics.
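    The clairvoyant update described above can be sketched for the two-player case as a fixed-point iteration: each player applies a multiplicative weights step using payoffs computed against the opponent's next strategy, and the pair of next strategies is found by iterating that map. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the fixed iteration count, and the toy payoff matrices are hypothetical, and payoffs are assumed to lie in $[0,1]$ so that the step size respects the bound $\eta \le \frac{1}{(n-1)V}$ with $n=2$ and $V=1$.

```python
import math

def mwu_update(x, gains, eta):
    """One multiplicative weights step: x_i <- x_i * exp(eta * gains_i), renormalized."""
    logits = [math.log(xi) + eta * g for xi, g in zip(x, gains)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def payoff_vectors(A, B, x, y):
    """Expected payoff of each pure action against the opponent's mixed strategy."""
    u_row = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    u_col = [sum(B[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    return u_row, u_col

def cmwu_step(A, B, x, y, eta, iters=100):
    """Approximate the clairvoyant fixed point (x_new, y_new), where each player's
    next strategy is an MWU step from the current one using payoffs evaluated
    against the opponent's *next* strategy."""
    x_new, y_new = list(x), list(y)
    for _ in range(iters):
        u_row, u_col = payoff_vectors(A, B, x_new, y_new)
        x_new, y_new = mwu_update(x, u_row, eta), mwu_update(y, u_col, eta)
    return x_new, y_new

# Hypothetical 2x2 game with payoffs in [0, 1] (row player A, column player B).
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
x, y = [0.7, 0.3], [0.4, 0.6]
eta = 0.5  # below the bound 1 / ((n - 1) * V) = 1 for n = 2, V = 1

for t in range(3):
    x, y = cmwu_step(A, B, x, y, eta)
    print(t, [round(v, 4) for v in x], [round(v, 4) for v in y])
```

    Replacing the inner loop with a single pass that evaluates payoffs against the current strategies gives ordinary simultaneous MWU; the only difference in the clairvoyant variant is which strategies the payoffs are evaluated against.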
    Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks. (arXiv:2206.04316v1 [cs.LG])
    Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together, i.e., adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show the label information can be efficiently backpropagated to the input while keeping the linear separability. Our theory and experimental evidence further show that the linear classifier trained with the adversarial noises of the training data can well classify the adversarial noises of the test data, indicating that adversarial noises actually inject a distributional perturbation to the original data distribution. Furthermore, we empirically demonstrate that the adversarial noises may become less linearly separable when the above conditions are compromised while they are still much easier to classify than original features.  ( 2 min )
    Unveiling Transformers with LEGO: a synthetic reasoning task. (arXiv:2206.04301v1 [cs.LG])
    We propose a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we are able to understand (to some extent) some of the attention heads as well as how the information flows in the network. Based on these observations we propose a hypothesis that here pretraining helps merely due to being a smart initialization rather than some deep knowledge stored in the network. We also observe that in some data regime the trained transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's ability to generalize to simple variants of the main task, and moreover we find that one can prevent such shortcut with appropriate architecture modification or careful data preparation. Motivated by our findings, we begin to explore the task of learning to execute C programs, where a convolutional modification to transformers, namely adding convolutional structures in the key/query/value maps, shows an encouraging edge.  ( 2 min )
    MEDIC: A Multi-Task Learning Dataset for Disaster Image Classification. (arXiv:2108.12828v4 [cs.CV] UPDATED)
    Recent research in disaster informatics demonstrates a practical and important use case of artificial intelligence: saving human lives and reducing suffering during natural disasters based on social media content (text and images). While notable progress has been made using texts, research on exploiting the images remains relatively under-explored. To advance image-based approaches, we propose MEDIC (Available at: https://crisisnlp.qcri.org/medic/index.html), which is the largest social media image classification dataset for humanitarian response consisting of 71,198 images to address four different tasks in a multi-task learning setup. This is the first dataset of its kind: social media images, disaster response, and multi-task learning research. An important property of this dataset is its high potential to facilitate research on multi-task learning, which has recently received much interest from the machine learning community and has shown remarkable results in terms of memory, inference speed, performance, and generalization capability. Therefore, the proposed dataset is an important resource for advancing image-based disaster management and multi-task machine learning research. We experiment with different deep learning architectures and report promising results, which are above the majority baselines for all tasks. Along with the dataset, we also release all relevant scripts (https://github.com/firojalam/medic).
    ExpressivE: A Spatio-Functional Embedding For Knowledge Graph Completion. (arXiv:2206.04192v1 [cs.LG])
    Knowledge graphs are inherently incomplete. Therefore substantial research has been directed towards knowledge graph completion (KGC), i.e., predicting missing triples from the information represented in the knowledge graph (KG). Embedding models have yielded promising results for KGC, yet any current KGC embedding model is incapable of: (1) fully capturing vital inference patterns (e.g., composition), (2) capturing prominent logical rules jointly (e.g., hierarchy and composition), and (3) providing an intuitive interpretation of captured patterns. In this work, we propose ExpressivE, a fully expressive spatio-functional embedding model that solves all these challenges simultaneously. ExpressivE embeds pairs of entities as points and relations as hyper-parallelograms in the virtual triple space $\mathbb{R}^{2d}$. This model design allows ExpressivE not only to capture a rich set of inference patterns jointly but additionally to display any supported inference pattern through the spatial relation of hyper-parallelograms, offering an intuitive and consistent geometric interpretation of ExpressivE embeddings and their captured patterns. Experimental results on standard KGC benchmarks reveal that ExpressivE is competitive with state-of-the-art models and even significantly outperforms them on WN18RR.  ( 2 min )
    Neonatal EEG graded for severity of background abnormalities in hypoxic-ischaemic encephalopathy. (arXiv:2206.04420v1 [physics.med-ph])
    This report describes a set of neonatal electroencephalogram (EEG) recordings graded according to the severity of abnormalities in the background pattern. The dataset consists of 169 hours of multichannel EEG from 53 neonates recorded in a neonatal intensive care unit. All neonates received a diagnosis of hypoxic-ischaemic encephalopathy (HIE), the most common cause of brain injury in full term infants. For each neonate, multiple 1-hour epochs of good quality EEG were selected and then graded for background abnormalities. The grading system assesses EEG attributes such as amplitude and frequency, continuity, sleep-wake cycling, symmetry and synchrony, and abnormal waveforms. Background severity was then categorised into 4 grades: normal or mildly abnormal, moderately abnormal, severely abnormal, and inactive EEG. The data can be used as a reference set of multi-channel EEG for neonates with HIE, for EEG training purposes, or for developing and evaluating automated grading algorithms.  ( 2 min )
    Estimation in Rotationally Invariant Generalized Linear Models via Approximate Message Passing. (arXiv:2112.04330v2 [stat.ML] UPDATED)
    We consider the problem of signal estimation in generalized linear models defined via rotationally invariant design matrices. Since these matrices can have an arbitrary spectral distribution, this model is well suited for capturing complex correlation structures which often arise in applications. We propose a novel family of approximate message passing (AMP) algorithms for signal estimation, and rigorously characterize their performance in the high-dimensional limit via a state evolution recursion. Our rotationally invariant AMP has complexity of the same order as the existing AMP derived under the restrictive assumption of a Gaussian design; our algorithm also recovers this existing AMP as a special case. Numerical results showcase a performance close to Vector AMP (which is conjectured to be Bayes-optimal in some settings), but obtained with a much lower complexity, as the proposed algorithm does not require a computationally expensive singular value decomposition.
    Redundancy in Deep Linear Neural Networks. (arXiv:2206.04490v1 [cs.LG])
    Conventional wisdom states that deep linear neural networks benefit from expressiveness and optimization advantages over a single linear layer. This paper suggests that, in practice, the training process of deep linear fully-connected networks using conventional optimizers is convex in the same manner as a single linear fully-connected layer. This paper aims to explain this claim and demonstrate it. Even though convolutional networks are not aligned with this description, this work aims to attain a new conceptual understanding of fully-connected linear networks that might shed light on the possible constraints of convolutional settings and non-linear architectures.
    Trajectory-dependent Generalization Bounds for Deep Neural Networks via Fractional Brownian Motion. (arXiv:2206.04359v1 [cs.LG])
    Despite being tremendously overparameterized, it is appreciated that deep neural networks trained by stochastic gradient descent (SGD) generalize surprisingly well. Based on the Rademacher complexity of a pre-specified hypothesis set, different norm-based generalization bounds have been developed to explain this phenomenon. However, recent studies suggest these bounds might be problematic as they increase with the training set size, which is contrary to empirical evidence. In this study, we argue that the hypothesis set SGD explores is trajectory-dependent and thus may provide a tighter bound over its Rademacher complexity. To this end, we characterize the SGD recursion via a stochastic differential equation by assuming the incurred stochastic gradient noise follows the fractional Brownian motion. We then identify the Rademacher complexity in terms of the covering numbers and relate it to the Hausdorff dimension of the optimization trajectory. By invoking the hypothesis set stability, we derive a novel generalization bound for deep neural networks. Extensive experiments demonstrate that it predicts well the generalization gap over several common experimental interventions. We further show that the Hurst parameter of the fractional Brownian motion is more informative than existing generalization indicators such as the power-law index and the upper Blumenthal-Getoor index.
    Evaluating Aleatoric Uncertainty via Conditional Generative Models. (arXiv:2206.04287v1 [cs.LG])
    Aleatoric uncertainty quantification seeks distributional knowledge of random responses, which is important for reliability analysis and robustness improvement in machine learning applications. Previous research on aleatoric uncertainty estimation mainly targets closed-form conditional densities or variances, which requires strong restrictions on the data distribution or dimensionality. To overcome these restrictions, we study conditional generative models for aleatoric uncertainty estimation. We introduce two metrics to measure the discrepancy between two conditional distributions that suit these models. Both metrics can be easily and unbiasedly computed via Monte Carlo simulation of the conditional generative models, thus facilitating their evaluation and training. We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies and can be used to train conditional models competitive against existing benchmarks.
    Early Transferability of Adversarial Examples in Deep Neural Networks. (arXiv:2206.04472v1 [cs.LG])
    This paper will describe and analyze a new phenomenon that was not known before, which we call "Early Transferability". Its essence is that the adversarial perturbations transfer among different networks even at extremely early stages in their training. In fact, one can initialize two networks with two different independent choices of random weights and measure the angle between their adversarial perturbations after each step of the training. What we discovered was that these two adversarial directions started to align with each other already after the first few training steps (which typically use only a small fraction of the available training data), even though the accuracy of the two networks hadn't started to improve from their initial bad values due to the early stage of the training. The purpose of this paper is to present this phenomenon experimentally and propose plausible explanations for some of its properties.  ( 2 min )
    CFA: Coupled-hypersphere-based Feature Adaptation for Target-Oriented Anomaly Localization. (arXiv:2206.04325v1 [cs.CV])
    For a long time, anomaly localization has been widely used in industries. Previous studies focused on approximating the distribution of normal features without adaptation to a target dataset. However, since anomaly localization should precisely discriminate normal and abnormal features, the absence of adaptation may make the normality of abnormal features overestimated. Thus, we propose Coupled-hypersphere-based Feature Adaptation (CFA) which accomplishes sophisticated anomaly localization using features adapted to the target dataset. CFA consists of (1) a learnable patch descriptor that learns and embeds target-oriented features and (2) scalable memory bank independent of the size of the target dataset. And, CFA adopts transfer learning to increase the normal feature density so that abnormal features can be clearly distinguished by applying patch descriptor and memory bank to a pre-trained CNN. The proposed method outperforms the previous methods quantitatively and qualitatively. For example, it provides an AUROC score of 99.5% in anomaly detection and 98.5% in anomaly localization of MVTec AD benchmark. In addition, this paper points out the negative effects of biased features of pre-trained CNNs and emphasizes the importance of the adaptation to the target dataset. The code is publicly available at https://github.com/sungwool/CFA_for_anomaly_localization.  ( 2 min )
    Xplique: A Deep Learning Explainability Toolbox. (arXiv:2206.04394v1 [cs.LG])
    Today's most advanced machine-learning models are hardly scrutable. The key challenge for explainability methods is to assist researchers in opening up these black boxes, by revealing the strategy that led to a given decision, by characterizing their internal states or by studying the underlying data representation. To address this challenge, we have developed Xplique: a software library for explainability which includes representative explainability methods as well as associated evaluation metrics. It interfaces with one of the most popular learning libraries, TensorFlow, as well as other libraries including PyTorch, scikit-learn and Theano. The code is licensed under the MIT license and is freely available at github.com/deel-ai/xplique.
    Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning. (arXiv:2206.04384v1 [cs.LG])
    World models in model-based reinforcement learning usually face unrealistic long-time-horizon prediction issues due to compounding errors as the prediction errors accumulate over timesteps. Recent works in graph-structured world models improve the long-horizon reasoning ability via building a graph to represent the environment, but they are designed in a goal-conditioned setting and cannot guide the agent to maximize episode returns in a traditional reinforcement learning setting without externally given target states. To overcome this limitation, we design a graph-structured world model in offline reinforcement learning by building a directed-graph-based Markov decision process (MDP) with rewards allocated to each directed edge as an abstraction of the original continuous environment. As our world model has small and finite state/action spaces compared to the original environment, value iteration can be easily applied here to estimate state values on the graph and figure out the best future. Unlike previous graph-structured world models that require externally provided targets, our world model, dubbed Value Memory Graph (VMG), can provide the desired targets with high values by itself. VMG can be used to guide low-level goal-conditioned policies that are trained via supervised learning to maximize episode returns. Experiments on the D4RL benchmark show that VMG can outperform state-of-the-art methods in several tasks where long horizon reasoning ability is crucial. Code will be made publicly available.
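Value iteration on a small directed graph with edge rewards can be sketched as follows. This is a generic illustration, not the VMG code (which the abstract says is yet to be released): the dictionary-based graph encoding, the deterministic transitions, the discount factor, and the function names are all assumptions.

```python
def value_iteration(edges, gamma=0.9, tol=1e-9, max_iters=10_000):
    """Estimate node values on a directed graph with rewards on edges.

    edges maps a node to a list of (next_node, reward) pairs. Nodes with no
    outgoing edges are treated as terminal and keep value 0 (an assumption).
    """
    nodes = set(edges)
    for outs in edges.values():
        nodes.update(nxt for nxt, _ in outs)
    values = {n: 0.0 for n in nodes}
    for _ in range(max_iters):
        delta = 0.0
        for n in nodes:
            outs = edges.get(n, [])
            if not outs:
                continue  # terminal node
            # Bellman optimality backup over outgoing edges.
            new_v = max(r + gamma * values[nxt] for nxt, r in outs)
            delta = max(delta, abs(new_v - values[n]))
            values[n] = new_v
        if delta < tol:
            break
    return values

def best_next(edges, values, node, gamma=0.9):
    """Pick the successor that maximizes reward plus discounted value."""
    return max(edges[node], key=lambda e: e[1] + gamma * values[e[0]])[0]
```

In this reading, the node returned by `best_next` would play the role of the high-value target handed to a low-level goal-conditioned policy.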
    Learning to generate imaginary tasks for improving generalization in meta-learning. (arXiv:2206.04335v1 [cs.LG])
    The success of meta-learning on existing benchmarks is predicated on the assumption that the distribution of meta-training tasks covers meta-testing tasks. Frequent violation of the assumption in applications with either insufficient tasks or a very narrow meta-training task distribution leads to memorization or learner overfitting. Recent solutions have pursued augmentation of meta-training tasks, while it is still an open question to generate both correct and sufficiently imaginary tasks. In this paper, we seek an approach that up-samples meta-training tasks from the task representation via a task up-sampling network. Besides, the resulting approach named Adversarial Task Up-sampling (ATU) suffices to generate tasks that can maximally contribute to the latest meta-learner by maximizing an adversarial loss. On few-shot sine regression and image classification datasets, we empirically validate the marked improvement of ATU over state-of-the-art task augmentation strategies in the meta-testing performance and also the quality of up-sampled tasks.  ( 2 min )
    SDQ: Stochastic Differentiable Quantization with Mixed Precision. (arXiv:2206.04459v1 [cs.LG])
    In order to deploy deep models in a computationally efficient manner, model quantization approaches have been frequently used. In addition, as new hardware that supports mixed bitwidth arithmetic operations emerges, recent research on mixed precision quantization (MPQ) has begun to fully leverage the capacity of representation by searching optimized bitwidths for different layers and modules in a network. However, previous studies mainly search the MPQ strategy in a costly scheme using reinforcement learning, neural architecture search, etc., or simply utilize partial prior knowledge for bitwidth assignment, which might be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally-optimized space with smoother gradient approximation. In particular, Differentiable Bitwidth Parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidth choices. After the optimal MPQ strategy is acquired, we further train our network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method for several networks on different hardware (GPUs and FPGA) and datasets. SDQ outperforms all state-of-the-art mixed or single precision quantization with a lower bitwidth and is even better than the full-precision counterparts across various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method.
    Discriminative and Generative Learning for Linear Estimation of Random Signals [Lecture Notes]. (arXiv:2206.04432v1 [eess.SP])
    Inference tasks in signal processing are often characterized by the availability of reliable statistical modeling with some missing instance-specific parameters. One conventional approach uses data to estimate these missing parameters and then infers based on the estimated model. Alternatively, data can also be leveraged to directly learn the inference mapping end-to-end. These approaches for combining partially-known statistical models and data in inference are related to the notions of generative and discriminative models used in the machine learning literature, typically considered in the context of classifiers. The goal of this lecture note is to introduce the concepts of generative and discriminative learning for inference with a partially-known statistical model. While machine learning systems often lack the interpretability of traditional signal processing methods, we focus on a simple setting where one can interpret and compare the approaches in a tractable manner that is accessible and relevant to signal processing readers. In particular, we exemplify the approaches for the task of Bayesian signal estimation in a jointly Gaussian setting with the mean-squared error (MSE) objective, i.e., a linear estimation setting.  ( 2 min )
    On the Generalization and Adaption Performance of Causal Models. (arXiv:2206.04620v1 [cs.LG])
    Learning models that offer robust out-of-distribution generalization and fast adaptation is a key challenge in modern machine learning. Modelling causal structure into neural networks holds the promise to accomplish robust zero and few-shot adaptation. Recent advances in differentiable causal discovery have proposed to factorize the data generating process into a set of modules, i.e. one module for the conditional distribution of every variable where only causal parents are used as predictors. Such a modular decomposition of knowledge enables adaptation to distribution shifts by only updating a subset of parameters. In this work, we systematically study the generalization and adaption performance of such modular neural causal models by comparing them to monolithic models and structured models where the set of predictors is not constrained to causal parents. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes and offer robust generalization. We also found that the effects are more significant for sparser graphs as compared to denser graphs.
    Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk. (arXiv:2206.04436v1 [cs.LG])
    Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.  ( 2 min )
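As a rough illustration of the risk measure involved, the empirical CVaR of a batch of sampled losses can be computed as the mean of the worst tail. This sketch is not the CPPO algorithm: treating negated episode returns as losses, the tail-size convention, and the function name are assumptions.

```python
import math
import numpy as np

def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled losses.

    alpha is the confidence level in [0, 1); alpha = 0 recovers the plain mean.
    At least one sample is always kept in the tail.
    """
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, math.ceil((1.0 - alpha) * losses.size))
    return losses[-k:].mean()
```

A CVaR constraint of the kind described above would then compare this quantity, estimated from rollouts, against a chosen threshold.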
    A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networks. (arXiv:2206.04360v1 [cs.LG])
    We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(\mu)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $\mu$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in detail the application to two sets $F$: Hölder balls and multivariate monotonic functions. Besides matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).
    ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation. (arXiv:2107.11769v3 [cs.CV] UPDATED)
    Despite the success of deep learning on supervised point cloud semantic segmentation, obtaining large-scale point-by-point manual annotations is still a significant challenge. To reduce the huge annotation burden, we propose a Region-based and Diversity-aware Active Learning (ReDAL), a general framework for many deep learning approaches, aiming to automatically select only informative and diverse sub-scene regions for label acquisition. Observing that only a small portion of annotated regions are sufficient for 3D scene understanding with deep learning, we use softmax entropy, color discontinuity, and structural complexity to measure the information of sub-scene regions. A diversity-aware selection algorithm is also developed to avoid redundant annotations resulting from selecting informative but similar regions in a querying batch. Extensive experiments show that our method highly outperforms previous active learning strategies, and we achieve the performance of 90% fully supervised learning, while less than 15% and 5% annotations are required on S3DIS and SemanticKITTI datasets, respectively. Our code is publicly available at https://github.com/tsunghan-wu/ReDAL.  ( 2 min )
    Convolutional Dictionary Learning by End-To-End Training of Iterative Neural Networks. (arXiv:2206.04447v1 [eess.IV])
    Sparsity-based methods have a long history in the field of signal processing and have been successfully applied to various image reconstruction problems. The involved sparsifying transformations or dictionaries are typically either pre-trained using a model which reflects the assumed properties of the signals or adaptively learned during the reconstruction - yielding so-called blind Compressed Sensing approaches. However, by doing so, the transforms are never explicitly trained in conjunction with the physical model which generates the signals. In addition, properly choosing the involved regularization parameters remains a challenging task. Another recently emerged training paradigm for regularization methods is to use iterative neural networks (INNs) - also known as unrolled networks - which contain the physical model. In this work, we construct an INN which can be used as a supervised and physics-informed online convolutional dictionary learning algorithm. We evaluated the proposed approach by applying it to a realistic large-scale dynamic MR reconstruction problem and compared it to several other recently published works. We show that the proposed INN improves over two conventional model-agnostic training methods and yields competitive results also compared to a deep INN. Further, it does not require choosing the regularization parameters and - in contrast to deep INNs - each network component is entirely interpretable.
    OptWedge: Cognitive Optimized Guidance toward Off-screen POIs. (arXiv:2206.04293v1 [cs.HC])
    Guiding users toward off-screen points of interest (POIs) is a practical way of providing additional information to users of small-screen devices, such as smart devices and head-mounted displays. Popular previous methods involve displaying a primitive figure referred to as Wedge on the screen so that users can estimate the off-screen POI located at its invisible vertex. Because they utilize a cognitive process referred to as amodal completion, where users can imagine the entire figure even when a part of it is occluded, localization accuracy is influenced by bias and individual differences. To improve the accuracy, we propose to optimize the figure using a cognitive cost that considers the influence. We also design two types of optimizations with different parameters: unbiased OptWedge (UOW) and biased OptWedge (BOW). Experimental results indicate that OptWedge achieves more accurate guidance for a close distance compared to the heuristic approach.
    Unsupervised Learning of the Total Variation Flow. (arXiv:2206.04406v1 [cs.CV])
    The total variation (TV) flow generates a scale-space representation of an image based on the TV functional. This gradient flow observes desirable features for images such as sharp edges and enables spectral, scale, and texture analysis. The standard numerical approach for TV flow requires solving multiple non-smooth optimisation problems. Even with state-of-the-art convex optimisation techniques, this is often prohibitively expensive and strongly motivates the use of alternative, faster approaches. Inspired by and extending the framework of physics-informed neural networks (PINNs), we propose the TVflowNET, a neural network approach to compute the solution of the TV flow given an initial image and a time instance. We speed up the computation by more than one order of magnitude and show that the TVflowNET approximates the TV flow solution with high fidelity. This is a preliminary report; more details are to follow.
    Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer. (arXiv:2206.04452v1 [cs.CV])
    Although autoregressive models have achieved promising results on image generation, their unidirectional generation process prevents the resultant images from fully reflecting global contexts. To address the issue, we propose an effective image generation framework of Draft-and-Revise with Contextual RQ-transformer to consider global contexts during the generation process. As a generalized VQ-VAE, RQ-VAE first represents a high-resolution image as a sequence of discrete code stacks. After code stacks in the sequence are randomly masked, Contextual RQ-Transformer is trained to infill the masked code stacks based on the unmasked contexts of the image. Then, Contextual RQ-Transformer uses our two-phase decoding, Draft-and-Revise, and generates an image, while exploiting the global contexts of the image during the generation process. Specifically, in the draft phase, our model first focuses on generating diverse images despite rather low quality. Then, in the revise phase, the model iteratively improves the quality of images, while preserving the global contexts of generated images. In experiments, our method achieves state-of-the-art results on conditional image generation. We also validate that the Draft-and-Revise decoding can achieve high performance by effectively controlling the quality-diversity trade-off in image generation.
    Regret Analysis of Certainty Equivalence Policies in Continuous-Time Linear-Quadratic Systems. (arXiv:2206.04434v1 [cs.LG])
    This work studies theoretical performance guarantees of a ubiquitous reinforcement learning policy for controlling the canonical model of stochastic linear-quadratic system. We show that randomized certainty equivalent policy addresses the exploration-exploitation dilemma for minimizing quadratic costs in linear dynamical systems that evolve according to stochastic differential equations. More precisely, we establish square-root of time regret bounds, indicating that randomized certainty equivalent policy learns optimal control actions fast from a single state trajectory. Further, linear scaling of the regret with the number of parameters is shown. The presented analysis introduces novel and useful technical approaches, and sheds light on fundamental challenges of continuous-time reinforcement learning.  ( 2 min )
    Multi-class Classification with Fuzzy-feature Observations: Theory and Algorithms. (arXiv:2206.04311v1 [cs.LG])
    The theoretical analysis of multi-class classification has proved that the existing multi-class classification methods can train a classifier with high classification accuracy on the test set, when the instances are precise in the training and test sets with same distribution and enough instances can be collected in the training set. However, one limitation with multi-class classification has not been solved: how to improve the classification accuracy of multi-class classification problems when only imprecise observations are available. Hence, in this paper, we propose a novel framework to address a new realistic problem called multi-class classification with imprecise observations (MCIMO), where we need to train a classifier with fuzzy-feature observations. Firstly, we give the theoretical analysis of the MCIMO problem based on fuzzy Rademacher complexity. Then, two practical algorithms based on support vector machine and neural networks are constructed to solve the proposed new problem. Experiments on both synthetic and real-world datasets verify the rationality of our theoretical analysis and the efficacy of the proposed algorithms.
    PyDTS: A Python Package for Discrete-Time Survival (Regularized) Regression with Competing Risks. (arXiv:2204.05731v2 [stat.ML] UPDATED)
    Time-to-event analysis (survival analysis) is used when the outcome or the response of interest is the time until a pre-specified event occurs. Time-to-event data are sometimes discrete either because time itself is discrete or due to grouping of failure times into intervals or rounding off measurements. In addition, the failure of an individual could be one of several distinct failure types; known as competing risks (events) data. This work focuses on discrete-time regression with competing events. We emphasize the main difference between the continuous and discrete settings with competing events, develop a new estimation procedure, and present PyDTS, an open source Python package which implements our estimation procedure and other tools for discrete-time-survival analysis with competing risks.
    Meet You Halfway: Explaining Deep Learning Mysteries. (arXiv:2206.04463v1 [cs.LG])
    Deep neural networks perform exceptionally well on various learning tasks with state-of-the-art results. While these models are highly expressive and achieve impressively accurate solutions with excellent generalization abilities, they are susceptible to minor perturbations. Samples that suffer such perturbations are known as "adversarial examples". Even though deep learning is an extensively researched field, many questions about the nature of deep learning models remain unanswered. In this paper, we introduce a new conceptual framework, accompanied by a formal description, that aims to shed light on the network's behavior and interpret the behind-the-scenes of the learning process. Our framework provides an explanation for inherent questions concerning deep learning. In particular, we clarify: (1) Why do neural networks acquire generalization abilities? (2) Why do adversarial examples transfer between different models? We provide a comprehensive set of experiments that support this new framework, as well as its underlying theory.
    Exploring Predictive States via Cantor Embeddings and Wasserstein Distance. (arXiv:2206.04198v1 [cond-mat.stat-mech])
    Predictive states for stochastic processes are a nonparametric and interpretable construct with relevance across a multitude of modeling paradigms. Recent progress on the self-supervised reconstruction of predictive states from time-series data focused on the use of reproducing kernel Hilbert spaces. Here, we examine how Wasserstein distances may be used to detect predictive equivalences in symbolic data. We compute Wasserstein distances between distributions over sequences ("predictions"), using a finite-dimensional embedding of sequences based on the Cantor set for the underlying geometry. We show that exploratory data analysis using the resulting geometry via hierarchical clustering and dimension reduction provides insight into the temporal structure of processes ranging from the relatively simple (e.g., finite-state hidden Markov models) to the very complex (e.g., infinite-state indexed grammars).
    On Gradient Descent Convergence beyond the Edge of Stability. (arXiv:2206.04172v1 [cs.LG])
    Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability", where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability. In this work, we study a local condition for such an unstable convergence around a local minimum in a low-dimensional setting. We then leverage these insights to establish global convergence of a two-layer single-neuron ReLU student network aligning with the teacher neuron in a large learning rate beyond the Edge of Stability under population loss. Moreover, while the difference of norms of the two layers is preserved by gradient flow, we show that GD above the edge of stability induces a balancing effect, leading to the same norms across the layers.
    TreeFlow: Going beyond Tree-based Gaussian Probabilistic Regression. (arXiv:2206.04140v1 [cs.LG])
    The tree-based ensembles are known for their outstanding performance for classification and regression problems characterized by feature vectors represented by mixed-type variables from various ranges and domains. However, considering regression problems, they are primarily designed to provide deterministic responses or model the uncertainty of the output with a Gaussian distribution. In this work, we introduce TreeFlow, the tree-based approach that combines the benefits of using tree ensembles with capabilities of modeling flexible probability distributions using normalizing flows. The main idea of the solution is to use a tree-based model as a feature extractor and combine it with a conditional variant of normalizing flow. Consequently, our approach is capable of modeling complex distributions for the regression outputs. We evaluate the proposed method on challenging regression benchmarks with varying volume, feature characteristics, and target dimensionality. We obtain the SOTA results on datasets with non-gaussian target distributions and competitive results on gaussian ones compared to tree-based regression baselines.
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v1 [cs.CV])
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
    Analytical Composition of Differential Privacy via the Edgeworth Accountant. (arXiv:2206.04236v1 [cs.CR])
    Many modern machine learning algorithms are composed of simple private algorithms; thus, an increasingly important problem is to efficiently compute the overall privacy loss under composition. In this study, we introduce the Edgeworth Accountant, an analytical approach to composing differential privacy guarantees of private algorithms. The Edgeworth Accountant starts by losslessly tracking the privacy loss under composition using the $f$-differential privacy framework, which allows us to express the privacy guarantees using privacy-loss log-likelihood ratios (PLLRs). As the name suggests, this accountant next uses the Edgeworth expansion to upper and lower bound the probability distribution of the sum of the PLLRs. Moreover, by relying on a technique for approximating complex distributions using simple ones, we demonstrate that the Edgeworth Accountant can be applied to the composition of any noise-addition mechanism. Owing to certain appealing features of the Edgeworth expansion, the $(\epsilon, \delta)$-differential privacy bounds offered by this accountant are non-asymptotic, with essentially no extra computational cost, as opposed to prior approaches, wherein the running times increase with the number of compositions. Finally, we demonstrate that our upper and lower $(\epsilon, \delta)$-differential privacy bounds are tight in federated analytics and certain regimes of training private deep learning models.
    Deep Hierarchical Planning from Pixels. (arXiv:2206.04114v1 [cs.AI])
    Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven to be challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.
    CASS: Cross Architectural Self-Supervision for Medical Image Analysis. (arXiv:2206.04170v1 [cs.CV])
    Recent advances in Deep Learning and Computer Vision have alleviated many of the bottlenecks, allowing algorithms to be label-free with better performance. Specifically, Transformers provide a global perspective of the image, which Convolutional Neural Networks (CNN) lack by design. Here we present Cross Architectural Self-Supervision (CASS), a novel self-supervised learning approach which leverages transformers and CNN simultaneously, while also being computationally accessible to general practitioners via easily available cloud services. Compared to existing state-of-the-art self-supervised learning approaches, we empirically show that CASS-trained CNNs and Transformers gained an average of 8.5% with 100% labelled data, 7.3% with 10% labelled data, and 11.5% with 1% labelled data, across three diverse datasets. Notably, one of the employed datasets included histopathology slides of an autoimmune disease, a topic that is underrepresented in medical imaging and has minimal data. In addition, our findings reveal that CASS is twice as efficient as other state-of-the-art methods in terms of training time.
    Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information. (arXiv:2206.04282v1 [cs.LG])
    In real-world reinforcement learning applications the learner's observation space is ubiquitously high-dimensional with both relevant and irrelevant information about the task at hand. Learning from high-dimensional observations has been the subject of extensive investigation in supervised learning and statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning, the Exogenous Markov Decision Process (ExoMDP), in which the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner's actions, but evolves in an arbitrary, temporally correlated fashion. We provide a new algorithm, ExoRL, which learns a near-optimal policy with sample complexity polynomial in the size of the endogenous component and nearly independent of the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation going forward.
    Words are all you need? Capturing human sensory similarity with textual descriptors. (arXiv:2206.04105v1 [cs.CL])
    Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding of images and videos. Yet, it remains unclear to what extent language can fully capture sensory experiences across different modalities. A well-established approach for characterizing sensory experiences relies on similarity judgments, namely, the degree to which people perceive two distinct stimuli as similar. We explore the relation between human similarity judgments and language in a series of large-scale behavioral studies ($N=1,823$ participants) across three modalities (images, audio, and video) and two types of text descriptors: simple word tags and free-text captions. In doing so, we introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general. We show that our prediction pipeline based on text descriptors exhibits excellent performance, and we compare it against a comprehensive array of 611 baseline models based on vision-, audio-, and video-processing architectures. We further show that the degree to which textual descriptors and models predict human similarity varies across and within modalities. Taken together, these studies illustrate the value of integrating machine learning and cognitive science approaches to better understand the similarities and differences between human and machine representations. We present an interactive visualization at https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the similarity between stimuli as experienced by humans and different methods reported in the paper.
    Simplifying Polylogarithms with Machine Learning. (arXiv:2206.04115v1 [cs.LG])
    Polylogarithmic functions, such as the logarithm or dilogarithm, satisfy a number of algebraic identities. For the logarithm, all the identities follow from the product rule. For the dilogarithm and higher-weight classical polylogarithms, the identities can involve five functions or more. In many calculations relevant to particle physics, complicated combinations of polylogarithms often arise from Feynman integrals. Although the initial expressions resulting from the integration usually simplify, it is often difficult to know which identities to apply and in what order. To address this bottleneck, we explore to what extent machine learning methods can help. We consider both a reinforcement learning approach, where the identities are analogous to moves in a game, and a transformer network approach, where the problem is viewed analogously to a language-translation task. While both methods are effective, the transformer network appears more powerful and holds promise for practical use in symbolic manipulation tasks in mathematical physics.
    A Comprehensive Survey of Graph-based Deep Learning Approaches for Anomaly Detection in Complex Distributed Systems. (arXiv:2206.04149v1 [cs.LG])
    Anomaly detection is an important problem for complex distributed systems consisting of hardware and software components. A thorough understanding of the requirements and challenges of anomaly detection for such systems is pivotal to the security of a system, especially for real-world deployment. While there have been many diverse research areas and application domains that deal with the problem, few have attempted to provide an in-depth look at such systems. Most anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. In this survey, we explore the significant potential of graph-based algorithms to identify and mitigate different types of anomalies in complex distributed heterogeneous systems. Our main focus is to provide an in-depth look at graphs when applied on heterogeneous computing devices spread across complex distributed systems. This study analyzes, compares, and contrasts the state-of-the-art research articles in the field. First, we describe the characteristics of real-world distributed systems and the specific challenges of anomaly detection in such complex networks, such as data and evaluation, nature of the anomalies, and real-world requirements. Next, we discuss why graphs can be leveraged in such systems and the benefits of utilizing graphs. Then we delve into the state-of-the-art approaches and highlight their strengths and weaknesses. Finally, we evaluate and compare these approaches and point out the areas for possible improvements.
    Likelihood-free Model Choice for Simulator-based Models with the Jensen--Shannon Divergence. (arXiv:2206.04110v1 [stat.ME])
    Choice of appropriate structure and parametric dimension of a model in the light of data has a rich history in statistical research, where the first seminal approaches were developed in the 1970s, such as Akaike's and Schwarz's model scoring criteria that were inspired by information theory and embodied the rationale called Occam's razor. After those pioneering works, model choice was quickly established as its own field of research, gaining considerable attention in both computer science and statistics. However, to date, there have been limited attempts to derive scoring criteria for simulator-based models lacking a likelihood expression. Bayes factors have been considered for such models, but arguments have been put forward both for and against their use and around issues related to their consistency. Here we use the asymptotic properties of Jensen--Shannon divergence (JSD) to derive a consistent model scoring criterion for the likelihood-free setting called JSD-Razor. Relationships of JSD-Razor with established scoring criteria for the likelihood-based approach are analyzed and we demonstrate the favorable properties of our criterion using both synthetic and real modeling examples.
    Generative Flow Networks for Discrete Probabilistic Modeling. (arXiv:2202.01361v2 [cs.LG] UPDATED)
    We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks. Code is publicly available at https://github.com/zdhNarsil/EB_GFN.
    Contextual Information-Directed Sampling. (arXiv:2205.10895v2 [cs.LG] UPDATED)
    Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm. However, it is still unclear what the right form of information ratio to optimize is when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more in the actions that are beneficial for future unseen contexts, while conditional IDS can be myopic. We further propose a computationally efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.
    Estimation in Rotationally Invariant Generalized Linear Models via Approximate Message Passing. (arXiv:2112.04330v2 [stat.ML] UPDATED)
    We consider the problem of signal estimation in generalized linear models defined via rotationally invariant design matrices. Since these matrices can have an arbitrary spectral distribution, this model is well suited for capturing complex correlation structures which often arise in applications. We propose a novel family of approximate message passing (AMP) algorithms for signal estimation, and rigorously characterize their performance in the high-dimensional limit via a state evolution recursion. Our rotationally invariant AMP has complexity of the same order as the existing AMP derived under the restrictive assumption of a Gaussian design; our algorithm also recovers this existing AMP as a special case. Numerical results showcase a performance close to Vector AMP (which is conjectured to be Bayes-optimal in some settings), but obtained with a much lower complexity, as the proposed algorithm does not require a computationally expensive singular value decomposition.
    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. (arXiv:2206.04615v1 [cs.CL])
    Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
    Globally Optimal Algorithms for Fixed-Budget Best Arm Identification. (arXiv:2206.04646v1 [stat.ML])
    We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, limited characterizations have been discussed on the rate (exponent) of this value. In this paper, we characterize the optimal rate as a result of global optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the misidentification probability, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To deal with this issue, we introduce the second rate $R^{\mathrm{go}}_{\infty}$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v1 [cs.LG])
    Injecting noise within gradient descent has several desirable features. In this paper, we explore noise injection before computing a gradient step, which is known to have smoothing and regularizing properties. We show that small perturbations induce explicit regularization for simple finite-dimensional models based on the l1-norm, group l1-norms, or nuclear norms. When applied to overparametrized neural networks with large widths, we show that the same perturbations do not work due to variance explosion resulting from overparametrization. However, we also show that independent layer-wise perturbations avoid the exploding variance term, and explicit regularizers can then be obtained. We empirically show that the small perturbations lead to better generalization performance than vanilla (stochastic) gradient descent training, with minor adjustments to the training procedure.
    Learning Invariant Representations with Missing Data. (arXiv:2112.00881v2 [cs.LG] UPDATED)
    Spurious correlations allow flexible models to predict well during training but poorly on related test distributions. Recent work has shown that models that satisfy particular independencies involving correlation-inducing nuisance variables have guarantees on their test performance. Enforcing such independencies requires nuisances to be observed during training. However, nuisances, such as demographics or image background labels, are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. Here we derive MMD estimators used for invariance objectives under missing nuisances. On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
    Hilbert Curve Projection Distance for Distribution Comparison. (arXiv:2205.15059v2 [cs.LG] UPDATED)
    Distribution comparison plays a central role in many machine learning tasks like data classification and generative modeling. In this study, we propose a novel metric, called Hilbert curve projection (HCP) distance, to measure the distance between two probability distributions with high robustness and low complexity. In particular, we first project two high-dimensional probability densities using Hilbert curve to obtain a coupling between them, and then calculate the transport distance between these two densities in the original space, according to the coupling. We show that HCP distance is a proper metric and is well-defined for absolutely continuous probability measures. Furthermore, we demonstrate that the empirical HCP distance converges to its population counterpart at a rate of no more than $O(n^{-1/2d})$ under regularity conditions. To suppress the curse-of-dimensionality, we also develop two variants of the HCP distance using (learnable) subspace projections. Experiments on both synthetic and real-world data show that our HCP distance works as an effective surrogate of the Wasserstein distance with low complexity and overcomes the drawbacks of the sliced Wasserstein distance.
    Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent. (arXiv:2002.04861v3 [stat.ML] UPDATED)
    We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.
    Objective-Based Hierarchical Clustering of Deep Embedding Vectors. (arXiv:2012.08466v2 [cs.LG] UPDATED)
    We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).
    Cooperative learning for multi-view analysis. (arXiv:2112.12337v5 [stat.ME] UPDATED)
    We propose a new method for supervised learning with multiple sets of features ("views"). The multi-view problem is especially important in biology and medicine, where "-omics" data such as genomics, proteomics and radiomics are measured on a common set of samples. Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor onset prediction and breast ductal carcinoma in situ and invasive breast cancer classification. Leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
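    The combination of a squared-error fit with an agreement penalty can be sketched for two views as follows. This is only an illustration under stated assumptions: the abstract does not give the exact objective, so the two-view form, the 1/2 scaling, and the names `cooperative_loss`, `pred_x`, `pred_z`, and `rho` are all hypothetical choices made here.

```python
def cooperative_loss(y, pred_x, pred_z, rho):
    """Squared-error fit of the summed view predictions plus an agreement penalty.

    y, pred_x, pred_z: equal-length sequences of floats (targets and the
    per-view predictions). rho >= 0 weights the agreement penalty.
    The 1/2 factors are an assumed scaling, not taken from the abstract.
    """
    if not (len(y) == len(pred_x) == len(pred_z)):
        raise ValueError("all inputs must have the same length")
    # Fit term: the two view predictions are added to form the final prediction.
    fit = sum((yi - fx - fz) ** 2 for yi, fx, fz in zip(y, pred_x, pred_z))
    # Agreement term: penalizes disagreement between the two views.
    agreement = sum((fx - fz) ** 2 for fx, fz in zip(pred_x, pred_z))
    return 0.5 * fit + 0.5 * rho * agreement
```

    With `rho = 0` the penalty vanishes and only the squared-error term remains; larger values of `rho` push the two view predictions toward each other.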
    Contrastive Regularization for Semi-Supervised Learning. (arXiv:2201.06247v2 [cs.LG] UPDATED)
    Consistency regularization on label predictions has become a fundamental technique in semi-supervised learning, but it still requires a large number of training iterations for high performance. In this study, we show that consistency regularization restricts the propagation of labeling information because samples with unconfident pseudo-labels are excluded from the model updates. We then propose contrastive regularization, which improves both the efficiency and the accuracy of consistency regularization through well-clustered features of unlabeled data. Specifically, after strongly augmented samples are assigned to clusters by their pseudo-labels, our contrastive regularization updates the model so that the features with confident pseudo-labels aggregate the features in the same cluster, while pushing away features in different clusters. As a result, the information of confident pseudo-labels can be effectively propagated into more unlabeled samples during training by the well-clustered features. On benchmarks of semi-supervised learning tasks, our contrastive regularization improves the previous consistency-based methods and achieves state-of-the-art results, especially with fewer training iterations. Our method also shows robust performance on open-set semi-supervised learning where unlabeled data includes out-of-distribution samples.
    Time Delay Estimation of Traffic Congestion Propagation based on Transfer Entropy. (arXiv:2108.06717v2 [stat.ML] UPDATED)
    Considering how congestion will propagate in the near future, understanding traffic congestion propagation has become crucial in GPS navigation systems for providing users with a more accurate estimated time of arrival (ETA). However, providing the exact ETA during congestion is a challenge owing to the complex propagation process between roads and high uncertainty regarding the future behavior of the process. Recent studies have focused on finding frequent congestion propagation patterns and determining the propagation probabilities. By contrast, this study proposes a novel time delay estimation method for traffic congestion propagation between roads using lag-specific transfer entropy (TE). Nonlinear normalization with a sliding window is used to effectively reveal the causal relationship between the source and target time series in calculating the TE. Moreover, Markov bootstrap techniques were adopted to quantify the uncertainty in the time delay estimator. To the best of our knowledge, the time delay estimation method presented in this article is the first to determine the time delay between roads for any congestion propagation pattern. The proposed method was validated using simulated data as well as real user trajectory data obtained from a major GPS navigation system applied in South Korea.
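    The idea of reading a time delay off lag-specific transfer entropy can be sketched with a plain plug-in estimator on discrete series. This is a simplified illustration, not the method of the paper: it omits the nonlinear normalization, the sliding window, and the Markov bootstrap, conditions on a single step of target history, and the function names and the synthetic data are assumptions made here.

```python
from collections import Counter
from math import log
import random


def transfer_entropy(source, target, lag):
    """Plug-in estimate (in nats) of transfer entropy from source to target.

    Conditions on one step of target history and on the source value
    `lag` steps back; both series must be discrete-valued.
    """
    if lag < 1 or len(source) != len(target):
        raise ValueError("lag must be >= 1 and series must have equal length")
    triples = Counter()  # counts of (y_t, y_{t-1}, x_{t-lag})
    for t in range(lag, len(target)):
        triples[(target[t], target[t - 1], source[t - lag])] += 1
    n = sum(triples.values())
    pairs_yy = Counter()  # counts of (y_t, y_{t-1})
    pairs_yx = Counter()  # counts of (y_{t-1}, x_{t-lag})
    hist = Counter()      # counts of y_{t-1}
    for (y, y_prev, x), c in triples.items():
        pairs_yy[(y, y_prev)] += c
        pairs_yx[(y_prev, x)] += c
        hist[y_prev] += c
    te = 0.0
    for (y, y_prev, x), c in triples.items():
        p_full = c / pairs_yx[(y_prev, x)]             # p(y_t | y_{t-1}, x_{t-lag})
        p_hist = pairs_yy[(y, y_prev)] / hist[y_prev]  # p(y_t | y_{t-1})
        te += (c / n) * log(p_full / p_hist)
    return te


def estimate_delay(source, target, max_lag):
    """Return the lag in 1..max_lag with the largest estimated transfer entropy."""
    return max(range(1, max_lag + 1),
               key=lambda k: transfer_entropy(source, target, k))


# Synthetic check: the target copies the source with a delay of two steps.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(2000)]
y = [0, 0] + x[:-2]
delay = estimate_delay(x, y, max_lag=5)
```

    On real congestion data the series would first have to be discretized, and a plug-in estimate of this kind is biased for short series, which is one reason an uncertainty estimate such as a bootstrap is useful.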
    Regret Bounds for Information-Directed Reinforcement Learning. (arXiv:2206.04640v1 [cs.LG])
    Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target such that the practitioners can balance the computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS which learns the whole environment under tabular finite-horizon MDPs. In addition, we propose a computationally-efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
    DORA: Exploring outlier representations in Deep Neural Networks. (arXiv:2206.04530v1 [cs.LG])
    Deep Neural Networks (DNNs) draw their power from the representations they learn. In recent years, however, researchers have found that DNNs, while being incredibly effective in learning complex abstractions, also tend to be infected with artifacts, such as biases, Clever Hanses (CH), or Backdoors, due to spurious correlations inherent in the training data. So far, existing methods for uncovering such artifactual and malicious behavior in trained models focus on finding artifacts in the input data, which requires both the availability of a data set and human intervention. In this paper, we introduce DORA (Data-agnOstic Representation Analysis): the first automatic data-agnostic method for the detection of potentially infected representations in Deep Neural Networks. We further show that contaminated representations found by DORA can be used to detect infected samples in any given dataset. We qualitatively and quantitatively evaluate the performance of our proposed method both in controlled toy scenarios and in real-world settings, where we demonstrate the benefit of DORA in safety-critical applications.
    Automatic Debiased Machine Learning for Dynamic Treatment Effects and General Nested Functionals. (arXiv:2203.13887v3 [econ.EM] UPDATED)
    We extend the idea of automated debiased machine learning to the dynamic treatment regime and more generally to nested functionals. We show that the multiply robust formula for the dynamic treatment regime with discrete treatments can be re-stated in terms of a recursive Riesz representer characterization of nested mean regressions. We then apply a recursive Riesz representer estimation learning algorithm that estimates de-biasing corrections without the need to characterize what the correction terms look like (for instance, products of inverse probability weighting terms), as is done in prior work on doubly robust estimation in the dynamic regime. Our approach defines a sequence of loss minimization problems, whose minimizers are the multipliers of the de-biasing correction, hence circumventing the need for solving auxiliary propensity models and directly optimizing for the mean squared error of the target de-biasing correction. We provide further applications of our approach to estimation of dynamic discrete choice models.
    Markovian Interference in Experiments. (arXiv:2206.02371v2 [cs.LG] UPDATED)
    We consider experiments in dynamical systems where interventions on some experimental units impact other units through a limiting constraint (such as a limited inventory). Despite outsize practical importance, the best estimators for this `Markovian' interference problem are largely heuristic in nature, and their bias is not well understood. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, apparently incur a large penalty in variance relative to state-of-the-art heuristics. We introduce an on-policy estimator: the Differences-In-Q's (DQ) estimator. We show that the DQ estimator can in general have exponentially smaller variance than off-policy evaluation. At the same time, its bias is second order in the impact of the intervention. This yields a striking bias-variance tradeoff so that the DQ estimator effectively dominates state-of-the-art alternatives. From a theoretical perspective, we introduce three separate novel techniques that are of independent interest in the theory of Reinforcement Learning (RL). Our empirical evaluation includes a set of experiments on a city-scale ride-hailing simulator.
    PyDTS: A Python Package for Discrete-Time Survival (Regularized) Regression with Competing Risks. (arXiv:2204.05731v2 [stat.ML] UPDATED)
    Time-to-event analysis (survival analysis) is used when the outcome or the response of interest is the time until a pre-specified event occurs. Time-to-event data are sometimes discrete, either because time itself is discrete or because failure times are grouped into intervals or measurements are rounded. In addition, the failure of an individual could be one of several distinct failure types, known as competing risks (events) data. This work focuses on discrete-time regression with competing events. We emphasize the main difference between the continuous and discrete settings with competing events, develop a new estimation procedure, and present PyDTS, an open source Python package which implements our estimation procedure and other tools for discrete-time-survival analysis with competing risks.
    On the Generalization and Adaption Performance of Causal Models. (arXiv:2206.04620v1 [cs.LG])
    Learning models that offer robust out-of-distribution generalization and fast adaptation is a key challenge in modern machine learning. Modelling causal structure into neural networks holds the promise to accomplish robust zero and few-shot adaptation. Recent advances in differentiable causal discovery have proposed to factorize the data generating process into a set of modules, i.e. one module for the conditional distribution of every variable where only causal parents are used as predictors. Such a modular decomposition of knowledge enables adaptation to distribution shifts by only updating a subset of parameters. In this work, we systematically study the generalization and adaption performance of such modular neural causal models by comparing them to monolithic models and structured models where the set of predictors is not constrained to causal parents. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes and offer robust generalization. We also found that the effects are more significant for sparser graphs as compared to denser graphs.
    Vector Optimization with Stochastic Bandit Feedback. (arXiv:2110.12311v3 [cs.LG] UPDATED)
    We introduce vector optimization problems with stochastic bandit feedback, which extends the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study an ($\epsilon,\delta$)-PAC Pareto set identification problem where an evaluation of each design yields a noisy observation of the mean reward vector. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We provide gap-dependent and worst-case lower bounds on the sample complexity and show that in the worst-case the sample complexity scales with the square of ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and the sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set and the success of identification.
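    How an ordering cone changes the Pareto set can be sketched for known mean vectors. The sketch assumes the cone is written as $C = \{z : Wz \geq 0\}$ for a matrix $W$ and that a design is dominated when another design's mean minus its own is a nonzero element of $C$; both conventions, and the names `in_cone` and `pareto_set`, are assumptions made here rather than details given in the abstract. It works with exact means and ignores the bandit (noisy sampling) part entirely.

```python
def in_cone(z, W, tol=1e-12):
    """Check whether vector z lies in the cone {z : W z >= 0} (row-wise)."""
    return all(sum(w * zi for w, zi in zip(row, z)) >= -tol for row in W)


def pareto_set(means, W):
    """Return indices of designs that no other design dominates under the cone."""
    keep = []
    for i, mu_i in enumerate(means):
        dominated = False
        for j, mu_j in enumerate(means):
            diff = [b - a for a, b in zip(mu_i, mu_j)]
            # Design i is dominated if mu_j - mu_i is a nonzero element of the cone.
            if j != i and any(d != 0 for d in diff) and in_cone(diff, W):
                dominated = True
                break
        if not dominated:
            keep.append(i)
    return keep
```

    With the identity matrix as `W` the cone is the positive orthant and the function returns the usual Pareto set; a wider cone encodes a stronger preference and typically shrinks the set.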
    Individually Fair Learning with One-Sided Feedback. (arXiv:2206.04475v1 [cs.LG])
    We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances. On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner, whose goal is to maximize accuracy while deploying individually fair policies. We first extend the framework of Bechavod et al. (2020), which relies on the existence of a human fairness auditor for detecting fairness violations, to instead incorporate feedback from dynamically-selected panels of multiple, possibly inconsistent, auditors. We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual combinatorial semi-bandit problem (Cesa-Bianchi & Lugosi, 2009; György et al., 2007). Finally, we show how to leverage the guarantees of two algorithms in the contextual combinatorial semi-bandit setting: Exp2 (Bubeck et al., 2012) and the oracle-efficient Context-Semi-Bandit-FTPL (Syrgkanis et al., 2016), to provide multi-criteria no regret guarantees simultaneously for accuracy and fairness. Our results eliminate two potential sources of bias from prior work: the "hidden outcomes" that are not available to an algorithm operating in the full information setting, and human biases that might be present in any single human auditor, but can be mitigated by selecting a well chosen panel.
    The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training. (arXiv:2007.12826v3 [stat.ML] UPDATED)
    Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by the one of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).
    Generalization and Robustness Implications in Object-Centric Learning. (arXiv:2107.00637v3 [cs.LG] UPDATED)
    The idea behind object-centric representation learning is that natural scenes can better be modeled as compositions of objects and their relations as opposed to distributed representations. This inductive bias can be injected into neural networks to potentially improve systematic generalization and performance of downstream tasks in scenes with multiple objects. In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets and evaluate segmentation metrics and downstream object property prediction. In addition, we study generalization and robustness by investigating the settings where either a single object is out of distribution -- e.g., having an unseen color, texture, or shape -- or global properties of the scene are altered -- e.g., by occlusions, cropping, or increasing the number of objects. From our experimental study, we find object-centric representations to be useful for downstream tasks and generally robust to most distribution shifts affecting objects. However, when the distribution shift affects the input in a less structured manner, robustness in terms of segmentation and downstream task performance may vary significantly across models and distribution shifts.
    Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint. (arXiv:2206.04569v1 [stat.ML])
    Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarially robust image classification are provided to support our theory.
    Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models. (arXiv:2206.04589v1 [cs.DS])
    We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\epsilon \sqrt{\log(1/\epsilon)})$. Similarly, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon \log(1/\epsilon))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.
    On Margins and Generalisation for Voting Classifiers. (arXiv:2206.04607v1 [cs.LG])
    We study the generalisation properties of majority voting on finite ensembles of classifiers, proving margin-based generalisation bounds via the PAC-Bayes theory. These provide state-of-the-art guarantees on a number of classification tasks. Our central results leverage the Dirichlet posteriors studied recently by Zantedeschi et al. [2021] for training voting classifiers; in contrast to that work our bounds apply to non-randomised votes via the use of margins. Our contributions add perspective to the debate on the "margins theory" proposed by Schapire et al. [1998] for the generalisation of ensemble classifiers.
    A Spectral Representation of Kernel Stein Discrepancy with Application to Goodness-of-Fit Tests for Measures on Infinite Dimensional Hilbert Spaces. (arXiv:2206.04552v1 [math.ST])
    Kernel Stein discrepancy (KSD) is a widely used kernel-based non-parametric measure of discrepancy between probability measures. It is often employed in the scenario where a user has a collection of samples from a candidate probability measure and wishes to compare them against a specified target probability measure. A useful property of KSD is that it may be calculated with samples from only the candidate measure and without knowledge of the normalising constant of the target measure. KSD has been employed in a range of settings including goodness-of-fit testing, parametric inference, MCMC output assessment and generative modelling. Two main issues with current KSD methodology are (i) the lack of applicability beyond the finite dimensional Euclidean setting and (ii) a lack of clarity on what influences KSD performance. This paper provides a novel spectral representation of KSD which remedies both of these, making KSD applicable to Hilbert-valued data and revealing the impact of kernel and Stein operator choice on the KSD. We demonstrate the efficacy of the proposed methodology by performing goodness-of-fit tests for various Gaussian and non-Gaussian functional models in a number of synthetic data experiments.
    Overcoming the Spectral Bias of Neural Value Approximation. (arXiv:2206.04672v1 [cs.LG])
    Value approximation using deep neural networks is at the heart of off-policy deep reinforcement learning, and is often the primary module that provides learning signals to the rest of the algorithm. While multi-layer perceptron networks are universal function approximators, recent works in neural kernel regression suggest the presence of a spectral bias, where fitting high-frequency components of the value function requires exponentially more gradient update steps than the low-frequency ones. In this work, we re-examine off-policy reinforcement learning through the lens of kernel regression and propose to overcome such bias via a composite neural tangent kernel. With just a single-line change, our approach, Fourier feature networks (FFN), produces state-of-the-art performance on challenging continuous control domains with only a fraction of the compute. Faster convergence and better off-policy stability also make it possible to remove the target network without suffering catastrophic divergences, which further reduces TD(0)'s estimation bias on a few tasks.
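    The kind of input encoding that Fourier feature approaches rely on can be sketched as a generic random Fourier feature map. This is a hedged illustration only: the abstract does not describe the FFN layer, so the fixed Gaussian frequency matrix, the $2\pi$ factor, and the names `make_fourier_features`, `num_freqs`, and `scale` are assumptions made here, and the layer used in the paper may differ (for example, its frequencies may be trained).

```python
import math
import random


def make_fourier_features(in_dim, num_freqs, scale=1.0, seed=0):
    """Build a map x -> [sin(2*pi*Bx), cos(2*pi*Bx)] with a fixed random matrix B.

    B has shape (num_freqs, in_dim) with i.i.d. Gaussian entries of standard
    deviation `scale`; a larger scale exposes higher input frequencies.
    """
    rng = random.Random(seed)
    B = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(num_freqs)]

    def features(x):
        if len(x) != in_dim:
            raise ValueError("input has the wrong dimension")
        proj = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x)) for row in B]
        # Sine block first, then cosine block: output length is 2 * num_freqs.
        return [math.sin(p) for p in proj] + [math.cos(p) for p in proj]

    return features
```

    In a value network, a map like this would be applied to the state (or state-action) vector before the first dense layer, which is one way a change of this kind can stay confined to a single line of model code.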
    What is a Good Metric to Study Generalization of Minimax Learners?. (arXiv:2206.04502v1 [stat.ML])
    Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in minimax settings, their generalization guarantees in the stochastic setting, i.e., how the solution trained on empirical data performs on the unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization, fails in simple examples of minimax problems. Furthermore, another popular metric, the primal-dual risk, also fails to characterize the generalization behavior for minimax problems with nonconvexity, due to non-existence of saddle points. We thus propose a new metric to study generalization of minimax learners: the primal gap, to circumvent these issues. Next, we derive generalization bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization bounds for primal risk and primal-dual risk in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms -- gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization.
    Deep Hierarchical Planning from Pixels. (arXiv:2206.04114v1 [cs.AI])
    Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.
    A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data. (arXiv:2206.04356v1 [stat.ML])
    Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.
    On Transfer Learning in Functional Linear Regression. (arXiv:2206.04277v1 [stat.ML])
    This work studies the problem of transfer learning under the functional linear model framework, which aims to improve the fit of the target model by leveraging the knowledge from related source models. We measure the relatedness between target and source models using Reproducing Kernel Hilbert Spaces, allowing the type of knowledge being transferred to be interpreted by the structure of the spaces. Two algorithms are proposed: one transfers knowledge when the index of transferable sources is known, while the other one utilizes aggregation to achieve knowledge transfer without prior information about the sources. Furthermore, we establish the optimal convergence rates for excess risk, making the statistical gain via transfer learning mathematically provable. The effectiveness of the proposed algorithms is demonstrated on synthetic data as well as real financial data.
    Choosing Answers in $\varepsilon$-Best-Answer Identification for Linear Bandits. (arXiv:2206.04456v1 [stat.ML])
    In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a "furthest answer" should be identified. Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt best-arm identification algorithms to tackle $\varepsilon$-best-answer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified best-arm identification algorithms.
    Uplifting Bandits. (arXiv:2206.04091v1 [stat.ML])
    We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline. We study multiple variants of the problem, including when the baseline and affected variables are unknown, and prove sublinear regret bounds for all of these. We also provide lower bounds that justify the necessity of our modeling assumptions. Experiments on synthetic and real-world datasets show the benefit of methods that estimate the uplifts over policies that do not use this structure.
    Regret Analysis of Certainty Equivalence Policies in Continuous-Time Linear-Quadratic Systems. (arXiv:2206.04434v1 [cs.LG])
    This work studies theoretical performance guarantees of a ubiquitous reinforcement learning policy for controlling the canonical model of stochastic linear-quadratic systems. We show that a randomized certainty equivalent policy addresses the exploration-exploitation dilemma for minimizing quadratic costs in linear dynamical systems that evolve according to stochastic differential equations. More precisely, we establish square-root-of-time regret bounds, indicating that the randomized certainty equivalent policy learns optimal control actions fast from a single state trajectory. Further, linear scaling of the regret with the number of parameters is shown. The presented analysis introduces novel and useful technical approaches, and sheds light on fundamental challenges of continuous-time reinforcement learning.
    A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networks. (arXiv:2206.04360v1 [cs.LG])
    We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(\mu)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $\mu$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in detail the application to two sets $F$: H{\"o}lder balls and multivariate monotonic functions. Besides matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).  ( 2 min )
    Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks. (arXiv:2206.04316v1 [cs.LG])
    Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together, i.e., adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show the label information can be efficiently backpropagated to the input while keeping the linear separability. Our theory and experimental evidence further show that the linear classifier trained with the adversarial noises of the training data can well classify the adversarial noises of the test data, indicating that adversarial noises actually inject a distributional perturbation to the original data distribution. Furthermore, we empirically demonstrate that the adversarial noises may become less linearly separable when the above conditions are compromised while they are still much easier to classify than original features.  ( 2 min )
    Evaluating Aleatoric Uncertainty via Conditional Generative Models. (arXiv:2206.04287v1 [cs.LG])
    Aleatoric uncertainty quantification seeks distributional knowledge of random responses, which is important for reliability analysis and robustness improvement in machine learning applications. Previous research on aleatoric uncertainty estimation mainly targets closed-form conditional densities or variances, which requires strong restrictions on the data distribution or dimensionality. To overcome these restrictions, we study conditional generative models for aleatoric uncertainty estimation. We introduce two metrics to measure the discrepancy between two conditional distributions that suit these models. Both metrics can be easily and unbiasedly computed via Monte Carlo simulation of the conditional generative models, thus facilitating their evaluation and training. We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies and can be used to train conditional models competitive against existing benchmarks.  ( 2 min )
    Exploring Predictive States via Cantor Embeddings and Wasserstein Distance. (arXiv:2206.04198v1 [cond-mat.stat-mech])
    Predictive states for stochastic processes are a nonparametric and interpretable construct with relevance across a multitude of modeling paradigms. Recent progress on the self-supervised reconstruction of predictive states from time-series data focused on the use of reproducing kernel Hilbert spaces. Here, we examine how Wasserstein distances may be used to detect predictive equivalences in symbolic data. We compute Wasserstein distances between distributions over sequences ("predictions"), using a finite-dimensional embedding of sequences based on the Cantor set for the underlying geometry. We show that exploratory data analysis using the resulting geometry via hierarchical clustering and dimension reduction provides insight into the temporal structure of processes ranging from the relatively simple (e.g., finite-state hidden Markov models) to the very complex (e.g., infinite-state indexed grammars).  ( 2 min )
    On Gradient Descent Convergence beyond the Edge of Stability. (arXiv:2206.04172v1 [cs.LG])
    Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability", where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability. In this work, we study a local condition for such an unstable convergence around a local minimum in a low-dimensional setting. We then leverage these insights to establish global convergence of a two-layer single-neuron ReLU student network aligning with the teacher neuron in a large learning rate beyond the Edge of Stability under population loss. Moreover, while the difference of norms of the two layers is preserved by gradient flow, we show that GD above the edge of stability induces a balancing effect, leading to the same norms across the layers.  ( 2 min )
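    The classical step-size threshold mentioned above can be illustrated on a one-dimensional quadratic. This is only a minimal sketch under stated assumptions: the function `gd_quadratic`, the curvature value, and the two step sizes are made up for illustration, and a quadratic simply diverges above the threshold 2/L, unlike the non-quadratic settings the paper analyses.

```python
def gd_quadratic(curvature, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = curvature * xs[-1]          # f'(x) = curvature * x
        xs.append(xs[-1] - step_size * grad)
    return xs


# Illustrative values (assumptions): the gradient of f is L-Lipschitz with L = curvature.
L = 4.0
stable = gd_quadratic(L, step_size=1.9 / L)    # below 2/L: |1 - step*L| = 0.9, iterates shrink
unstable = gd_quadratic(L, step_size=2.1 / L)  # above 2/L: |1 - step*L| = 1.1, iterates grow

print(abs(stable[-1]), abs(unstable[-1]))
```

    Each update multiplies the iterate by (1 - step_size * L), so the sign flips at every step and the magnitude shrinks or grows depending on which side of 2/L the step size lies.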
    CCP: Correlated Clustering and Projection for Dimensionality Reduction. (arXiv:2206.04189v1 [stat.ML])
    Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not need to solve any matrix. CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation based on sample correlations. Residue-Similarity (R-S) scores and indexes, the shape of data in Riemannian manifolds, and algebraic topology-based persistent Laplacian are introduced for visualization and analysis. Proposed methods are validated with benchmark datasets associated with various machine learning algorithms.  ( 2 min )
    Robust Matrix Completion with Heavy-tailed Noise. (arXiv:2206.04276v1 [math.ST])
    This paper studies low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise, where we aim to estimate an underlying low-rank matrix given a set of highly incomplete noisy entries. Though the matrix completion problem has attracted much attention in the past decade, there is still a lack of theoretical understanding when the observations are contaminated by heavy-tailed noise. Prior theory falls short of explaining the empirical results and is unable to capture the optimal dependence of the estimation error on the noise level. In this paper, we adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors when the parameter in the loss function is carefully designed to balance the Huberization biases and robustness to outliers. Then, we propose an efficient nonconvex algorithm via a balanced low-rank Burer-Monteiro matrix factorization and gradient descent with robust spectral initialization. We prove that under a merely bounded second moment condition on the error distributions, rather than the sub-Gaussian assumption, the Euclidean error of the iterates generated by the proposed algorithm decreases geometrically fast until achieving a minimax-optimal statistical estimation error, which has the same order as that in the sub-Gaussian case. The key technique behind this significant advancement is a powerful leave-one-out analysis framework. The theoretical results are corroborated by our simulation studies.  ( 2 min )
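    For reference, the standard Huber loss with robustification parameter $\tau > 0$ is written out below. The symbol $\tau$ is notation assumed here, and the abstract does not state how the paper adapts it, so this is only the textbook definition.

```latex
\ell_\tau(x) =
\begin{cases}
  \tfrac{1}{2} x^2, & |x| \le \tau, \\[4pt]
  \tau |x| - \tfrac{1}{2} \tau^2, & |x| > \tau .
\end{cases}
```

    Small residuals are penalised quadratically and large ones only linearly, so a larger $\tau$ reduces the bias relative to least squares while a smaller $\tau$ limits the influence of outliers.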
    Conformal Off-Policy Prediction in Contextual Bandits. (arXiv:2206.04405v1 [stat.ML])
    Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose \emph{conformal off-policy prediction} (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.  ( 2 min )
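    As background for the abstract above, plain split conformal prediction can be sketched in a few lines. This is not COPP: it is the standard building block for exchangeable data and does not handle the shift between behavioral and target policies. The function names and calibration scores are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for the requested coverage
    return sorted(scores)[rank - 1]


def predict_interval(point_prediction, scores, alpha):
    """Symmetric interval around a point prediction using absolute-residual scores."""
    q = conformal_quantile(scores, alpha)
    return point_prediction - q, point_prediction + q


# Hypothetical absolute residuals |y - f(x)| on a held-out calibration set.
calibration_scores = [0.5, 1.2, 0.3, 2.0, 0.8, 1.5, 0.1, 0.9, 1.1]
lower, upper = predict_interval(10.0, calibration_scores, alpha=0.25)
print(lower, upper)
```

    With exchangeable calibration and test points, the resulting interval covers the true outcome with probability at least 1 - alpha in finite samples, which is the kind of guarantee the abstract contrasts with asymptotic ones.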
    Words are all you need? Capturing human sensory similarity with textual descriptors. (arXiv:2206.04105v1 [cs.CL])
    Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding of images and videos. Yet, it remains unclear to what extent language can fully capture sensory experiences across different modalities. A well-established approach for characterizing sensory experiences relies on similarity judgments, namely, the degree to which people perceive two distinct stimuli as similar. We explore the relation between human similarity judgments and language in a series of large-scale behavioral studies ($N=1,823$ participants) across three modalities (images, audio, and video) and two types of text descriptors: simple word tags and free-text captions. In doing so, we introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general. We show that our prediction pipeline based on text descriptors exhibits excellent performance, and we compare it against a comprehensive array of 611 baseline models based on vision-, audio-, and video-processing architectures. We further show that the degree to which textual descriptors and models predict human similarity varies across and within modalities. Taken together, these studies illustrate the value of integrating machine learning and cognitive science approaches to better understand the similarities and differences between human and machine representations. We present an interactive visualization at https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the similarity between stimuli as experienced by humans and different methods reported in the paper.  ( 2 min )
    GCVAE: Generalized-Controllable Variational AutoEncoder. (arXiv:2206.04225v1 [stat.ML])
    Variational autoencoders (VAEs) have recently been used for unsupervised disentanglement learning of complex density distributions. Numerous variants exist to encourage disentanglement in latent space while improving reconstruction. However, none have simultaneously managed the trade-off between attaining extremely low reconstruction error and a high disentanglement score. We present a generalized framework to handle this challenge under constrained optimization and demonstrate that it outperforms state-of-the-art existing models as regards disentanglement while balancing reconstruction. We introduce three controllable Lagrangian hyperparameters to control reconstruction loss, KL divergence loss and correlation measure. We prove that maximizing information in the reconstruction network is equivalent to information maximization during amortized inference under reasonable assumptions and constraint relaxation.  ( 2 min )
    ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret. (arXiv:2206.04122v1 [cs.GT])
    Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability in the tabular case. We show that the variance of the estimated regret of a tabular version of ESCHER with an oracle value function is significantly lower than that of outcome sampling MCCFR and tabular DREAM with an oracle value function. We then show that a deep learning version of ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- and the difference becomes dramatic as game size increases.  ( 2 min )
    Analytical Composition of Differential Privacy via the Edgeworth Accountant. (arXiv:2206.04236v1 [cs.CR])
    Many modern machine learning algorithms are composed of simple private algorithms; thus, an increasingly important problem is to efficiently compute the overall privacy loss under composition. In this study, we introduce the Edgeworth Accountant, an analytical approach to composing differential privacy guarantees of private algorithms. The Edgeworth Accountant starts by losslessly tracking the privacy loss under composition using the $f$-differential privacy framework, which allows us to express the privacy guarantees using privacy-loss log-likelihood ratios (PLLRs). As the name suggests, this accountant next uses the Edgeworth expansion to upper and lower bound the probability distribution of the sum of the PLLRs. Moreover, by relying on a technique for approximating complex distributions using simple ones, we demonstrate that the Edgeworth Accountant can be applied to the composition of any noise-addition mechanism. Owing to certain appealing features of the Edgeworth expansion, the $(\epsilon, \delta)$-differential privacy bounds offered by this accountant are non-asymptotic, with essentially no extra computational cost, as opposed to prior approaches, wherein the running times increase with the number of compositions. Finally, we demonstrate that our upper and lower $(\epsilon, \delta)$-differential privacy bounds are tight in federated analytics and certain regimes of training private deep learning models.  ( 2 min )
    Applying separative non-negative matrix factorization to extra-financial data. (arXiv:2206.04350v1 [q-fin.CP])
    We present here an original application of the non-negative matrix factorization (NMF) method, for the case of extra-financial data. These data are subject to high correlations between co-variables, as well as between observations. NMF provides a much more relevant clustering of co-variables and observations than a simple principal component analysis (PCA). In addition, we show that an initial data separation step before applying NMF further improves the quality of the clustering.  ( 2 min )
    Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. (arXiv:2206.04119v1 [q-bio.BM])
    Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.  ( 2 min )
  • Open

    How much do reward engineers make?
    The biggest influence I had on the performance of a method was through the reward, and how and what components are weighted. In fact, this has had a bigger impact than fiddling with hyperparameters that couldn't be autotuned. It's the most intrinsic bias I've found to be effective at meeting time/compute constraints without compromising performance. In other words, whether a method worked or not depended on its reward. What's the demand for reward engineers? submitted by /u/XecutionStyle [link] [comments]  ( 1 min )

  • Open

    How service providers can use natural language processing to gain insights from customer tickets with Amazon Comprehend
    Today, customers can raise support tickets through multiple channels, such as web, mobile, chatbots, email, or phone calls. When a support ticket is raised by a customer, it is processed and assigned to a category based on the information provided in the ticket. It is then routed to the support group for resolution according to […]  ( 14 min )
    Incremental training with Amazon SageMaker JumpStart
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). SageMaker JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions […]  ( 9 min )
    How eMagazines utilizes Amazon Polly to voice articles for school-aged kids
    This is a guest post by Andrew Degenholtz, CEO and Founder of eMagazines, the parent company of ReadAlong.ai. eMagazines’ technology seamlessly transforms print products into premium digital and audio experiences. Leveraging Amazon technology, ReadAlong.ai offers a simple, turn-key way for publishers to add audio to their websites with a single line of code. eMagazines supports […]  ( 7 min )
    Weekly forecasts can now start on Sunday with Amazon Forecast
    We are excited to announce that in Amazon Forecast, you can now start your forecast horizon at custom starting points, including on Sundays for weekly forecasts. This allows you to more closely align demand planning forecasts to local business practices and operational requirements. Forecast is a fully managed service that uses statistical and machine learning […]  ( 6 min )
    Continuously monitor predictor accuracy with Amazon Forecast
    We’re excited to announce that you can now automatically monitor the accuracy of your Amazon Forecast predictors over time. As new data is provided, Forecast automatically computes predictor accuracy metrics, providing you with more information to decide whether to keep using, retrain, or create new predictors. Monitoring predictor quality and identifying deterioration in accuracy over […]  ( 9 min )
    Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot
    Data fuels machine learning (ML); the quality of data has a direct impact on the quality of ML models. Therefore, improving data quality and employing the right feature engineering techniques are critical to creating accurate ML models. ML practitioners often tediously iterate on feature engineering, choice of algorithms, and other aspects of ML in search […]  ( 10 min )
  • Open

    [R] Decentralized Training of Foundation Models in Heterogeneous Environments
    Paper: https://arxiv.org/abs/2206.01288 Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model …  ( 1 min )
    [R] Extreme Compression for Pre-trained Transformers Made Simple and Efficient - Microsoft 2022
    Paper: https://arxiv.org/abs/2206.01859 Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowledge distillation and lack a systematic study to show the effectiveness of their methods. In this paper, we perform a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find out that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks. ​ https://preview.redd.it/kgbjncheeo491.jpg?width=1187&format=pjpg&auto=webp&s=ffa0963f0c0dd9a2ab9163d5ec6dc8d43584ece0 https://preview.redd.it/7ioxmuqeeo491.jpg?width=577&format=pjpg&auto=webp&s=e2d5eec7274bcfe2c9bb66cb0b0256af5d4594a6 https://preview.redd.it/y5wcth7feo491.jpg?width=1151&format=pjpg&auto=webp&s=bb96b98862987cd2900d7d0906259190c3ad154b submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    New Insights on Infant Word Lear[N]ing - Implications for optimizing machine learning and second language learning
    https://community.chatwithastrid.com/aprendiendo-espanol-76iwwk5y/post/new-insight-on-how-babies-learn-words-M3dUgc6rrRb83ZD submitted by /u/InstrumentalAsylum [link] [comments]  ( 1 min )
    [D] Request for moderators
    If you frequently visit r/ml throughout the day, have a good understanding of the field, and a history of constructive comments/posts, then we need your help as a moderator. Please apply by sending us a modmail with the following info: your role (engineer, student, researcher, self-taught, etc.) and years of experience in ML; the amount of time available to spend on the sub (you must check the sub quite regularly throughout the day); and your time zone. We’re specifically looking for friendly people that have at least a year or two of experience in ML who understand the current research and industry landscape and who have been on r/ml long enough to understand what the community expects in terms of moderation. Thanks! submitted by /u/dojoteef [link] [comments]  ( 1 min )
    [R] Blazingly Fast Computer Vision Training with the Mosaic ResNet and Composer
    Hey all! MosaicML is excited to release the Mosaic ResNet, which trains to a 76.6% classification accuracy in 27 minutes, 7x faster than NVIDIA's ResNet baseline, using only vanilla PyTorch. These recipes modify the training algorithm; the network architecture is the same ResNet you’ve known and loved since 2015 (with updated anti-aliasing pooling via Blurpool). See all of the details in our blog post! The figure below summarizes our three training recipes (exact recipes available here). You can check out the complete results of the hundreds of training runs we conducted to create these recipes using Explorer, our tool for evaluating the efficiency of training algorithms. Comparison between best MosaicML ResNet-50 Recipe for a given Time & Accuracy (i.e. the Pareto frontier) to different baselines. Data collected on the MosaicML Cloud (8x NVIDIA A100). These results push on the interplay between algorithmic science and systems engineering, providing segmented cases for research like FFCV Dataloaders, Sharpness-Aware Minimization, and novel, MosaicML algorithms such as ColOut. MosaicML's release of "training recipes", which permit a user to trade off between accuracy and runtime. Want to verify our results? Want to beat ours? Or just want to speed up your own model training? Head over to our GitHub repo, https://github.com/mosaicml/composer, which enables this research, and star it ⭐️ to keep up with the latest updates! And stay tuned for a much deeper dive on all the details, a comprehensive write-up on the science and engineering of this work, next week! https://preview.redd.it/falrstlytn491.png?width=1498&format=png&auto=webp&s=e7eb5413816b18e2f13efb89681c5e451a41aa64 submitted by /u/moinnadeem [link] [comments]  ( 6 min )
    [D] G.Hinton's ML-driven explanation of the role of the sleep - inquiry about further sources.
    In a recent episode of Pieter Abbeel's "The Robot Brains" podcast, G. Hinton explains a fascinating hypothesis about the role of sleep in our lives ("sleep is the process of forgetting negative examples in human contrastive learning framework"). However, he does it in a very general way. Does anybody know where I could read more about that? Academic papers etc.? Reference: https://youtu.be/2EDP4v-9TUA submitted by /u/dtransposed [link] [comments]  ( 2 min )
    [R] More ethical machine learning using model cards at Wikimedia
    Abstract, 10 minute video, and transcript from May 2022 Apply(conf): First proposed by Mitchell et al. in 2018, model cards are a form of transparent reporting of machine learning models, their uses, and performance for public audiences. As part of a broader effort to strengthen our ethical approaches to machine learning at Wikimedia, we started implementing model cards for every model hosted by the Foundation. This talk is a description of our process, motivation, and lessons learned along the way. https://www.youtube.com/watch?v=t4GMq7MC7Js https://www.tecton.ai/apply/session-video-archive/more-ethical-machine-learning-using-model-card-at-wikimedia/ submitted by /u/Competitive_Travel16 [link] [comments]  ( 1 min )
    [P] Set-up Yolo-V5 distributed data-parallel multi GPU on AWS and Kubeflow
    Hi all, I've been trying to set up DDP multi-node training on AWS for a week and am finally able to make it work. I didn't find any resources on this, so I thought I would write a blog post and share it. Please provide feedback and let me know if this is helpful for you: https://medium.com/@sachinchandra/running-yolo-v5-with-ddp-on-aws-8a4f07a77cf submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [R] Can machine learning make side-channel attacks even stronger?
    Twitter thread: https://twitter.com/jackcook36/status/1534920169369309184 Paper: https://jackcook.github.io/bigger-fish/paper.pdf Key findings: Machine learning can be used to identify activity on your computer from traces recorded in JavaScript that measure CPU instruction throughput over time. We found this type of attack exploits signals from system interrupts, which operating systems use to interact with hardware devices. When a core processes interrupts, it pauses the execution of an attacker, creating a signal that can be exploited. Our loop-counting attack can correctly identify one of 100 websites being opened 96.6% of the time in Chrome on Linux. We identified a randomized timer mitigation that reduces our attack’s accuracy to near chance. Please let me know if you have any feedback or questions! submitted by /u/jackcook [link] [comments]  ( 1 min )
    [D] Benchmark Object Detection Hyperparameters
    I want to conduct benchmark experiments: Faster R-CNN vs YOLOv3 vs YOLOv4 vs YOLOv5. For that reason, I want to fix the hyperparameters: optimizer, learning rate, weight decay and learning rate scheduler. For the optimizer, due to different frameworks, I have to go with ADAM (b0=0.9, b1=0.999, eps=1e-7). What parameters should I choose for weight decay and the learning rate scheduler, given that different models converge at different epochs/steps? Should I go with cosine decay, manual step (with 0.1 decay at 80 and 90% of total epochs/steps), or something else? Note: different frameworks have different "default" hyperparameters, so maybe I should stick to the defaults? submitted by /u/giakou4 [link] [comments]  ( 1 min )
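    The two schedules named in the question can be written as framework-independent functions of training progress, which makes it possible to apply the same schedule to models with different total step counts. A minimal sketch, assuming a hypothetical base learning rate of 1e-3; the function names are illustrative and not taken from any of the frameworks mentioned.

```python
import math


def cosine_decay(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def step_decay(step, total_steps, base_lr, milestones=(0.8, 0.9), gamma=0.1):
    """Multiply the rate by gamma at each milestone, given as a fraction of total_steps."""
    progress = step / total_steps
    drops = sum(1 for m in milestones if progress >= m)
    return base_lr * gamma ** drops


# Hypothetical settings: 100 steps, base learning rate 1e-3.
for step in (0, 50, 85, 95):
    print(step, cosine_decay(step, 100, 1e-3), step_decay(step, 100, 1e-3))
```

    Expressing milestones as fractions rather than absolute epochs keeps the comparison consistent when one detector trains for many more steps than another.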
    [D] Use conversational AI based on GPT-J/GPT-NeoX in Discord
    Hello all, It is very easy to build a chatbot in a Discord server thanks to great AI models like GPT-3, GPT-J, and GPT-NeoX. In this article, I show you how to code your own conversational bot in Node.js by using GPT-J and GPT-NeoX through the NLP Cloud API: https://nlpcloud.io/build-gpt-j-gpt-neox-discord-chatbot-with-nlpcloud.html As you might know, these AI models are "stateless", meaning that they can't remember the chat history. So I am showing how to handle this by automatically re-sending the chat history in each request, and by truncating the history when it is too long. If you have questions please don't hesitate to ask. I hope it will be useful! Julien submitted by /u/juliensalinas [link] [comments]  ( 1 min )
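    The history handling described in the post can be sketched independently of any API. The snippet below is an illustration only, written in Python rather than the article's Node.js: `build_prompt`, the speaker labels, and the character budget are assumptions, it makes no call to NLP Cloud, and a real bot would more likely budget in tokens.

```python
def build_prompt(history, user_message, max_chars=200):
    """Build a prompt from past turns plus the new message, dropping the oldest turns first.

    history is a list of (speaker, text) tuples, oldest first.
    """
    turns = history + [("User", user_message)]
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    # Truncate from the front until the prompt fits, always keeping the newest message.
    while len(lines) > 1 and len("\n".join(lines)) > max_chars:
        lines.pop(0)
    return "\n".join(lines) + "\nBot:"


# Hypothetical conversation: two long earlier turns and a short new message.
history = [("User", "a" * 100), ("Bot", "b" * 100)]
prompt = build_prompt(history, "hello", max_chars=150)
print(prompt)
```

    Because the model is stateless, the whole prompt is rebuilt and re-sent on every request; the caller appends the model's reply to `history` afterwards.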
    [P] Virtual Background project (feat. The Rock with Alpacas) with PyTorch Implementation
    ​ The Rock with Alpaca Hey Guys! Recently, I worked on a side project that generates virtual background (like the one in Zoom) with semantic segmentation. I used BiSeNet as a base model. My goal was to implement everything from scratch without using any fancy libraries and it works pretty well! You can test it on either a single image or a real-time webcam. Feel free to leave comments for any feedback! ​ Project GitHub Repo: Link BiSeNet detailed review: my blog If you want to see other research paper implementations check out my repo! submitted by /u/JasonTheCoders [link] [comments]  ( 1 min )
    [Discussion] Fine tune model for long context
    How can I train GPT or BERT on long contexts, where the context length is more than 1024 tokens? Truncating the context is not an option as the complete context is important. One approach that I can think of is dividing the context into multiple chunks. What are my other options? submitted by /u/Expert-Departure-236 [link] [comments]  ( 1 min )
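    The chunking approach mentioned in the question is often implemented as a sliding window with overlap, so that text near a chunk boundary appears in two chunks. A minimal sketch over a list of token ids; the function name and the overlap of 128 are assumptions, and how the per-chunk outputs are combined afterwards is left open.

```python
def chunk_tokens(tokens, max_len=1024, overlap=128):
    """Split a token sequence into windows of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    stride = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += stride
    return chunks


# Hypothetical input: 2500 token ids.
tokens = list(range(2500))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

    Every token lands in at least one chunk, and consecutive chunks repeat the last `overlap` tokens of the previous one.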
    [R] The Annotated Diffusion Model
    From the Hugging Face post: https://huggingface.co/blog/annotated-diffusion A great new article has joined the Annotated series: The Annotated Transformer http://nlp.seas.harvard.edu/2018/04/03/attention.html, The Annotated GPT-2 https://amaarora.github.io/2020/02/18/annotatedGPT2.html submitted by /u/ghosthamlet [link] [comments]  ( 2 min )
  • Open

    Face of the night (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    Brave Heart (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    AI Dream 56 - Post-Apocalyptic WarDepression by AI
    submitted by /u/LordPewPew777 [link] [comments]
    AI to create similar videos based on input
    Hi guys! Does someone know if there's an app or something similar to create similar videos based on (multiple) video inputs? Wishes! submitted by /u/kreismeis [link] [comments]
    Researchers Built a Neural Network That Not Only Solves but Explains and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level
    👉 They took a neural network pre-trained on text and fine-tuned it on code to answer mathematics course problems, explain solutions, and produce new questions at a human level. It automatically synthesizes programs and runs them to answer course problems with 81 percent automated accuracy, using few-shot learning and OpenAI’s Codex transformer. 👉 They also curated a new dataset of questions from MIT’s largest mathematics courses. The neural network answers questions from the MATH dataset (including questions on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), which is the current standard benchmark of advanced mathematics problems meant to examine mathematical reasoning. Continue reading | Check out the paper and github https://preview.redd.it/3pq9vu2fnm491.png?width=1198&format=png&auto=webp&s=ab9425a5130d52a1c0d9cfc525bac546eccfec57 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    A.I. Coding Overview
    submitted by /u/a1a3a5a7a9 [link] [comments]
    List of (free) GAN / generative AI apps and playgrounds
    submitted by /u/nathan_thinks [link] [comments]  ( 1 min )
    I’ve created a fashion blogger bot you can freely chat with
    Hey guys, I’ve just made a language model that behaves like an average person (I hope). She perceives herself as a fashion blogger. I’ve made her capable of chatting about fashion and music topics. I would be over the moon if you could test it and share your feedback in the comments regarding its ability to support open-domain dialogue. Here is the bot -- just tap submitted by /u/GBalchidi [link] [comments]  ( 1 min )
    CYBERHOLIC BLISS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    The last line killed me (I hope this is within the parameters, it's GPT-3)
    ​ https://preview.redd.it/dndbky83jl491.png?width=2402&format=png&auto=webp&s=063f10ffa4f18b215d2ff87aa9c3f0623ada32b9 submitted by /u/thatgerhard [link] [comments]
    looking for ai that generates sexual content
    Hi, I am looking for some kind of software that can make randomly generated sexual pictures. NOT of existing people (like the "undressing" apps). The sexual content must be completely artificially generated. Thanks submitted by /u/insert_username--- [link] [comments]
    Build a Discord chatbot based on GPT-J/GPT-NeoX
    Hello all, it is very easy to build a chatbot in a Discord server thanks to great AI models like GPT-3, GPT-J, and GPT-NeoX. In this article, I'm showing you how to code your own conversational bot in Node.js by using GPT-J and GPT-NeoX through the NLP Cloud API: https://nlpcloud.io/build-gpt-j-gpt-neox-discord-chatbot-with-nlpcloud.html As you might know, these AI models are "stateless", meaning that they can't remember the chat history. So I am showing how to handle this by automatically re-sending the chat history in each request, and by truncating the history when it is too long. If you have questions please don't hesitate to ask. I hope it will be useful! Julien submitted by /u/juliensalinas [link] [comments]  ( 1 min )
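The history handling described in the post (re-send the conversation on every request, drop the oldest turns when it gets too long) might look roughly like this. It is a sketch under stated assumptions, written in Python rather than the Node.js used in the linked article: `truncate_history`, `build_prompt`, the `User:`/`Bot:` prompt layout, and the character budget used as a stand-in for a token limit are all hypothetical, and no NLP Cloud API call is shown.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (user message, bot reply)


def truncate_history(history: List[Turn], max_chars: int) -> List[Turn]:
    """Keep the most recent turns whose combined length fits in max_chars."""
    kept: List[Turn] = []
    total = 0
    # Walk backwards so the newest turns are kept first.
    for user_msg, bot_msg in reversed(history):
        size = len(user_msg) + len(bot_msg)
        if total + size > max_chars:
            break
        kept.append((user_msg, bot_msg))
        total += size
    kept.reverse()  # restore chronological order
    return kept


def build_prompt(history: List[Turn], new_message: str) -> str:
    """Render the kept history plus the new message as one prompt string."""
    lines = []
    for user_msg, bot_msg in history:
        lines.append(f"User: {user_msg}")
        lines.append(f"Bot: {bot_msg}")
    lines.append(f"User: {new_message}")
    lines.append("Bot:")
    return "\n".join(lines)


history = [
    ("Hi", "Hello!"),
    ("How are you?", "Fine, thanks."),
    ("What is GPT-J?", "An open-source language model."),
]
recent = truncate_history(history, max_chars=70)
prompt = build_prompt(recent, "Who trained it?")
```

A real bot would measure the budget in tokens rather than characters, since the model's limit is expressed in tokens.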
    DISCO DIFFUSION 3D AI ART ANIMATION | ANGEL OF DEATH, AZRAEL
    submitted by /u/Available_Tadpole829 [link] [comments]
    A Samurai Story, DISCO DIFFUSION V5.2 3D animation (using both image and text prompts) OC
    submitted by /u/crabmansboxturtle [link] [comments]  ( 1 min )
    DISCO DIFFUSION 3D AI ART ANIMATION | EXTRATERRESTRIAL ESCAPADE
    submitted by /u/Available_Tadpole829 [link] [comments]
    Thor in Battle - Neural-Art Parody / [4K] Creative Experiment w/ GPT-3, VQGAN+CLIP
    submitted by /u/MLInsights [link] [comments]
  • Open

    LIMoE: Learning Multiple Modalities with One Sparse Mixture of Experts Model
    Posted by Basil Mustafa, Research Software Engineer and Carlos Riquelme, Research Scientist, Google Research, Brain team Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network. This has many benefits. First, model size can increase while keeping computational cost constant — an effective and environmentally friendlier way to scale models, which is often key to high performance. Sparsity also naturally compartmentalizes neural networks. Dense models that learn many different tasks simultaneously (multitask) or sequentially (continual learning) of…  ( 9 min )
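The routing idea in the teaser above (each input is sent to one "expert" instead of through the whole network) can be illustrated with a toy top-1 router. This is only a sketch under assumptions: the names `route_top1` and `sparse_layer`, the linear router, and the two lambda experts are invented for the example and are not taken from LIMoE.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_top1(router_logits: Sequence[float]) -> Tuple[int, float]:
    """Return the index of the highest-scoring expert and its gate value."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def sparse_layer(
    x: List[float],
    router_weights: List[List[float]],
    experts: List[Callable[[List[float]], List[float]]],
) -> Tuple[int, List[float]]:
    """Run only the selected expert on x and scale its output by the gate."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    idx, gate = route_top1(logits)
    return idx, [gate * y for y in experts[idx](x)]


# Two toy experts and a hypothetical identity router.
experts = [lambda v: [2 * t for t in v], lambda v: [-t for t in v]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

idx_a, out_a = sparse_layer([3.0, 1.0], router_weights, experts)
idx_b, out_b = sparse_layer([0.0, 5.0], router_weights, experts)
```

Only the chosen expert is evaluated for each input, which is why the parameter count can grow with the number of experts while the per-input compute stays roughly constant.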
  • Open

    Techniques for Training Large Neural Networks
    submitted by /u/nickb [link] [comments]
    What impacts the speed of prediction for ANNs?
    I am currently building several ANNs to approximate lengthy PDE calculations. I am curious as to how one can minimise prediction time when it comes to hyperparameter optimization. Is it best to minimise the number of weight parameters in the model (I know this benefits storage), or is it best to minimise the number of layers? Any help would be appreciated, cheers! submitted by /u/Algo-G-H [link] [comments]  ( 1 min )
  • Open

    Student-powered machine learning
    Recent MEng graduates reflect on their application-focused research as affiliates of the MIT-IBM Watson AI Lab.  ( 7 min )
  • Open

    Techniques for Training Large Neural Networks
    Large neural networks are at the core of many recent advances in AI, but training them is a difficult engineering and research challenge which requires orchestrating a cluster of GPUs to perform a single synchronized calculation. As cluster and model sizes have grown, machine learning practitioners have developed an increasing  ( 6 min )
  • Open

    Smartgrids and Reinforcement Learning
    Hi everyone, are you interested in #smartgrids with #reinforcementlearning? Here is something to share ✋ #LittleBigCity, a new open-source project for smartgrids, has recently emerged. It is inspired by #CityLearn, which focuses solely on the customer side of the smartgrid; LittleBigCode and Paul-Adrien Nicole created #LittleBigCity to address this limitation. It is a new open-source simulator that generates a two-sided smartgrid: the CityLearn-inspired consumer side and the producer side from LittleBigCity. With Streamlit, they have also added a way to view the smartgrid's changes in real time 🥳 We welcome pull requests on both the simulator and reinforcement learning sides. Feel free to drop by and share the information with your network 🌎 #smartgrid for the future 🥇 Gitlab LINK: https://gitlab.com/littlebigcode/public/littlebigcity CityLearn authors: José Ramón Vázquez Canteli & Zoltan Nagy LittleBigCity authors: Johan Jublanc, Paul-Adrien Nicole submitted by /u/SimonSoftEng [link] [comments]  ( 1 min )
    Schmidhuber notes 25th anniversary of LSTM
    submitted by /u/gwern [link] [comments]  ( 1 min )
    RL topics for MS research.
    I was wondering what research areas to explore for a master's thesis. I'm thinking about research problems that are on the implementation side rather than on the theoretical side of RL. Goal-conditioned RL and autotelic agents are some of the interesting areas to explore. In terms of implementation, what areas are worth looking at for a thesis? submitted by /u/thisisdespaleo [link] [comments]  ( 1 min )
  • Open

    How to Write a Thank You Letter for a Scholarship with the Help Of AI
    It’s no secret that writing a thank you letter can be difficult. You want to express your gratitude, but you also don’t want to sound too…  ( 4 min )
    Conversational AI at Ludicrous Speed
    The Problem  ( 7 min )
  • Open

    Out of This World: ‘Mass Effect Legendary Edition’ and ‘It Takes Two’ Lead GFN Thursday Updates
    Some may call this GFN Thursday legendary as Mass Effect Legendary Edition and It Takes Two join the GeForce NOW library. Both games expand the available number of Electronic Arts games streaming from our GeForce cloud servers, and are part of 10 new additions this week. Adventure Awaits In The Cloud Relive the saga of Read article > The post Out of This World: ‘Mass Effect Legendary Edition’ and ‘It Takes Two’ Lead GFN Thursday Updates appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    What do we learn? Debunking the Myth of Unsupervised Outlier Detection. (arXiv:2206.03698v1 [cs.CV])
    Even though auto-encoders (AEs) have the desirable property of learning compact representations without labels and have been widely applied to out-of-distribution (OoD) detection, they are generally still poorly understood and are used incorrectly in detecting outliers where the normal and abnormal distributions are strongly overlapping. In general, the learned manifold is assumed to contain key information that is only important for describing samples within the training distribution, so that the reconstruction of outliers leads to high residual errors. However, recent work suggests that AEs are likely to be even better at reconstructing some types of OoD samples. In this work, we challenge this assumption and investigate what auto-encoders actually learn when they are tasked with solving two different problems. First, we propose two metrics based on the Fréchet inception distance (FID) and confidence scores of a trained classifier to assess whether AEs can learn the training distribution and reliably recognize samples from other domains. Second, we investigate whether AEs are able to synthesize normal images from samples with abnormal regions, on a more challenging lung pathology detection task. We have found that state-of-the-art (SOTA) AEs are either unable to constrain the latent manifold and allow reconstruction of abnormal patterns, or they fail to accurately restore the inputs from their latent distribution, resulting in blurred or misaligned reconstructions. We propose novel deformable auto-encoders (MorphAEus) to learn perceptually aware global image priors and locally adapt their morphometry based on estimated dense deformation fields. We demonstrate superior performance over unsupervised methods in detecting OoD and pathology.  ( 2 min )
    Machine learning-based patient selection in an emergency department. (arXiv:2206.03752v1 [cs.LG])
    The performance of Emergency Departments (EDs) is of great importance for any health care system, as they serve as the entry point for many patients. However, among other factors, the variability of patient acuity levels and corresponding treatment requirements of patients visiting EDs imposes significant challenges on decision makers. Balancing waiting times of patients to be first seen by a physician with the overall length of stay over all acuity levels is crucial to maintain an acceptable level of operational performance for all patients. To address those requirements when assigning idle resources to patients, several methods have been proposed in the past, including the Accumulated Priority Queuing (APQ) method. The APQ method linearly assigns priority scores to patients with respect to their time in the system and acuity level. Hence, selection decisions are based on a simple system representation that is used as an input for a selection function. This paper investigates the potential of a Machine Learning (ML)-based patient selection method. It assumes that for a large set of training data, including a multitude of different system states, (near) optimal assignments can be computed by a (heuristic) optimizer, with respect to a chosen performance metric, and aims to imitate such optimal behavior when applied to new situations. Thereby, it incorporates a comprehensive state representation of the system and a complex non-linear selection function. The motivation for the proposed approach is that high quality selection decisions may depend on a variety of factors describing the current state of the ED, not limited to waiting times, which can be captured and utilized by the ML model. Results show that the proposed method significantly outperforms the APQ method for a majority of evaluated settings.  ( 2 min )
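The APQ baseline described in this abstract (priority grows linearly with time in the system, at a rate that depends on acuity) can be sketched as below. The `Patient` class, the function names, and the rate values are hypothetical and chosen only for illustration; the paper's actual parameters are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Patient:
    name: str
    acuity: int     # 1 = most urgent (assumed convention)
    arrival: float  # arrival time in minutes


def accumulated_priority(p: Patient, now: float, rates: Dict[int, float]) -> float:
    """Priority accumulated linearly since arrival, at an acuity-specific rate."""
    return rates[p.acuity] * (now - p.arrival)


def select_next(waiting: List[Patient], now: float, rates: Dict[int, float]) -> Patient:
    """Pick the waiting patient with the highest accumulated priority."""
    return max(waiting, key=lambda p: accumulated_priority(p, now, rates))


# Hypothetical accumulation rates: more urgent patients gain priority faster.
rates = {1: 4.0, 2: 2.0, 3: 1.0}
waiting = [
    Patient("A", acuity=3, arrival=0.0),   # priority 1.0 * 60 = 60
    Patient("B", acuity=2, arrival=20.0),  # priority 2.0 * 40 = 80
    Patient("C", acuity=1, arrival=50.0),  # priority 4.0 * 10 = 40
]
chosen = select_next(waiting, now=60.0, rates=rates)
```

The selection depends on only two features per patient, which is the "simple system representation" that the ML-based method in the abstract aims to replace with a richer state.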
    Latent Boundary-guided Adversarial Training. (arXiv:2206.03717v1 [cs.LG])
    Deep Neural Networks (DNNs) have recently achieved great success in many classification tasks. Unfortunately, they are vulnerable to adversarial attacks that generate adversarial examples with a small perturbation to fool DNN models, especially in model sharing scenarios. Adversarial training is proved to be the most effective strategy that injects adversarial examples into model training to improve the robustness of DNN models to adversarial attacks. However, adversarial training based on the existing adversarial examples fails to generalize well to standard, unperturbed test data. To achieve a better trade-off between standard accuracy and adversarial robustness, we propose a novel adversarial training framework called LAtent bounDary-guided aDvErsarial tRaining (LADDER) that adversarially trains DNN models on latent boundary-guided adversarial examples. As opposed to most of the existing methods that generate adversarial examples in the input space, LADDER generates a myriad of high-quality adversarial examples through adding perturbations to latent features. The perturbations are made along the normal of the decision boundary constructed by an SVM with an attention mechanism. We analyze the merits of our generated boundary-guided adversarial examples from a boundary field perspective and visualization view. Extensive experiments and detailed analysis on MNIST, SVHN, CelebA, and CIFAR-10 validate the effectiveness of LADDER in achieving a better trade-off between standard accuracy and adversarial robustness as compared with vanilla DNNs and competitive baselines.  ( 2 min )
    Two Ways of Understanding Social Dynamics: Analyzing the Predictability of Emergence of Objects in Reddit r/place Dependent on Locality in Space and Time. (arXiv:2206.03563v1 [physics.soc-ph])
    Lately, studying social dynamics in interacting agents has been boosted by the power of computer models, which bring the richness of qualitative work, while offering the precision, transparency, extensiveness, and replicability of statistical and mathematical approaches. A particular set of phenomena for the study of social dynamics is Web collaborative platforms. A dataset of interest is r/place, a collaborative social experiment held in 2017 on Reddit, which consisted of a shared online canvas of 1000 pixels by 1000 pixels co-edited by over a million recorded users over 72 hours. In this paper, we designed and compared two methods to analyze the dynamics of this experiment. Our first method consisted in approximating the set of 2D cellular-automata-like rules used to generate the canvas images and how these rules change over time. The second method consisted in a convolutional neural network (CNN) that learned an approximation to the generative rules in order to generate the complex outcomes of the canvas. Our results indicate varying context-size dependencies for the predictability of different objects in r/place in time and space. They also indicate a surprising peak in difficulty to statistically infer behavioral rules towards the middle of the social experiment, while user interactions did not drop until before the end. The combination of our two approaches, one rule-based and the other statistical CNN-based, shows the ability to highlight diverse aspects of analyzing social dynamics.  ( 2 min )
    Learning Interpretable Decision Rule Sets: A Submodular Optimization Approach. (arXiv:2206.03718v1 [cs.LG])
    Rule sets are highly interpretable logical models in which the predicates for decision are expressed in disjunctive normal form (DNF, OR-of-ANDs), or, equivalently, the overall model comprises an unordered collection of if-then decision rules. In this paper, we consider a submodular optimization based approach for learning rule sets. The learning problem is framed as a subset selection task in which a subset of all possible rules needs to be selected to form an accurate and interpretable rule set. We employ an objective function that exhibits submodularity and thus is amenable to submodular optimization techniques. To overcome the difficulty arising from the exponential-sized ground set of rules, the subproblem of searching for a rule is cast as another subset selection task that asks for a subset of features. We show it is possible to write the induced objective function for the subproblem as a difference of two submodular (DS) functions to make it approximately solvable by DS optimization algorithms. Overall, the proposed approach is simple, scalable, and likely to benefit from further research on submodular optimization. Experiments on real datasets demonstrate the effectiveness of our method.  ( 2 min )
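To make the "subset selection with a submodular objective" framing concrete, here is a generic greedy selection over a small, explicit set of candidate rules. It is not the paper's algorithm: the objective is plain coverage of positive examples (a standard monotone submodular function) swapped in for illustration, the DS subproblem is not shown, and the names `greedy_select` and `candidates` are assumptions.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[int]], k: int) -> List[str]:
    """Greedily pick up to k rules, each time adding the rule that covers
    the most not-yet-covered examples (marginal coverage gain)."""
    selected: List[str] = []
    covered: Set[int] = set()
    for _ in range(k):
        best, best_gain = None, 0
        for name, cov in candidates.items():
            if name in selected:
                continue
            gain = len(cov - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # no remaining rule adds coverage
        selected.append(best)
        covered |= candidates[best]
    return selected


# Hypothetical rules mapped to the indices of positive examples they cover.
candidates = {
    "r1": {0, 1, 2, 3},
    "r2": {2, 3, 4},
    "r3": {4, 5},
    "r4": {0, 1},
}
rule_set = greedy_select(candidates, k=2)
```

For monotone submodular objectives under a cardinality constraint, this greedy rule is the classic approach with a (1 - 1/e) approximation guarantee; the paper's own objective also accounts for accuracy and interpretability, which this sketch leaves out.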
    How does overparametrization affect performance on minority groups?. (arXiv:2206.03515v1 [cs.LG])
    The benefits of overparameterization for the overall performance of modern machine learning (ML) models are well known. However, the effect of overparameterization at a more granular level of data subgroups is less understood. Recent empirical studies demonstrate encouraging results: (i) when groups are not known, overparameterized models trained with empirical risk minimization (ERM) perform better on minority groups; (ii) when groups are known, ERM on data subsampled to equalize group sizes yields state-of-the-art worst-group-accuracy in the overparameterized regime. In this paper, we complement these empirical studies with a theoretical investigation of the risk of overparameterized random feature models on minority groups. In a setting in which the regression functions for the majority and minority groups are different, we show that overparameterization always improves minority group performance.  ( 2 min )
    Joint Adversarial Learning for Cross-domain Fair Classification. (arXiv:2206.03656v1 [cs.LG])
    Modern machine learning (ML) models are becoming increasingly popular and are widely used in decision-making systems. However, studies have shown critical issues of ML discrimination and unfairness, which hinder their adoption in high-stakes applications. Recent research on fair classifiers has drawn significant attention to developing effective algorithms that achieve fairness and good classification performance. Despite the great success of these fairness-aware machine learning models, most of the existing models require sensitive attributes to preprocess the data, regularize the model learning or postprocess the prediction to have fair predictions. However, sensitive attributes are often incomplete or even unavailable due to privacy, legal or regulatory restrictions. Though we lack the sensitive attribute for training a fair model in the target domain, there might exist a similar domain that has sensitive attributes. Thus, it is important to exploit auxiliary information from the similar domain to help improve fair classification in the target domain. Therefore, in this paper, we study a novel problem of exploring domain adaptation for fair classification. We propose a new framework that can simultaneously estimate the sensitive attributes while learning a fair classifier in the target domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model for fair classification, even when no sensitive attributes are available in the target domain.  ( 2 min )
    $p$-Sparsified Sketches for Fast Multiple Output Kernel Methods. (arXiv:2206.03827v1 [stat.ML])
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists of looking for solutions within a subspace of reduced dimension, is a widely studied approach to alleviate this numerical burden. However, fast sketching strategies, such as non-adaptive subsampling, significantly degrade the guarantees of the algorithms, while theoretically accurate sketches, such as the Gaussian one, remain relatively slow in practice. In this paper, we introduce the $p$-sparsified sketches, which combine the benefits of both approaches to achieve a good tradeoff between statistical accuracy and computational efficiency. To support our method, we derive excess risk bounds for both single and multiple output problems, with generic Lipschitz losses, providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. We also provide empirical evidence of the superiority of our sketches over recent SOTA approaches.  ( 2 min )
    Disentangled Ontology Embedding for Zero-shot Learning. (arXiv:2206.03739v1 [cs.AI])
    Knowledge Graphs (KGs) and their ontology variants have been widely used for knowledge representation, and have been shown to be quite effective in augmenting Zero-shot Learning (ZSL). However, existing ZSL methods that utilize KGs all neglect the intrinsic complexity of inter-class relationships represented in KGs. One typical feature is that a class is often related to other classes in different semantic aspects. In this paper, we focus on ontologies for augmenting ZSL, and propose to learn disentangled ontology embeddings guided by ontology properties to capture and utilize more fine-grained class relationships in different aspects. We also contribute a new ZSL framework named DOZSL, which contains two new ZSL solutions based on generative models and graph propagation models, respectively, for effectively utilizing the disentangled ontology embeddings. Extensive evaluations have been conducted on five benchmarks across zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). DOZSL often achieves better performance than the state-of-the-art, and its components have been verified by ablation studies and case studies. Our code and datasets are available at https://github.com/zjukg/DOZSL.  ( 2 min )
    Neural Network Compression via Effective Filter Analysis and Hierarchical Pruning. (arXiv:2206.03596v1 [cs.LG])
    Network compression is crucial to making deep networks more efficient, faster, and generalizable to low-end hardware. Current network compression methods have two open problems: first, there is no theoretical framework to estimate the maximum compression rate; second, some layers may get over-pruned, resulting in a significant drop in network performance. To solve these two problems, this study proposes a gradient-matrix singularity analysis-based method to estimate the maximum network redundancy. Guided by that maximum rate, a novel and efficient hierarchical network pruning algorithm is developed to maximally condense the neural network structure without sacrificing network performance. Substantial experiments are performed to demonstrate the efficacy of the new method for pruning several advanced convolutional neural network (CNN) architectures. Compared to existing pruning methods, the proposed pruning algorithm achieved state-of-the-art performance. At the same or similar compression ratio, the new method provided the highest network prediction accuracy as compared to other methods.  ( 2 min )
    Alternately Optimized Graph Neural Networks. (arXiv:2206.03638v1 [cs.LG])
    Graph Neural Networks (GNNs) have demonstrated powerful representation capability in numerous graph-based tasks. Specifically, the decoupled structures of GNNs such as APPNP have become popular due to their simplicity and performance advantages. However, the end-to-end training of these GNNs makes them inefficient in computation and memory consumption. In order to deal with these limitations, in this work, we propose an alternating optimization framework for graph neural networks that does not require end-to-end training. Extensive experiments under different settings demonstrate that the performance of the proposed algorithm is comparable to existing state-of-the-art algorithms but has significantly better computation and memory efficiency. Additionally, we show that our framework can be used to enhance existing decoupled GNNs.  ( 2 min )
    Spam Detection Using BERT. (arXiv:2206.02443v2 [cs.CR] UPDATED)
    Emails and SMSs are the most popular tools in today's communications, and as the number of email and SMS users increases, the number of spam messages also increases. Spam is any kind of unwanted, unsolicited digital communication that gets sent out in bulk; spam emails and SMSs cause major resource wastage by unnecessarily flooding the network links. Although most spam mail originates with advertisers looking to push their products, some is much more malicious in intent, such as emails that aim to trick victims into giving up sensitive information like website logins or credit card information; this type of cybercrime is known as phishing. To counter spam, much research and effort has gone into building spam detectors that are able to classify messages and emails as spam or ham. In this research we build a spam detector using a pre-trained BERT model that classifies emails and messages by understanding their context. We trained our spam detector on multiple corpora, namely the SMS collection corpus, Enron corpus, SpamAssassin corpus, Ling-Spam corpus and SMS spam collection corpus; our spam detector's performance was 98.62%, 97.83%, 99.13% and 99.28% respectively. Keywords: Spam Detector, BERT, Machine learning, NLP, Transformer, Enron Corpus, SpamAssassin Corpus, SMS Spam Detection Corpus, Ling-Spam Corpus.  ( 2 min )
    Meta-Learning Transferable Parameterized Skills. (arXiv:2206.03597v1 [cs.LG])
    We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. We first propose novel learning objectives -- trajectory-centric diversity and smoothness -- that allow an agent to meta-learn reusable parameterized skills. Our agent can use these learned skills to construct a temporally-extended parameterized-action Markov decision process, for which we propose a hierarchical actor-critic algorithm that aims to efficiently learn a high-level control policy with the learned skills. We empirically demonstrate that the proposed algorithms enable an agent to solve a complicated long-horizon obstacle-course environment.  ( 2 min )
    Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators. (arXiv:2104.08323v2 [cs.LG] UPDATED)
    Deep neural network (DNN) accelerators have received considerable attention in recent years due to their potential to save energy compared to mainstream hardware. Low-voltage operation of DNN accelerators further reduces energy consumption but causes bit-level failures in the memory storing the quantized weights. Furthermore, DNN accelerators are vulnerable to adversarial attacks on voltage controllers or individual bits. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, as well as random bit error training (RandBET) or adversarial bit error training (AdvBET) improves robustness against random or adversarial bit errors in quantized DNN weights significantly. This not only leads to high energy savings for low-voltage operation as well as low-precision quantization, but also improves the security of DNN accelerators. In contrast to related work, our approach generalizes across operating voltages and accelerators and does not require hardware changes. Moreover, we present a novel adversarial bit error attack and are able to obtain robustness against both targeted and untargeted bit-level attacks. Without losing more than 0.8%/2% in test accuracy, we can reduce energy consumption on CIFAR10 by 20%/30% for 8/4-bit quantization. Allowing up to 320 adversarial bit errors, we reduce test error from above 90% (chance level) to 26.22%.  ( 2 min )
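The failure mode this abstract targets (bits flipping in the memory that stores quantized weights) can be simulated in a few lines. This is a sketch under assumptions, not the paper's RandBET procedure: the unsigned fixed-point encoding over a clipped range, the function names, and the independent per-bit flip probability `p` are choices made for illustration.

```python
import random


def quantize(w: float, bits: int = 8, w_max: float = 1.0) -> int:
    """Clip w to [-w_max, w_max] and map it to an unsigned integer code."""
    levels = 2 ** bits - 1
    clipped = max(-w_max, min(w_max, w))
    return round((clipped + w_max) / (2 * w_max) * levels)


def dequantize(q: int, bits: int = 8, w_max: float = 1.0) -> float:
    """Map an integer code back to a real-valued weight."""
    levels = 2 ** bits - 1
    return q / levels * 2 * w_max - w_max


def inject_bit_errors(q: int, p: float, bits: int, rng: random.Random) -> int:
    """Flip each of the `bits` bits of q independently with probability p."""
    for b in range(bits):
        if rng.random() < p:
            q ^= 1 << b
    return q


rng = random.Random(0)
weights = [0.9, -0.4, 0.05]
codes = [quantize(w) for w in weights]
noisy = [dequantize(inject_bit_errors(q, p=0.01, bits=8, rng=rng)) for q in codes]
```

Clipping to a small `w_max` limits how far a single flipped high-order bit can move a weight, which gives some intuition for why the abstract pairs weight clipping with bit error training.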
    A generative recommender system with GMM prior for cancer drug generation and sensitivity prediction. (arXiv:2206.03555v1 [cs.LG])
    The recent emergence of high-throughput drug screening assays has sparked intensive development of machine learning methods, including models for prediction of sensitivity of cancer cell lines to anti-cancer drugs, as well as methods for generation of potential drug candidates. However, the generation of compounds with specific properties together with simultaneous modeling of their efficacy against cancer cell lines has not been comprehensively explored. To address this need, we present VADEERS, a Variational Autoencoder-based Drug Efficacy Estimation Recommender System. The generation of compounds is performed by a novel variational autoencoder with a semi-supervised Gaussian Mixture Model (GMM) prior. The prior defines a clustering in the latent space, where the clusters are associated with specific drug properties. In addition, VADEERS is equipped with a cell line autoencoder and a sensitivity prediction network. The model combines data for SMILES string representations of anti-cancer drugs, their inhibition profiles against a panel of protein kinases, biological features of cell lines and measurements of the sensitivity of the cell lines to the drugs. The evaluated variants of VADEERS achieve a high r=0.87 Pearson correlation between true and predicted drug sensitivity estimates. We train the GMM prior in such a way that the clusters in the latent space correspond to a pre-computed clustering of the drugs by their inhibitory profiles. We show that the learned latent representations and new generated data points accurately reflect the given clustering. In summary, VADEERS offers a comprehensive model of drug and cell line properties and the relationships between them, as well as a guided generation of novel compounds.  ( 2 min )
    Selective Network Linearization for Efficient Private Inference. (arXiv:2202.02340v2 [cs.CR] UPDATED)
    Private inference (PI) enables inference directly on cryptographically secure data. While promising to address many privacy issues, it has seen limited use due to extreme runtimes. Unlike plaintext inference, where latency is dominated by FLOPs, in PI non-linear functions (namely ReLU) are the bottleneck. Thus, practical PI demands novel ReLU-aware optimizations. To reduce PI latency we propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy. We evaluate our algorithm on several standard PI benchmarks. The results demonstrate up to $4.25\%$ more accuracy (iso-ReLU count at 50K) or $2.2\times$ less latency (iso-accuracy at 70%) than the current state of the art and advance the Pareto frontier across the latency-accuracy space. To complement empirical results, we present a "no free lunch" theorem that sheds light on how and when network linearization is possible while maintaining prediction accuracy. Public code is available at https://github.com/NYU-DICE-Lab/selective_network_linearization.
    An Iterative Labeling Method for Annotating Fisheries Imagery. (arXiv:2204.12934v2 [cs.LG] UPDATED)
    In this paper, we present a methodology for fisheries-related data that allows us to converge on a labeled image dataset by iterating over the dataset with multiple training and production loops that can exploit crowdsourcing interfaces. We present our algorithm and its results on two separate sets of image data collected using the Seabed autonomous underwater vehicle. The first dataset comprises 2,026 completely unlabeled images, while the second consists of 21,968 images that were point annotated by experts. Our results indicate that training with a small subset and iterating on that to build a larger set of labeled data allows us to converge to a fully annotated dataset with a small number of iterations. Even in the case of a dataset labeled by experts, a single iteration of the methodology improves the labels by discovering additional complicated examples of labels associated with fish that overlap, are very small, or are obscured by the contrast limitations associated with underwater imagery.
    Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL. (arXiv:2206.02039v2 [cs.AI] UPDATED)
    Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show the approach is effective in allowing users to identify previously-unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
    Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models. (arXiv:2205.15223v2 [cs.CL] UPDATED)
    Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.
    Towards Individual Grevy's Zebra Identification via Deep 3D Fitting and Metric Learning. (arXiv:2206.02261v2 [cs.CV] UPDATED)
    This paper combines deep learning techniques for species detection, 3D model fitting, and metric learning in one pipeline to perform individual animal identification from photographs by exploiting unique coat patterns. This is the first work to attempt this and, compared to traditional 2D bounding box or segmentation based CNN identification pipelines, the approach provides effective and explicit view-point normalisation and allows for a straightforward visualisation of the learned biometric population space. Note that due to the use of metric learning the pipeline is also readily applicable to open set and zero shot re-identification scenarios. We apply the proposed approach to individual Grevy's zebra (Equus grevyi) identification and show in a small study on the SMALST dataset that the use of 3D model fitting can indeed benefit performance. In particular, back-projected textures from 3D fitted models improve identification accuracy from 48.0% to 56.8% compared to 2D bounding box approaches for the dataset. Whilst the study is far too small to estimate accurately the full performance potential achievable in larger-scale real-world application settings and in comparisons against polished tools, our work lays the conceptual and practical foundations for a next step in animal biometrics towards deep metric learning driven, fully 3D-aware animal identification in open population settings. We publish network weights and relevant facilitating source code with this paper for full reproducibility and as inspiration for further research.
    Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration. (arXiv:2205.14249v2 [physics.flu-dyn] UPDATED)
    The deep learning boom motivates researchers and practitioners of computational fluid dynamics to integrate the two areas. The PINN (physics-informed neural network) method is one such attempt. While most reports in the literature show positive outcomes of applying the PINN method, our experiments with it stifled such optimism. This work presents our not-so-successful story of using PINN to solve two fundamental flow problems: 2D Taylor-Green vortex at $Re = 100$ and 2D cylinder flow at $Re = 200$. The PINN method solved the 2D Taylor-Green vortex problem with acceptable results, and we used this flow as an accuracy and performance benchmark. About 32 hours of training were required for the PINN method's accuracy to match the accuracy of a $16 \times 16$ finite-difference simulation, which took less than 20 seconds. The 2D cylinder flow, on the other hand, did not even result in a physical solution. The PINN method behaved like a steady-flow solver and did not capture the vortex shedding phenomenon. By sharing our experience, we would like to emphasize that the PINN method is still a work-in-progress. More work is needed to make PINN feasible for real-world problems.
    Poisoning Deep Learning Based Recommender Model in Federated Learning Scenarios. (arXiv:2204.13594v2 [cs.IR] UPDATED)
    Various attack methods against recommender systems have been proposed in the past years, and the security issues of recommender systems have drawn considerable attention. Traditional attacks attempt to make target items recommended to as many users as possible by poisoning the training data. Benefiting from the feature of protecting users' private data, federated recommendation can effectively defend against such attacks. Therefore, quite a few works have devoted themselves to developing federated recommender systems. To show that current federated recommendation is still vulnerable, in this work we design attack approaches targeting deep learning based recommender models in federated learning scenarios. Specifically, our attacks generate poisoned gradients for manipulated malicious users to upload based on two strategies (i.e., random approximation and hard user mining). Extensive experiments show that our well-designed attacks can effectively poison the target models, and the attack effectiveness sets a new state of the art.
    Stop Oversampling for Class Imbalance Learning: A Critical Review. (arXiv:2202.03579v2 [cs.LG] UPDATED)
    For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets. Many approaches to solving this challenge have been offered in the literature. Oversampling, on the other hand, is a concern. That is, models trained on fictitious data may fail spectacularly when put to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while pretending they represent minority may result in incorrect predictions when the model is used in the real world. We analyzed a large number of oversampling methods in this paper and devised a new oversampling evaluation system based on hiding a number of majority examples and comparing them to those generated by the oversampling process. Based on our evaluation system, we ranked all these methods based on their incorrectly generated examples for comparison. Our experiments using more than 70 oversampling methods and three imbalanced real-world datasets reveal that all oversampling methods studied generate minority samples that are most likely to be majority. Given data and methods in hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class imbalanced data and should be avoided in real-world applications.
    What's in the Black Box? The False Negative Mechanisms Inside Object Detectors. (arXiv:2203.07662v2 [cs.CV] UPDATED)
    In object detection, false negatives arise when a detector fails to detect a target object. To understand why object detectors produce false negatives, we identify five 'false negative mechanisms', where each mechanism describes how a specific component inside the detector architecture failed. Focusing on two-stage and one-stage anchor-box object detector architectures, we introduce a framework for quantifying these false negative mechanisms. Using this framework, we investigate why Faster R-CNN and RetinaNet fail to detect objects in benchmark vision datasets and robotics datasets. We show that a detector's false negative mechanisms differ significantly between computer vision benchmark datasets and robotics deployment scenarios. This has implications for the translation of object detectors developed for benchmark datasets to robotics applications.
    Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images. (arXiv:2109.10778v2 [cs.CV] UPDATED)
    Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithms development. However, generating exhaustive and accurate annotations is labor-intensive, challenging, and costly. Drawing only coarse and approximate annotations is a much easier task, less costly, and it alleviates pathologists' workload. In this paper, we study the problem of refining these approximate annotations in digital pathology to obtain more accurate ones. Some previous works have explored obtaining machine learning models from these inaccurate annotations, but few of them tackle the refinement problem where the mislabeled regions should be explicitly identified and corrected, and all of them require a -- often very large -- number of training samples. We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need of external training data. Patches cropped from a WSI with inaccurate labels are processed jointly within a multiple instance learning framework, mitigating their impact on the predictive model and refining the segmentation. Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming state-of-the-art alternatives, even while learning from a single slide. Moreover, we demonstrate how real annotations drawn by pathologists can be efficiently refined and improved by the proposed approach. All these results demonstrate that LC-MIL is a promising, light-weight tool to provide fine-grained annotations from coarsely annotated pathology sets.
    STable: Table Generation Framework for Encoder-Decoder Models. (arXiv:2206.04045v1 [cs.CL])
    The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table's content across all random permutations of the factorization order. During the content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and avoid substantial error accumulation, which other sequential models are prone to. Experiments demonstrate a high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
    Inverse Contextual Bandits: Learning How Behavior Evolves over Time. (arXiv:2107.06317v3 [cs.LG] UPDATED)
    Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantations has progressed over the years, a pertinent question is: How have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, as well as operating in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior. Finally, using both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, as well as benchmarking and validating its accuracy.
    Geometry of Linear Convolutional Networks. (arXiv:2108.01538v2 [cs.LG] UPDATED)
    We study the family of functions that are represented by a linear convolutional neural network (LCN). These functions form a semi-algebraic subset of the set of linear maps from input space to output space. In contrast, the families of functions represented by fully-connected linear networks form algebraic sets. We observe that the functions represented by LCNs can be identified with polynomials that admit certain factorizations, and we use this perspective to describe the impact of the network's architecture on the geometry of the resulting function space. We further study the optimization of an objective function over an LCN, analyzing critical points in function space and in parameter space, and describing dynamical invariants for gradient descent. Overall, our theory predicts that the optimized parameters of an LCN will often correspond to repeated filters across layers, or filters that can be decomposed as repeated filters. We also conduct numerical and symbolic experiments that illustrate our results and present an in-depth analysis of the landscape for small architectures.
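The identification of LCN functions with factorizable polynomials can be illustrated numerically. The sketch below assumes the simplest setting (one channel, stride one, full 1D convolution, no bias), which is an assumption for illustration and narrower than what the paper may cover. Under it, stacking two convolutional layers equals a single convolution whose filter is the product of the two filter polynomials.

```python
import numpy as np

# Two hypothetical 1D filters, read as polynomial coefficients.
w1 = np.array([1.0, 2.0])        # 1 + 2t
w2 = np.array([1.0, -1.0, 3.0])  # 1 - t + 3t^2
x = np.array([0.5, -1.0, 2.0, 4.0])

# Two-layer linear convolutional network applied to x.
two_layer = np.convolve(w2, np.convolve(w1, x))

# Equivalent single layer: the end-to-end filter is the product polynomial.
end_to_end = np.convolve(w1, w2)
one_layer = np.convolve(end_to_end, x)

print(end_to_end)                         # coefficients of the product polynomial
print(np.allclose(two_layer, one_layer))  # both routes give the same output
```

Read in reverse, an end-to-end filter is reachable by this two-layer architecture only if its polynomial factors into pieces of the matching degrees, which is the sense in which architecture constrains the function space.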
    Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions. (arXiv:2108.11328v2 [stat.ML] UPDATED)
    Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.
    FedSEAL: Semi-Supervised Federated Learning with Self-Ensemble Learning and Negative Learning. (arXiv:2110.07829v2 [cs.LG] UPDATED)
    Federated learning (FL), a popular decentralized and privacy-preserving machine learning (ML) framework, has received extensive research attention in recent years. The majority of existing works focus on supervised learning (SL) problems where it is assumed that clients carry labeled datasets while the server has no data. However, in realistic scenarios, clients are often unable to label their data due to the lack of expertise and motivation while the server may host a small amount of labeled data. How to reasonably utilize the server labeled data and the clients' unlabeled data is thus of paramount practical importance. In this paper, we propose a new FL algorithm, called FedSEAL, to solve this Semi-Supervised Federated Learning (SSFL) problem. Our algorithm utilizes self-ensemble learning and complementary negative learning to enhance both the accuracy and the efficiency of clients' unsupervised learning on unlabeled data, and orchestrates the model training on both the server side and the clients' side. Our experimental results on Fashion-MNIST and CIFAR10 datasets in the SSFL setting validate the effectiveness of our method, which outperforms the state-of-the-art SSFL methods by a large margin.
    Dissipative Deep Neural Dynamical Systems. (arXiv:2011.13492v3 [cs.LG] UPDATED)
    In this paper, we provide sufficient conditions for dissipativity and local asymptotic stability of discrete-time dynamical systems parametrized by deep neural networks. We leverage the representation of neural networks as pointwise affine maps, thus exposing their local linear operators and making them accessible to classical system analytic and design methods. This allows us to "crack open the black box" of the neural dynamical system's behavior by evaluating their dissipativity, and estimating their stationary points and state-space partitioning. We relate the norms of these local linear operators to the energy stored in the dissipative system with supply rates represented by their aggregate bias terms. Empirically, we analyze the variance in dynamical behavior and eigenvalue spectra of these local linear operators with varying weight factorizations, activation functions, bias terms, and depths.
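The "pointwise affine map" view can be sketched for a small ReLU network: at a fixed input, the activation pattern is frozen, so the network equals an affine map whose local linear operator and aggregate bias can be written out explicitly. The two-layer architecture, the random weights, and the name `local_affine` are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def forward(x):
    """Two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_affine(x):
    """Return (A, b) such that forward(x) == A @ x + b at this input."""
    d = (W1 @ x + b1 > 0).astype(float)  # activation pattern at x
    A = W2 @ (d[:, None] * W1)           # local linear operator
    b = W2 @ (d * b1) + b2               # aggregate bias term
    return A, b

x = rng.normal(size=3)
A, b = local_affine(x)
print(np.allclose(forward(x), A @ x + b))
print(np.linalg.norm(A, 2))  # spectral norm of the local operator
```

Once `A` is available as an ordinary matrix, quantities such as its norm or eigenvalue spectrum can be inspected with standard linear-algebra tools, which is the kind of analysis the abstract describes.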
    SelfCF: A Simple Framework for Self-supervised Collaborative Filtering. (arXiv:2107.03019v2 [cs.IR] UPDATED)
    Collaborative filtering (CF) is widely used to learn informative latent representations of users and items from observed interactions. Existing CF-based methods commonly adopt negative sampling to discriminate different items. Training with negative sampling on large datasets is computationally expensive. Further, negative items should be carefully sampled under the defined distribution, in order to avoid selecting an observed positive item in the training dataset. Unavoidably, some negative items sampled from the training dataset could be positive in the test set. In this paper, we propose a self-supervised collaborative filtering framework (SelfCF) that is specially designed for recommender scenarios with implicit feedback. The proposed SelfCF framework simplifies the Siamese networks and can be easily applied to existing deep-learning based CF models, which we refer to as backbone networks. The main idea of SelfCF is to augment the output embeddings generated by backbone networks, because it is infeasible to augment raw input of user/item ids. We propose and study three output perturbation techniques that can be applied to different types of backbone networks including both traditional CF models and graph-based models. The framework enables learning informative representations of users and items without negative samples, and is agnostic to the encapsulated backbones. We conduct comprehensive experiments on four datasets to show that our framework may achieve even better recommendation accuracy than the encapsulated supervised counterpart with a 2$\times$--4$\times$ faster training speed. We also show that SelfCF can boost accuracy by up to 17.79\% on average, compared with a self-supervised framework BUIR.
    Attribution of Predictive Uncertainties in Classification Models. (arXiv:2107.08756v3 [cs.LG] UPDATED)
    Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.
    To remove or not remove Mobile Apps? A data-driven predictive model approach. (arXiv:2206.03905v1 [cs.CR])
    Mobile app stores are the key distributors of mobile applications. They regularly apply vetting processes to the deployed apps. Yet, some of these vetting processes might be inadequate or applied late. The late removal of applications might have unpleasant consequences for developers and users alike. Thus, in this work we propose a data-driven predictive approach that determines whether the respective app will be removed or accepted. It also indicates the relevance of each feature, which helps stakeholders interpret the predictions. In turn, our approach can support developers in improving their apps and users in downloading the ones that are less likely to be removed. We focus on the Google App store and we compile a new data set of 870,515 applications, 56% of which have actually been removed from the market. Our proposed approach is a bootstrap aggregating of multiple XGBoost machine learning classifiers. We propose two models: user-centered using 47 features, and developer-centered using 37 features, the ones only available before deployment. We achieve the following Areas Under the ROC Curves (AUCs) on the test set: user-centered = 0.792, developer-centered = 0.762.
    Resolving the Human Subjects Status of Machine Learning's Crowdworkers. (arXiv:2206.04039v1 [cs.CY])
    In recent years, machine learning (ML) has come to rely more heavily on crowdworkers, both for building bigger datasets and for addressing research questions requiring human interaction or judgment. Owing to the diverse tasks performed by crowdworkers, and the myriad ways the resulting datasets are used, it can be difficult to determine when these individuals are best thought of as workers, versus as human subjects. These difficulties are compounded by conflicting policies, with some institutions and researchers treating all ML crowdwork as human subjects research, and other institutions holding that ML crowdworkers rarely constitute human subjects. Additionally, few ML papers involving crowdwork mention IRB oversight, raising the prospect that many might not be in compliance with ethical and regulatory requirements. In this paper, we focus on research in natural language processing to investigate the appropriate designation of crowdsourcing studies and the unique challenges that ML research poses for research oversight. Crucially, under the U.S. Common Rule, these judgments hinge on determinations of "aboutness", both whom (or what) the collected data is about and whom (or what) the analysis is about. We highlight two challenges posed by ML: (1) the same set of workers can serve multiple roles and provide many sorts of information; and (2) compared to the life sciences and social sciences, ML research tends to embrace a dynamic workflow, where research questions are seldom stated ex ante and data sharing opens the door for future studies to ask questions about different targets from the original study. In particular, our analysis exposes a potential loophole in the Common Rule, where researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies. We offer several policy recommendations to address these concerns.
    Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens. (arXiv:2108.11193v2 [cs.CL] UPDATED)
    Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and reach high average character ngram overlap on all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a near-identical learning curve as training without spelling-based enrichment. Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not appear to enhance its performance on such tasks.
    An Information-Theoretic Framework for Supervised Learning. (arXiv:2203.00246v5 [cs.LG] UPDATED)
    Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learning under a different lens. In this paper, we propose a novel information-theoretic framework with its own notions of regret and sample complexity for analyzing the data requirements of machine learning. With our framework, we first work through some classical examples such as scalar estimation and linear regression to build intuition and introduce general techniques. Then, we use the framework to study the sample complexity of learning from data generated by deep sign neural networks, deep ReLU neural networks, and deep networks that are infinitely wide but have a bounded sum of weights. For sign neural networks, we recover sample-complexity bounds that follow from VC-dimension based arguments. For the latter two neural network environments, we establish new results that suggest that the sample complexity of learning under these data generating processes is at most linear and quadratic, respectively, in network depth.
    Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games. (arXiv:2206.04044v1 [cs.LG])
    This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.
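To get a feel for how the stated sample complexity scales, the following sketch simply evaluates the expression from the abstract with constants and log factors dropped. The function name `vi_lcb_game_bound` and the numeric inputs are hypothetical; the result is a scaling indicator, not a usable sample count.

```python
def vi_lcb_game_bound(c_clipped, S, A, B, gamma, eps):
    """Evaluate C* S (A + B) / ((1 - gamma)^3 eps^2), ignoring log factors."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1/(1 - gamma)]")
    return c_clipped * S * (A + B) / ((1.0 - gamma) ** 3 * eps ** 2)

# Hypothetical game: 10 states, 4 max-player actions, 6 min-player actions.
bound = vi_lcb_game_bound(c_clipped=1.0, S=10, A=4, B=6, gamma=0.5, eps=1.0)
print(bound)

# The abstract reports an improvement by a factor of min{A, B} over prior art,
# so the corresponding prior scaling is larger by that factor.
print(bound * min(4, 6))
```

The dependence on actions is additive, `A + B`, rather than multiplicative, and halving `eps` quadruples the value.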
    Structure-Aware Transformer for Graph Representation Learning. (arXiv:2202.03036v2 [stat.ML] UPDATED)
    The Transformer architecture has gained growing attention in graph representation learning recently, as it naturally overcomes several limitations of graph neural networks (GNNs) by avoiding their strict structural inductive biases and instead only encoding the graph structure via positional encoding. Here, we show that the node representations generated by the Transformer with positional encoding do not necessarily capture structural similarity between them. To address this issue, we propose the Structure-Aware Transformer, a class of simple and flexible graph Transformers built upon a new self-attention mechanism. This new self-attention incorporates structural information into the original self-attention by extracting a subgraph representation rooted at each node before computing the attention. We propose several methods for automatically generating the subgraph representation and show theoretically that the resulting representations are at least as expressive as the subgraph representations. Empirically, our method achieves state-of-the-art performance on five graph prediction benchmarks. Our structure-aware framework can leverage any existing GNN to extract the subgraph representation, and we show that it systematically improves performance relative to the base GNN model, successfully combining the advantages of GNNs and Transformers. Our code is available at https://github.com/BorgwardtLab/SAT .
    Scalable Joint Learning of Wireless Multiple-Access Policies and their Signaling. (arXiv:2206.03844v1 [cs.IT])
    In this paper, we apply a multi-agent reinforcement learning (MARL) framework allowing the base station (BS) and the user equipments (UEs) to jointly learn a channel access policy and its signaling in a wireless multiple access scenario. In this framework, the BS and UEs are reinforcement learning (RL) agents that need to cooperate in order to deliver data. The comparison with contention-free and contention-based baselines shows that our framework achieves a superior performance in terms of goodput even in high traffic situations while maintaining a low collision rate. The scalability of the proposed method is studied, since it is a major problem in MARL, and this paper provides the first results towards addressing it.
    Few-shot Prompting Toward Controllable Response Generation. (arXiv:2206.03931v1 [cs.CL])
    Much literature has shown that prompt-based learning is an efficient method to make use of large pre-trained language models. Recent works also exhibit the possibility of steering a chatbot's output by plugging in an appropriate prompt. Gradient-based methods are often used to perturb the prompts. However, some language models are not even available to the public. In this work, we first explored the combination of prompting and reinforcement learning (RL) to steer models' generation without accessing any of the models' parameters. Second, to reduce the training effort and enhance the generalizability to the unseen task, we apply multi-task learning to make the model learn to generalize to new tasks better. The experimental results show that our proposed method can successfully control several state-of-the-art (SOTA) dialogue models without accessing their parameters. Furthermore, the model demonstrates the strong ability to quickly adapt to an unseen task in fewer steps than the baseline model.
    A Study of Continual Learning Methods for Q-Learning. (arXiv:2206.03934v1 [cs.LG])
    We present an empirical study on the use of continual learning (CL) methods in a reinforcement learning (RL) scenario, which, to the best of our knowledge, has not been described before. CL is a very active recent research topic concerned with machine learning under non-stationary data distributions. Although this naturally applies to RL, the use of dedicated CL methods is still uncommon. This may be due to the fact that CL methods often assume a decomposition of CL problems into disjoint sub-tasks of stationary distribution, that the onset of these sub-tasks is known, and that sub-tasks are non-contradictory. In this study, we perform an empirical comparison of selected CL methods in an RL problem where a physically simulated robot must follow a racetrack by vision. In order to make CL methods applicable, we restrict the RL setting and introduce non-conflicting sub-tasks of known onset, which are however not disjoint and whose distribution, from the learner's point of view, is still non-stationary. Our results show that dedicated CL methods can significantly improve learning when compared to the baseline technique of "experience replay".
    Sharp-MAML: Sharpness-Aware Model-Agnostic Meta Learning. (arXiv:2206.03996v1 [cs.LG])
    Model-agnostic meta learning (MAML) is currently one of the dominant approaches for few-shot meta-learning. Despite its effectiveness, the optimization of MAML can be challenging due to the innate bilevel problem structure. Specifically, the loss landscape of MAML is much more complex, with possibly more saddle points and local minimizers, than its empirical risk minimization counterpart. To address this challenge, we leverage the recently invented sharpness-aware minimization and develop a sharpness-aware MAML approach that we term Sharp-MAML. We empirically demonstrate that Sharp-MAML and its computation-efficient variant can outperform popular existing MAML baselines (e.g., $+12\%$ accuracy on Mini-Imagenet). We complement the empirical study with the convergence rate analysis and the generalization bound of Sharp-MAML. To the best of our knowledge, this is the first empirical and theoretical study on sharpness-aware minimization in the context of bilevel learning. The code is available at https://github.com/mominabbass/Sharp-MAML.
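    The sharpness-aware minimization step that the abstract builds on can be sketched as follows. This is a minimal illustration of plain SAM on a single-level toy objective, not the bilevel Sharp-MAML procedure; the names `sam_step` and `quadratic_grad` and the values of `lr` and `rho` are assumptions made for the example.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) update on a list of floats."""
    grads = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0:
        return list(weights)  # already at a stationary point
    # Ascent step: move to the approximate worst-case point
    # inside an L2 ball of radius rho around the weights.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grads)]
    # Descent step: apply the gradient taken at the perturbed
    # point to the original (unperturbed) weights.
    sharp_grads = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grads)]


def quadratic_grad(weights):
    """Gradient of the toy loss 0.5 * sum(w^2), for illustration only."""
    return list(weights)


new_weights = sam_step([3.0, 4.0], quadratic_grad, lr=0.1, rho=0.5)
```

    In a MAML setting the same perturb-then-descend pattern could in principle be applied at the inner level, the outer level, or both; which of these the paper uses is not stated in the abstract.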
    Learning in games from a stochastic approximation viewpoint. (arXiv:2206.03922v1 [cs.GT])
    We develop a unified stochastic approximation framework for analyzing the long-run behavior of multi-agent online learning in games. Our framework is based on a "primal-dual", mirrored Robbins-Monro (MRM) template which encompasses a wide array of popular game-theoretic learning algorithms (gradient methods, their optimistic variants, the EXP3 algorithm for learning with payoff-based feedback in finite games, etc.). In addition to providing an integrated view of these algorithms, the proposed MRM blueprint allows us to obtain a broad range of new convergence results, both asymptotic and in finite time, in both continuous and finite games.
    Federated Learning Algorithms for Generalized Mixed-effects Model (GLMM) on Horizontally Partitioned Data from Distributed Sources. (arXiv:2109.14046v2 [stat.ML] UPDATED)
    Objectives: This paper develops two algorithms to achieve federated generalized linear mixed effect models (GLMM), and compares the outcomes of the developed models with each other as well as with those from the standard R package (lme4). Methods: The log-likelihood function of GLMM is approximated by two numerical methods (Laplace approximation and Gauss-Hermite approximation), which supports federated decomposition of GLMM to bring computation to data. Results: Our developed method can handle GLMM to accommodate hierarchical data with multiple non-independent levels of observations in a federated setting. The experiment results demonstrate comparable (Laplace) and superior (Gauss-Hermite) performances with simulated and real-world data. Conclusion: We developed and compared federated GLMMs with different approximations, which can support researchers in analyzing biomedical data to accommodate mixed effects and address non-independence due to hierarchical structures (e.g., institute, region, country).
    COVIDHunter: An Accurate, Flexible, and Environment-Aware Open-Source COVID-19 Outbreak Simulation Model. (arXiv:2102.03667v2 [q-bio.PE] UPDATED)
    Background: Early detection and isolation of COVID-19 patients are essential for successful implementation of mitigation strategies and eventually curbing the disease spread. With a limited number of daily COVID-19 tests performed in every country, simulating the COVID-19 spread along with the potential effect of each mitigation strategy currently remains one of the most effective ways of managing the healthcare system and guiding policy-makers. Methods: We introduce COVIDHunter, a flexible and accurate COVID-19 outbreak simulation model that evaluates the current mitigation measures that are applied to a region and provides suggestions on what strength the upcoming mitigation measure should be. The key idea of COVIDHunter is to quantify the spread of COVID-19 in a geographical region by simulating the average number of new infections caused by an infected person considering the effect of external factors, such as environmental conditions (e.g., climate, temperature, humidity) and mitigation measures. Results: Using Switzerland as a case study, COVIDHunter estimates that if the policy-makers relax the mitigation measures by 50% for 30 days then both the daily capacity need for hospital beds and daily number of deaths increase exponentially by an average of 5.1x, with the additional patients potentially occupying ICU beds and ventilators for a period of time. Unlike existing models, the COVIDHunter model accurately monitors and predicts the daily number of cases, hospitalizations, and deaths due to COVID-19. Our model is flexible to configure and simple to modify for modeling different scenarios under different environmental conditions and mitigation measures. Availability: We release the source code of the COVIDHunter implementation at https://github.com/CMU-SAFARI/COVIDHunter and show how to flexibly configure our model for any scenario and easily extend it for different measures and conditions than we account for.
    Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. (arXiv:2205.04363v2 [cs.CV] UPDATED)
    Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
    Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data. (arXiv:2206.02353v2 [cs.LG] UPDATED)
    Recently, Self-Supervised Representation Learning (SSRL) has attracted much attention in the fields of computer vision, speech, and natural language processing (NLP), and more recently for other types of modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training. Acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals that have been freely obtained from the raw data. Unlike existing reviews of SSRL that have predominantly focused upon methods in the fields of CV or NLP for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective function, network architecture and potential applications, and 4) review existing multimodal techniques in each category and various modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that utilise multimodal and/or temporal data.
    Inferring Lexicographically-Ordered Rewards from Preferences. (arXiv:2202.10153v2 [cs.LG] UPDATED)
    Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.
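    The lexicographic ordering described above, in which a lower-priority objective matters only when the agent is indifferent on every higher-priority one, can be sketched as a simple comparison. This is a deterministic simplification for illustration, not the paper's inference method; the function name `lexicographic_prefer` and the `tolerance` parameter (the margin within which two rewards count as indifferent) are assumptions.

```python
def lexicographic_prefer(rewards_a, rewards_b, tolerance=0.0):
    """Compare two alternatives whose rewards are listed from highest
    to lowest priority.

    Returns 1 if A is preferred, -1 if B is preferred, 0 if indifferent.
    """
    for reward_a, reward_b in zip(rewards_a, rewards_b):
        # Only a difference larger than the tolerance breaks indifference;
        # otherwise fall through to the next (lower-priority) objective.
        if abs(reward_a - reward_b) > tolerance:
            return 1 if reward_a > reward_b else -1
    return 0


# The first objective decides, regardless of the second.
first = lexicographic_prefer((1.0, 0.0), (0.5, 10.0))
# A tie on the first objective defers to the second.
second = lexicographic_prefer((1.0, 2.0), (1.0, 3.0))
```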
    Neural Diffusion Processes. (arXiv:2206.03992v1 [stat.ML])
    Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models, that learn to sample from distributions over functions. Using a novel attention block, we can incorporate properties of stochastic processes, such as exchangeability, directly into the NDP's architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior of a Gaussian process. This enables a variety of downstream tasks, including hyperparameter marginalisation and Bayesian optimisation.
    Data-driven hysteretic behavior simulation based on weighted stacked pyramid neural network architecture. (arXiv:2206.03990v1 [cs.LG])
    An accurate and efficient simulation of the hysteretic behavior of materials and components is essential for structural analysis. The surrogate model based on neural networks shows significant potential in balancing efficiency and accuracy. However, its serial information flow and prediction based on single-level features adversely affect the network performance. Therefore, a weighted stacked pyramid neural network architecture is proposed herein. This network establishes a pyramid architecture by introducing multi-level shortcuts to directly integrate features in the output module. In addition, a weighted stacked strategy is proposed to replace the conventional feature fusion method. The weights of the features are determined based on their levels. These basic principles are verified, and key network settings are discussed. Subsequently, the redesigned architectures are compared with other commonly used algorithms. Results show that the testing mean-square error (MSE) loss of the networks on varied datasets can be reduced by an average of 34.7%. The redesigned architectures outperform the alternatives in 87.5% of cases, and the proposed Pyramid-GA network has the best overall performance.
    Modeling Disagreement in Automatic Data Labelling for Semi-Supervised Learning in Clinical Natural Language Processing. (arXiv:2205.14761v2 [cs.LG] UPDATED)
    Computational models providing accurate estimates of their uncertainty are crucial for risk management associated with decision making in healthcare contexts. This is especially true since many state-of-the-art systems are trained using the data which has been labelled automatically (self-supervised mode) and tend to overfit. In this work, we investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports. This problem remains understudied for Natural Language Processing in the healthcare domain. We demonstrate that Gaussian Processes (GPs) provide superior performance in quantifying the risks of 3 uncertainty labels based on the negative log predictive probability (NLPP) evaluation metric and mean maximum predicted confidence levels (MMPCL), whilst retaining strong predictive performance.
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v1 [cs.LG])
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio $1-\mu$ and single-view samples of ratio $\mu$, where multi/single-view samples have multiple/single discriminative semantics. Then for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data; and 2) a convolution kernel captures at most one semantic. Accordingly, in the downstream supervised fine-tuning, most semantics would be captured and different semantics would not be fused together. This helps the downstream fine-tuned network to easily establish the relation between kernels and semantic class labels. In this way, the fine-tuned encoder in MRP provably achieves zero test error with high probability for both multi-view and single-view test data. In contrast, as proved by [3], conventional SL can only obtain a test accuracy of around $0.5\mu$ for single-view test data. These results together explain the benefits of MRP in downstream tasks. Experimental results testify to multi-view data assumptions and our theoretical implications.
    Option Transfer and SMDP Abstraction with Successor Features. (arXiv:2110.09196v2 [cs.LG] UPDATED)
    Abstraction plays an important role in the generalisation of knowledge and skills and is key to sample efficient learning. In this work, we study joint temporal and state abstraction in reinforcement learning, where temporally-extended actions in the form of options induce temporal abstractions, while aggregation of similar states with respect to abstract options induces state abstractions. Many existing abstraction schemes ignore the interplay of state and temporal abstraction. Consequently, the considered option policies often cannot be directly transferred to new environments due to changes in the state space and transition dynamics. To address this issue, we propose a novel abstraction scheme building on successor features. This includes an algorithm for transferring abstract options across different environments and a state abstraction mechanism that allows us to perform efficient planning with the transferred options.
    Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance. (arXiv:2206.03787v1 [cs.LG])
    Many deep reinforcement learning algorithms rely on simple forms of exploration, such as the additive action-noise often used in continuous control domains. Typically, the scaling factor of this action noise is chosen as a hyper-parameter and kept constant during training. In this paper, we analyze how the learned policy is impacted by the noise type, the noise scale, and a reduction of the scaling factor over time. We consider the two most prominent types of action-noise, Gaussian and Ornstein-Uhlenbeck noise, and perform a vast experimental campaign by systematically varying the noise type and scale parameter, and by measuring variables of interest like the expected return of the policy and the state space coverage during exploration. For the latter, we propose a novel state-space coverage measure $\operatorname{X}_{\mathcal{U}\text{rel}}$ that is more robust to boundary artifacts than previously proposed measures. Larger noise scales generally increase state space coverage. However, we found that increasing the space coverage using a larger noise scale is often not beneficial. On the contrary, reducing the noise-scale over the training process reduces the variance and generally improves the learning performance. We conclude that the best noise-type and scale are environment dependent, and based on our observations, derive heuristic rules for guiding the choice of the action noise as a starting point for further optimization.
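    The two noise types and the idea of reducing the scale over training can be sketched as follows. This is a minimal one-dimensional illustration; the class names, the Ornstein-Uhlenbeck parameters `theta` and `dt`, and the linear form of the schedule are assumptions for the example rather than the paper's experimental settings.

```python
import math
import random


class GaussianNoise:
    """Uncorrelated action noise: each sample is scale * N(0, 1)."""

    def __init__(self, scale, rng):
        self.scale = scale
        self.rng = rng

    def sample(self):
        return self.scale * self.rng.gauss(0.0, 1.0)


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise from an Euler-Maruyama discretization
    of dx = -theta * x dt + scale dW."""

    def __init__(self, scale, rng, theta=0.15, dt=0.01):
        self.scale = scale
        self.rng = rng
        self.theta = theta
        self.dt = dt
        self.state = 0.0

    def sample(self):
        pull = -self.theta * self.state * self.dt  # mean reversion toward 0
        kick = self.scale * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.state += pull + kick
        return self.state


def linear_schedule(step, total_steps, start, end):
    """Linearly interpolate the noise scale from start to end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)


rng = random.Random(0)
noise = OrnsteinUhlenbeckNoise(scale=0.2, rng=rng)
for step in range(5):
    noise.scale = linear_schedule(step, total_steps=5, start=0.2, end=0.05)
    perturbation = noise.sample()  # would be added to the policy's action
```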
    Entropic Convergence of Random Batch Methods for Interacting Particle Diffusion. (arXiv:2206.03792v1 [math.PR])
    We propose a covariance-corrected random batch method for interacting particle systems. By establishing a certain entropic central limit theorem, we provide entropic convergence guarantees for the law of the entire trajectories of all particles of the proposed method to the law of the trajectories of the discrete time interacting particle system whenever the batch size $B \gg (\alpha n)^{\frac{1}{3}}$ (where $n$ is the number of particles and $\alpha$ is the time discretization parameter). This in turn implies that the outputs of these methods are nearly statistically indistinguishable when $B$ is even moderately large. Previous works mainly considered convergence in Wasserstein distance, which either required stringent assumptions on the potentials or yielded bounds with an exponential dependence on the time horizon. This work makes minimal assumptions on the interaction potentials and in particular establishes that even when the particle trajectories diverge to infinity, they do so in the same way for both methods. Such guarantees are very useful in light of the recent advances in interacting particle based algorithms for sampling.
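    For orientation, one step of a plain random batch method can be sketched as below, using the abstract's symbols $n$ (number of particles), $B$ (batch size) and $\alpha$ (time step). This is one common variant, in which each particle averages the interaction over its own random batch; it does not implement the covariance correction proposed in the paper, and the function names, the one-dimensional state and the `sigma` diffusion parameter are assumptions.

```python
import math
import random


def batch_drift(i, positions, kernel, batch_size, rng):
    """Average interaction force on particle i over a random batch of
    batch_size other particles (sampled without replacement)."""
    others = [j for j in range(len(positions)) if j != i]
    batch = rng.sample(others, batch_size)
    return sum(kernel(positions[i] - positions[j]) for j in batch) / batch_size


def random_batch_step(positions, kernel, batch_size, alpha, sigma, rng):
    """One discrete-time update of all particles with step size alpha."""
    updated = []
    for i in range(len(positions)):
        drift = batch_drift(i, positions, kernel, batch_size, rng)
        noise = sigma * math.sqrt(alpha) * rng.gauss(0.0, 1.0)
        updated.append(positions[i] + alpha * drift + noise)
    return updated


def attract(difference):
    """Toy linear attraction kernel, for illustration only."""
    return -difference


rng = random.Random(0)
particles = [0.0, 1.0, 3.0]
particles = random_batch_step(particles, attract, batch_size=2,
                              alpha=0.1, sigma=0.0, rng=rng)
```

    With `batch_size` equal to $n-1$ the batch average coincides with the full interaction sum, so the step reduces to the ordinary discrete-time particle system; smaller batches trade accuracy for a lower per-step cost.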
    Patch-based Object-centric Transformers for Efficient Video Generation. (arXiv:2206.04003v1 [cs.CV])
    In this work, we present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture that leverages object-centric information to efficiently model temporal dynamics in videos. We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos, with an added modification to model object-centric information via bounding boxes. Due to better compressibility of object-centric representations, we can improve training efficiency by allowing the model to only access object information for longer horizon temporal information. When evaluated on various difficult object-centric datasets, our method achieves better or equal performance to other video generation models, while remaining computationally more efficient and scalable. In addition, we show that our method is able to perform object-centric controllability through bounding box manipulation, which may aid downstream tasks such as video editing, or visual planning. Samples are available at https://sites.google.com/view/povt-public
    Neural Bandit with Arm Group Graph. (arXiv:2206.03644v1 [cs.LG])
    Contextual bandits aim to identify among a set of arms the optimal one with the highest reward based on their contextual information. Motivated by the fact that the arms usually exhibit group behaviors and the mutual impacts exist among groups, we introduce a new model, Arm Group Graph (AGG), where the nodes represent the groups of arms and the weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, where the neural networks are designed to estimate rewards, and we propose to utilize graph neural networks (GNN) to learn the representations of arm groups with correlations. To solve the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) built on neural networks (exploitation) for exploration. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and provide the convergence analysis of GNN with fully-connected layers which may be of independent interest. In the end, we conduct extensive experiments against state-of-the-art baselines on multiple public data sets, showing the effectiveness of the proposed algorithm.
    Accelerating Score-based Generative Models for High-Resolution Image Synthesis. (arXiv:2206.04029v1 [cs.CV])
    Score-based generative models (SGMs) have recently emerged as a promising class of generative models. The key idea is to produce high-quality images by recurrently adding Gaussian noise and gradients to a Gaussian sample until converging to the target distribution, a.k.a. diffusion sampling. To ensure stability of convergence in sampling and generation quality, however, this sequential sampling process has to take a small step size and many sampling iterations (e.g., 2000). Several acceleration methods have been proposed with a focus on low-resolution generation. In this work, we consider the acceleration of high-resolution generation with SGMs, a more challenging yet more important problem. We prove theoretically that this slow convergence drawback is primarily due to the ignorance of the target distribution. Further, we introduce a novel Target Distribution Aware Sampling (TDAS) method by leveraging the structural priors in space and frequency domains. Extensive experiments on CIFAR-10, CelebA, LSUN, and FFHQ datasets validate that TDAS can consistently accelerate state-of-the-art SGMs, particularly on more challenging high resolution (1024x1024) image generation tasks by up to 18.4x, whilst largely maintaining the synthesis quality. With fewer sampling iterations, TDAS can still generate good quality images. In contrast, the existing methods degrade drastically or even fail completely.
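    The baseline sampling loop described above, which repeatedly adds a gradient (score) term and Gaussian noise to an initial Gaussian sample, can be sketched with unadjusted Langevin dynamics. This toy version uses the analytic score of a one-dimensional Gaussian in place of a learned score network and does not implement TDAS; the function names, step size and step count are assumptions.

```python
import math
import random


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, std^2)."""
    return (mean - x) / (std * std)


def langevin_sample(score_fn, n_steps, step_size, rng):
    """Run unadjusted Langevin dynamics from a standard Gaussian start."""
    x = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step toward high density plus injected Gaussian noise.
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * noise
    return x


def target_score(x):
    """Score of the toy target N(3, 1); stands in for a trained network."""
    return gaussian_score(x, mean=3.0, std=1.0)


rng = random.Random(0)
samples = [langevin_sample(target_score, n_steps=200, step_size=0.1, rng=rng)
           for _ in range(500)]
sample_mean = sum(samples) / len(samples)
```

    The small step size and the long chain are what make this loop slow in practice, which is the cost the abstract's acceleration method targets.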
    Progress Report: A Deep Learning Guided Exploration of Affine Unimodular Loop Transformations. (arXiv:2206.03684v1 [cs.PL])
    In this paper, we present work in progress on a deep learning based approach for automatic code optimization in polyhedral compilers. The proposed technique explores combinations of affine and non-affine loop transformations to find the sequence of transformations that minimizes the execution time of a given program. This exploration is guided by a deep learning based cost model that evaluates the speedup that each sequence of transformations would yield. Preliminary results show that the proposed technique achieves a 2.35x geometric mean speedup over a state-of-the-art polyhedral compiler (Pluto).
    Sequential Density Estimation via Nonlinear Continuous Weighted Finite Automata. (arXiv:2206.03923v1 [cs.LG])
    Weighted finite automata (WFAs) have been widely applied in many fields. One of the classic problems for WFAs is probability distribution estimation over sequences of discrete symbols. Although WFAs have been extended to deal with continuous input data, namely continuous WFAs (CWFAs), it is still unclear how to approximate density functions over sequences of continuous random variables using WFA-based models, due to the limitation on the expressiveness of the model as well as the tractability of approximating density functions via CWFAs. In this paper, we first propose a nonlinear extension to the CWFA model to improve its expressiveness; we refer to it as nonlinear continuous WFAs (NCWFAs). Then we leverage the so-called RNADE method, which is a well-known density estimator based on neural networks, and propose the RNADE-NCWFA model. The RNADE-NCWFA model computes a density function by design. We show that this model is strictly more expressive than the Gaussian HMM model, which CWFA cannot approximate. Empirically, we conduct a synthetic experiment using Gaussian HMM generated data. We focus on evaluating the model's ability to estimate densities for sequences of varying lengths (longer than those in the training data). We observe that our model performs the best among the compared baseline methods.
    TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation. (arXiv:2206.03933v1 [cs.CL])
    We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
    Sim2real for Reinforcement Learning Driven Next Generation Networks. (arXiv:2206.03846v1 [cs.LG])
    The next generation of networks will actively embrace artificial intelligence (AI) and machine learning (ML) technologies for network automation and optimal network operation strategies. The emerging network structure represented by Open RAN (O-RAN) conforms to this trend, and the radio intelligent controller (RIC) at the centre of its specification serves as an ML applications host. Various ML models, especially Reinforcement Learning (RL) models, are regarded as the key to solving RAN-related multi-objective optimization problems. However, it should be recognized that most of the current RL successes are confined to abstract and simplified simulation environments, which may not directly translate to high performance in complex real environments. One of the main reasons is the modelling gap between the simulation and the real environment, which could make the RL agent trained by simulation ill-equipped for the real environment. This issue is termed the sim2real gap. This article brings to the fore the sim2real challenge within the context of O-RAN. Specifically, it emphasizes the characteristics and benefits that digital twins (DT) could have as a place for model development and verification. Several use cases are presented to exemplify and demonstrate failure modes of simulation-trained RL models in real environments. The effectiveness of DT in assisting the development of RL algorithms is discussed. Then the current state-of-the-art learning-based methods commonly used to overcome the sim2real challenge are presented. Finally, the development and deployment concerns for realising RL applications in O-RAN are discussed in view of potential issues such as data interaction, environment bottlenecks, and algorithm design.
    Robust Semantic Communications with Masked VQ-VAE Enabled Codebook. (arXiv:2206.04011v1 [eess.SP])
    Although semantic communications have exhibited satisfactory performance for a large number of tasks, the impact of semantic noise and the robustness of the systems have not been well investigated. Semantic noise refers to the mismatch between the intended semantic symbols and the received ones, which causes tasks to fail. In this paper, we first propose a framework for robust end-to-end semantic communication systems to combat the semantic noise. In particular, we analyze sample-dependent and sample-independent semantic noise. To combat the semantic noise, adversarial training with weight perturbation is developed to incorporate the samples with semantic noise in the training dataset. Then, we propose to mask a portion of the input, where the semantic noise appears frequently, and design the masked vector quantized-variational autoencoder (VQ-VAE) with the noise-related masking strategy. We use a discrete codebook shared by the transmitter and the receiver for encoded feature representation. To further improve the system robustness, we develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features. Thus, the transmitter simply needs to transmit the indices of these important task-related features in the codebook. Simulation results show that the proposed method can be applied in many downstream tasks and significantly improves the robustness against semantic noise with a remarkable reduction in the transmission overhead.
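    The shared-codebook idea, in which the transmitter sends only indices and the receiver looks the vectors up again, can be sketched with nearest-neighbour vector quantization. This is a minimal illustration of the quantization step only (no masking, no feature importance module, no training); the function names and the toy codebook are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, codebook):
    """Index of the codebook entry closest to the given feature vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def encode(features, codebook):
    """Transmitter side: map each feature vector to a codebook index."""
    return [nearest_index(feature, codebook) for feature in features]


def decode(indices, codebook):
    """Receiver side: recover quantized features from received indices."""
    return [codebook[i] for i in indices]


# Toy codebook known to both transmitter and receiver.
shared_codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, -0.2), (4.0, 6.0), (0.9, 1.2)]
sent_indices = encode(features, shared_codebook)
received = decode(sent_indices, shared_codebook)
```

    Only the integer indices cross the channel, which is where the reduction in transmission overhead mentioned in the abstract would come from.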
    Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. (arXiv:2206.04038v1 [cs.LG])
    The performance of time series forecasting has recently been greatly improved by the introduction of transformers. In this paper, we propose a general multi-scale framework that can be applied to state-of-the-art transformer-based time series forecasting models including Autoformer and Informer. By iteratively refining a forecasted time series at multiple scales with shared weights, architecture adaptations and a specially-designed normalization scheme, we are able to achieve significant performance improvements with minimal additional computational overhead. Via detailed ablation studies, we demonstrate the effectiveness of our proposed architectural and methodological innovations. Furthermore, our experiments on four public datasets show that the proposed multi-scale framework outperforms the corresponding baselines with an average improvement of 13% and 38% over Autoformer and Informer, respectively.
    Automatic Personality Prediction; an Enhanced Method Using Ensemble Modeling. (arXiv:2007.04571v3 [cs.CL] UPDATED)
    Human personality is significantly represented by the words a person uses in speech or writing. As a consequence of the spread of information infrastructures (specifically the Internet and social media), human communication has shifted notably away from face-to-face communication. Generally, Automatic Personality Prediction (or Perception) (APP) is the automated forecasting of personality from different types of human generated/exchanged content (such as text, speech, image, video, etc.). The major objective of this study is to enhance the accuracy of APP from text. To this end, we suggest five new APP methods: term frequency vector-based, ontology-based, enriched ontology-based, latent semantic analysis (LSA)-based, and deep learning-based (BiLSTM) methods. These base methods are combined to enhance APP accuracy through ensemble modeling (stacking), with a hierarchical attention network (HAN) as the meta-model. The results show that ensemble modeling enhances the accuracy of APP.
    Towards Bridging Algorithm and Theory for Unbiased Recommendation. (arXiv:2206.03851v1 [cs.IR])
    This work studies the problem of learning unbiased algorithms from biased feedback for recommender systems. We address this problem from both theoretical and algorithmic perspectives. Recent works in unbiased learning have advanced the state-of-the-art with various techniques such as meta-learning, knowledge distillation, and information bottleneck. Despite their empirical successes, most of them lack theoretical guarantees, leaving non-negligible gaps between the theory and recent algorithms. To this end, we first view the unbiased recommendation problem from a distribution shift perspective. We theoretically analyze the generalization bounds of unbiased learning and suggest their close relations with recent unbiased learning objectives. Based on the theoretical analysis, we further propose a principled framework, Adversarial Self-Training (AST), for unbiased recommendation. Empirical evaluation on real-world and semi-synthetic datasets demonstrates the effectiveness of the proposed AST.
    ConFUDA: Contrastive Fewshot Unsupervised Domain Adaptation for Medical Image Segmentation. (arXiv:2206.03888v1 [cs.CV])
    Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain. Contrastive learning (CL) in the context of UDA can help to better separate classes in feature space. However, in image segmentation, the large memory footprint due to the computation of the pixel-wise contrastive loss makes it prohibitive to use. Furthermore, labeled target data is not easily available in medical imaging, and obtaining new samples is not economical. As a result, in this work, we tackle a more challenging UDA task when there are only a few (fewshot) or a single (oneshot) image available from the target domain. We apply a style transfer module to mitigate the scarcity of target samples. Then, to align the source and target features and tackle the memory issue of the traditional contrastive loss, we propose the centroid-based contrastive learning (CCL) and a centroid norm regularizer (CNR) to optimize the contrastive pairs in both direction and magnitude. In addition, we propose multi-partition centroid contrastive learning (MPCCL) to further reduce the variance in the target features. Fewshot evaluation on MS-CMRSeg dataset demonstrates that ConFUDA improves the segmentation performance by 0.34 of the Dice score on the target domain compared with the baseline, and 0.31 Dice score improvement in a more rigorous oneshot setting.
    One Ring to Bring Them All: Towards Open-Set Recognition under Domain Shift. (arXiv:2206.03600v1 [cs.CV])
    In this paper, we investigate $\textit{open-set recognition}$ with domain shift, where the final goal is to achieve $\textit{Source-free Universal Domain Adaptation}$ (SF-UNDA), which addresses the situation where there exist both domain and category shifts between source and target domains. Under the SF-UNDA setting, the model cannot access source data anymore during target adaptation, which aims to address data privacy concerns. We propose a novel training scheme to learn a ($n$+1)-way classifier to predict the $n$ source classes and the unknown class, where samples of only known source categories are available for training. Furthermore, for target adaptation, we simply adopt a weighted entropy minimization to adapt the source pretrained model to the unlabeled target domain without source data. In experiments, we show: $\textbf{1)}$ After source training, the resulting source model can get excellent performance for $\textit{open-set single domain generalization}$ and also $\textit{open-set recognition}$ tasks; $\textbf{2)}$ After target adaptation, our method surpasses current UNDA approaches which demand source data during adaptation on several benchmarks. The versatility to several different tasks strongly proves the efficacy and generalization ability of our method. $\textbf{3)}$ When augmented with a closed-set domain adaptation approach during target adaptation, our source-free method further outperforms the current state-of-the-art UNDA method by 2.5%, 7.2% and 13% on Office-31, Office-Home and VisDA respectively. Code will be available at https://github.com/Albert0147/OneRing.
    An Analysis of Selection Bias Issue for Online Advertising. (arXiv:2206.03853v1 [cs.IR])
    In online advertising, a set of potential advertisements can be ranked by a certain auction system where usually the top-1 advertisement would be selected and displayed at an advertising space. In this paper, we show a selection bias issue that is present in an auction system. Our analysis shows that the selection bias destroys the truthfulness of the auction, which implies that the buyers (advertisers) in the auction cannot maximize their profits. Although selection bias is well known in the field of statistics and there are a lot of studies on it, our main contribution is to combine the theoretical analysis of the bias with the auction mechanism. In our experiment using online A/B testing, we evaluate the selection bias on an auction system whose ranking score is a function of the predicted CTR (click through rate) of the advertisement. The experiment showed that the selection bias is drastically reduced by using multi-task learning, which learns from the data for all advertisements.
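    The statistical effect behind top-1 selection bias can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis or the experiment from the paper: every ad is given the same true CTR, predictions are unbiased with Gaussian noise, and the ranking score is simply the predicted CTR. The function name and all parameter values are hypothetical.

```python
import random

def simulate_top1_bias(n_ads=10, true_ctr=0.05, noise_sd=0.01,
                       n_auctions=20000, seed=0):
    """Return (mean predicted CTR of the selected ads, true CTR)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_auctions):
        # Every ad has the same true CTR; each prediction is unbiased but noisy.
        preds = [true_ctr + rng.gauss(0.0, noise_sd) for _ in range(n_ads)]
        # Top-1 selection keeps the largest prediction, which favors
        # ads whose noise happened to be positive.
        total += max(preds)
    return total / n_auctions, true_ctr

mean_winner_pred, true_ctr = simulate_top1_bias()
print(f"true CTR: {true_ctr:.4f}")
print(f"mean predicted CTR of winners: {mean_winner_pred:.4f}")
```

    Although each individual prediction is unbiased, the prediction of the selected ad is systematically too high, because selection conditions on the noise being large.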
    Escaping the Big Data Paradigm with Compact Transformers. (arXiv:2104.05704v4 [cs.CV] UPDATED)
    With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as: limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size, and can have as few as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when training from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data-efficiency over previous Transformer-based models: it is over 10x smaller than other transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN based approaches, and even some recent NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data efficient transformers. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.
    How unfair is private learning? (arXiv:2206.03985v1 [cs.LG])
    As machine learning algorithms are deployed on sensitive data in critical decision making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and achieve higher accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR and CelebA), and tabular (Law School) datasets and learning algorithms.
    Diffusion Curvature for Estimating Local Curvature in High Dimensional Data. (arXiv:2206.03977v1 [cs.LG])
    We introduce a new intrinsic measure of local curvature on point-cloud data called diffusion curvature. Our measure uses the framework of diffusion maps, including the data diffusion operator, to structure point cloud data and define local curvature based on the laziness of a random walk starting at a point or region of the data. We show that this laziness directly relates to volume comparison results from Riemannian geometry. We then extend this scalar curvature notion to an entire quadratic form using neural network estimations based on the diffusion map of point-cloud data. We show applications of both estimations on toy data, single-cell data, and on estimating local Hessian matrices of neural network loss landscapes.
    Improving trajectory calculations using deep learning inspired single image superresolution. (arXiv:2206.04015v1 [physics.ao-ph])
    Lagrangian trajectory or particle dispersion models as well as semi-Lagrangian advection schemes require meteorological data such as wind, temperature and geopotential at the exact spatio-temporal locations of the particles that move independently from a regular grid. Traditionally, this high-resolution data has been obtained by interpolating the meteorological parameters from the gridded data of a meteorological model or reanalysis, e.g. using linear interpolation in space and time. However, interpolation errors are a large source of error for these models. Reducing them requires meteorological input fields with high space and time resolution, which may not always be available and can cause severe data storage and transfer problems. Here, we interpret this problem as a single image superresolution task. We interpret meteorological fields available at their native resolution as low-resolution images and train deep neural networks to up-scale them to higher resolution, thereby providing more accurate data for Lagrangian models. We train various versions of the state-of-the-art Enhanced Deep Residual Networks for Superresolution on low-resolution ERA5 reanalysis data with the goal to up-scale these data to arbitrary spatial resolution. We show that the resulting up-scaled wind fields have root-mean-squared errors half the size of the winds obtained with linear spatial interpolation at acceptable computational inference costs. In a test setup using the Lagrangian particle dispersion model FLEXPART and reduced-resolution wind fields, we demonstrate that absolute horizontal transport deviations of calculated trajectories from "ground-truth" trajectories calculated with undegraded 0.5° winds are reduced by at least 49.5% (21.8%) after 48 hours relative to trajectories using linear interpolation of the wind data when training on 2° to 1° (4° to 2°) resolution data.
    SYNERgy between SYNaptic consolidation and Experience Replay for general continual learning. (arXiv:2206.04016v1 [cs.NE])
    Continual learning (CL) in the brain is facilitated by a complex set of mechanisms. This includes the interplay of multiple memory systems for consolidating information as posited by the complementary learning systems (CLS) theory and synaptic consolidation for protecting the acquired knowledge from erasure. Thus, we propose a general CL method that creates a synergy between SYNaptic consolidation and dual memory Experience Replay (SYNERgy). Our method maintains a semantic memory that accumulates and consolidates information across the tasks and interacts with the episodic memory for effective replay. It further employs synaptic consolidation by tracking the importance of parameters during the training trajectory and anchoring them to the consolidated parameters in the semantic memory. To the best of our knowledge, our study is the first to employ dual memory experience replay in conjunction with synaptic consolidation that is suitable for general CL whereby the network does not utilize task boundaries or task labels during training or inference. Our evaluation on various challenging CL scenarios and characteristics analyses demonstrate the efficacy of incorporating both synaptic consolidation and CLS theory in enabling effective CL in DNNs.
    Performance, Transparency and Time. Feature selection to speed up the diagnosis of Parkinson's disease. (arXiv:2206.03716v1 [cs.LG])
    Accurate and early prediction of a disease makes it possible to plan for and improve a patient's future quality of life. During pandemic situations, medical decision making becomes a race against time in which physicians have to act fast to diagnose a disease and predict the risk of its severity; this is also a high priority for neurodegenerative diseases like Parkinson's disease. Machine Learning (ML) models with Feature Selection (FS) techniques can be applied to help physicians quickly diagnose a disease. FS selects an optimal subset of features that improves model performance and helps reduce the number of tests a patient needs, hence speeding up the diagnosis. This study shows the results of three FS techniques applied before a classifier algorithm, Logistic Regression, on non-invasive test results data. The three FS techniques are Analysis of Variance (ANOVA) as a filter-based method, Least Absolute Shrinkage and Selection Operator (LASSO) as an embedded method and Sequential Feature Selection (SFS) as a wrapper method. The outcome shows that FS techniques can help to build an efficient and effective classifier, improving the performance of the classifier while reducing the computation time.
    FEL: High Capacity Learning for Recommendation and Ranking via Federated Ensemble Learning. (arXiv:2206.03852v1 [cs.IR])
    Federated learning (FL) has emerged as an effective approach to address consumer privacy needs. FL has been successfully applied to certain machine learning tasks, such as training smart keyboard models and keyword spotting. Despite FL's initial success, many important deep learning use cases, such as ranking and recommendation tasks, have been limited from on-device learning. One of the key challenges faced by practical FL adoption for DL-based ranking and recommendation is the prohibitive resource requirements that cannot be satisfied by modern mobile systems. We propose Federated Ensemble Learning (FEL) as a solution to tackle the large memory requirement of deep learning ranking and recommendation tasks. FEL enables large-scale ranking and recommendation model training on-device by simultaneously training multiple model versions on disjoint clusters of client devices. FEL integrates the trained sub-models via an over-arch layer into an ensemble model that is hosted on the server. Our experiments demonstrate that FEL leads to 0.43-2.31% model quality improvement over traditional on-device federated learning - a significant improvement for ranking and recommendation system use cases.
    Blacklight: Defending Black-Box Adversarial Attacks on Deep Neural Networks. (arXiv:2006.14042v2 [cs.CR] UPDATED)
    Deep learning systems are known to be vulnerable to adversarial examples. In particular, query-based black-box attacks do not require knowledge of the deep learning model, but can compute adversarial examples over the network by submitting queries and inspecting returns. Recent work largely improves the efficiency of those attacks, demonstrating their practicality on today's ML-as-a-service platforms. We propose Blacklight, a new defense against query-based black-box adversarial attacks. The fundamental insight driving our design is that, to compute adversarial examples, these attacks perform iterative optimization over the network, producing image queries highly similar in the input space. Blacklight detects query-based black-box attacks by detecting highly similar queries, using an efficient similarity engine operating on probabilistic content fingerprints. We evaluate Blacklight against eight state-of-the-art attacks, across a variety of models and image classification tasks. Blacklight identifies them all, often after only a handful of queries. By rejecting all detected queries, Blacklight prevents any attack from completing, even when attackers persist in submitting queries after account ban or query rejection. Blacklight is also robust against several powerful countermeasures, including an optimal black-box attack that approximates white-box attacks in efficiency. Finally, we illustrate how Blacklight generalizes to other domains like text classification.
    Dual Windows Are Significant: Learning from Mediastinal Window and Focusing on Lung Window. (arXiv:2206.03803v1 [eess.IV])
    Since the outbreak of the COVID-19 pandemic, several deep learning methods have been proposed to analyze chest Computed Tomography (CT) for diagnosis. In the current situation, disease course classification is important for medical personnel deciding on treatment. Most previous deep-learning-based methods extract features observed from the lung window. However, it has been shown that some appearances related to diagnosis can be observed better from the mediastinal window than from the lung window, e.g., pulmonary consolidation occurs more often with severe symptoms. In this paper, we propose a novel Dual Window RCNN Network (DWRNet), which mainly learns the distinctive features from the successive mediastinal window. Regarding the features extracted from the lung window, we introduce the Lung Window Attention Block (LWA Block) to pay additional attention to them for enhancing the mediastinal-window features. Moreover, instead of picking specific slices from the whole set of CT slices, we use a Recurrent CNN and analyze successive slices as videos. Experimental results show that the fused and representative features improve the predictions of disease course by reaching an accuracy of 90.57%, against a baseline accuracy of 84.86%. Ablation studies demonstrate that combined dual window features are more efficient than lung-window features alone, while paying attention to lung-window features can improve the model's stability.
    "GAN I hire you?" -- A System for Personalized Virtual Job Interview Training. (arXiv:2206.03869v1 [cs.HC])
    Job interviews are usually high-stakes social situations where professional and behavioral skills are required for a satisfactory outcome. Professional job interview trainers give educative feedback about the shown behavior according to common standards. This feedback can be helpful concerning the improvement of behavioral skills needed for job interviews. A technological approach for generating such feedback might be a playful and low-key starting point for job interview training. Therefore, we extended an interactive virtual job interview training system with a Generative Adversarial Network (GAN)-based approach that first detects behavioral weaknesses and subsequently generates personalized feedback. To evaluate the usefulness of the generated feedback, we conducted a mixed-methods pilot study using mock-ups from the job interview training system. The overall study results indicate that the GAN-based generated behavioral feedback is helpful. Moreover, participants assessed that the feedback would improve their job interview performance.
    Out-of-Distribution Detection with Class Ratio Estimation. (arXiv:2206.03955v1 [stat.ML])
    Density-based out-of-distribution (OOD) detection has recently been shown to be unreliable for the task of detecting OOD images. Various density ratio based approaches achieve good empirical performance; however, these methods typically lack a principled probabilistic modelling explanation. In this work, we propose to unify density ratio based methods under a novel framework that builds energy-based models and employs differing base distributions. Under our framework, the density ratio can be viewed as the unnormalized density of an implicit semantic distribution. Further, we propose to directly estimate the density ratio of a data sample through class ratio estimation. We report competitive results on OOD image problems in comparison with recent work that alternatively requires training of deep generative models for the task. Our approach enables a simple and yet effective path towards solving the OOD detection problem.
    PrivHAR: Recognizing Human Actions From Privacy-preserving Lens. (arXiv:2206.03891v1 [cs.CV])
    The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimizing framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit privacy attributes and protect against adversarial attacks while maintaining relevant features for activity recognition. We validate our approach with extensive simulations and hardware experiments.
    Set Interdependence Transformer: Set-to-Sequence Neural Networks for Permutation Learning and Structure Prediction. (arXiv:2206.03720v1 [cs.LG])
    The task of learning to map an input set onto a permuted sequence of its elements is challenging for neural networks. Set-to-sequence problems occur in natural language processing, computer vision and structure prediction, where interactions between elements of large sets define the optimal output. Models must exhibit relational reasoning, handle varying cardinalities and manage combinatorial complexity. Previous attention-based methods require $n$ layers of their set transformations to explicitly represent $n$-th order relations. Our aim is to enhance their ability to efficiently model higher-order interactions through an additional interdependence component. We propose a novel neural set encoding method called the Set Interdependence Transformer, capable of relating the set's permutation invariant representation to its elements within sets of any cardinality. We combine it with a permutation learning module into a complete, 3-part set-to-sequence model and demonstrate its state-of-the-art performance on a number of tasks. These range from combinatorial optimization problems, through permutation learning challenges on both synthetic and established NLP datasets for sentence ordering, to a novel domain of product catalog structure prediction. Additionally, the network's ability to generalize to unseen sequence lengths is investigated and a comparative empirical analysis of the existing methods' ability to learn higher-order interactions is provided.
    A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. (arXiv:2007.05170v4 [math.OC] UPDATED)
    This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp. weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp. $\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy.
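    The two-timescale update pattern described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: the inner problem is 0.5*(y - x)^2 (so the inner solution is y*(x) = x), the outer problem is 0.5*(y*(x) - 1)^2 + 0.5*x^2 over x in [-2, 2] (minimized at x = 0.5), and the step-size schedules, noise level and function name are hypothetical choices made only so that the inner step is the larger one.

```python
import random

def ttsa_toy(num_iters=20000, noise_sd=0.1, seed=0):
    """Two-timescale stochastic approximation on a toy bilevel problem."""
    rng = random.Random(seed)
    x, y = 2.0, -2.0
    for k in range(1, num_iters + 1):
        alpha = 1.0 / k                   # smaller outer step (assumed schedule)
        beta = 1.0 / k ** (2.0 / 3.0)     # larger inner step (assumed schedule)
        # Noisy gradient of the inner objective 0.5*(y - x)^2 with respect to y.
        grad_y = (y - x) + rng.gauss(0.0, noise_sd)
        # Noisy hypergradient estimate: y stands in for y*(x), and
        # dy*/dx = 1 here, so d/dx [0.5*(y - 1)^2 + 0.5*x^2] = x + (y - 1).
        grad_x = x + (y - 1.0) + rng.gauss(0.0, noise_sd)
        # Inner variable: plain stochastic gradient step.
        y = y - beta * grad_y
        # Outer variable: stochastic gradient step projected onto [-2, 2].
        x = min(2.0, max(-2.0, x - alpha * grad_x))
    return x, y

x_final, y_final = ttsa_toy()
print(f"x = {x_final:.3f}, y = {y_final:.3f}")
```

    Because the inner variable moves on the faster timescale, y tracks y*(x) closely enough for the outer update to use it as a stand-in for the exact inner solution.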
    Motiflets -- Fast and Accurate Detection of Motifs in Time Series. (arXiv:2206.03735v1 [cs.LG])
    A motif intuitively is a short time series that repeats itself approximately the same within a larger time series. Such motifs often represent concealed structures, such as heart beats in an ECG recording, or sleep spindles in EEG sleep data. Motif discovery (MD) is the task of finding such motifs in a given input series. As there are varying definitions of what exactly a motif is, a number of algorithms exist. As central parameters they all take the length l of the motif and the maximal distance r between the motif's occurrences. In practice, however, suitable values for r are very hard to determine upfront, and the found motifs show a high variability. Setting the wrong input value will result in a motif that is not distinguishable from noise. Accordingly, finding an interesting motif with these methods requires extensive trial-and-error. We present a different approach to the MD problem. We define k-Motiflets as the set of exactly k occurrences of a motif of length l, whose maximum pairwise distance is minimal. This turns the MD problem upside-down: Our central parameter is not the distance threshold r, but the desired size k of a motif set, which we show is considerably more intuitive and easier to set. Based on this definition, we present exact and approximate algorithms for finding k-Motiflets and analyze their complexity. To further ease the use of our method, we describe extensions to automatically determine the right/suitable values for its input parameters. Thus, for the first time, extracting meaningful motif sets without any a-priori knowledge becomes feasible. By evaluating real-world use cases and comparison to 4 state-of-the-art MD algorithms, we show that our proposed algorithm is (a) quantitatively superior, finding larger motif sets at higher similarity, (b) qualitatively better, leading to clearer and easier to interpret motifs, and (c) has the lowest runtime.
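    The k-Motiflet definition above can be made concrete with a brute-force search. This is a sketch of the definition only, not the exact or approximate algorithms from the paper. It assumes z-normalized Euclidean distance between subsequences and treats two occurrences as distinct only if they do not overlap; both choices, and all names, are assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def znorm(window):
    """Z-normalize a window; constant windows map to all zeros."""
    mean = sum(window) / len(window)
    sd = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if sd == 0.0:
        return [0.0] * len(window)
    return [(v - mean) / sd for v in window]

def brute_force_k_motiflet(series, l, k):
    """Return (start indices, extent) of the k non-overlapping subsequences
    of length l whose maximum pairwise distance is minimal. Requires k >= 2."""
    starts = range(len(series) - l + 1)
    windows = [znorm(series[s:s + l]) for s in starts]
    best_extent, best_set = math.inf, None
    for cand in combinations(starts, k):
        # Candidates are sorted, so checking neighbors rules out any overlap.
        if any(b - a < l for a, b in zip(cand, cand[1:])):
            continue
        extent = max(math.dist(windows[a], windows[b])
                     for a, b in combinations(cand, 2))
        if extent < best_extent:
            best_extent, best_set = extent, cand
    return best_set, best_extent

# Noise series with the same pattern planted at three positions.
rng = random.Random(0)
series = [rng.gauss(0.0, 1.0) for _ in range(50)]
pattern = [0.0, 2.0, 5.0, 2.0, 0.0]
planted = (5, 20, 35)
for start in planted:
    series[start:start + len(pattern)] = pattern

motiflet, extent = brute_force_k_motiflet(series, l=5, k=3)
print(motiflet, extent)
```

    The only inputs are the motif length l and the set size k; no distance threshold r is needed. The exhaustive search is exponential in k and is practical only for tiny inputs like this one.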
    Contributor-Aware Defenses Against Adversarial Backdoor Attacks. (arXiv:2206.03583v1 [cs.CR])
    Deep neural networks for image classification are well-known to be vulnerable to adversarial attacks. One such attack that has garnered recent attention is the adversarial backdoor attack, which has demonstrated the capability to perform targeted misclassification of specific examples. In particular, backdoor attacks attempt to force a model to learn spurious relations between backdoor trigger patterns and false labels. In response to this threat, numerous defensive measures have been proposed; however, defenses against backdoor attacks focus on backdoor pattern detection, which may be unreliable against novel or unexpected types of backdoor pattern designs. We introduce a novel re-contextualization of the adversarial setting, where the presence of an adversary implicitly admits the existence of multiple database contributors. Then, under the mild assumption of contributor awareness, it becomes possible to exploit this knowledge to defend against backdoor attacks by destroying the false label associations. We propose a contributor-aware universal defensive framework for learning in the presence of multiple, potentially adversarial data sources that utilizes semi-supervised ensembles and learning from crowds to filter the false labels produced by adversarial triggers. Importantly, this defensive strategy is agnostic to backdoor pattern design, as it functions without needing -- or even attempting -- to perform either adversary identification or backdoor pattern detection during either training or inference. Our empirical studies demonstrate the robustness of the proposed framework against adversarial backdoor attacks from multiple simultaneous adversaries.
    Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression. (arXiv:2206.03665v1 [cs.LG])
    Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication. While there have been many results on convergence rates under communication compression, a theoretical lower bound is still missing. Analyses of algorithms with communication compression have attributed convergence to two abstract properties: the unbiased property or the contractive property. They can be applied with either unidirectional compression (only messages from workers to server are compressed) or bidirectional compression. In this paper, we consider distributed stochastic algorithms for minimizing smooth and non-convex objective functions under communication compression. We establish a convergence lower bound for algorithms whether using unbiased or contractive compressors in unidirection or bidirection. To close the gap between the lower bound and the existing upper bounds, we further propose an algorithm, NEOLITHIC, which almost reaches our lower bound (up to logarithm factors) under mild conditions. Our results also show that using contractive bidirectional compression can yield iterative methods that converge as fast as those using unbiased unidirectional compression. The experimental results validate our findings.
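    The two compressor properties mentioned above can be illustrated with two standard textbook compressors. This sketch does not implement NEOLITHIC or anything specific to the paper. It assumes the usual definitions: top-k as an example of a contractive compressor, satisfying ||C(x) - x||^2 <= (1 - k/d) ||x||^2, and rescaled random-k as an example of an unbiased compressor, satisfying E[C(x)] = x.

```python
import random

def top_k(x, k):
    """Contractive compressor: keep the k largest-magnitude entries."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out

def rand_k(x, k, rng):
    """Unbiased compressor: keep k random entries, rescaled by d/k."""
    d = len(x)
    out = [0.0] * d
    for i in rng.sample(range(d), k):
        out[i] = x[i] * d / k
    return out

def sq_norm(v):
    return sum(a * a for a in v)

x = [3.0, -1.0, 0.5, 2.0]
k, d = 2, len(x)

# Contractive property of top-k: the compression error shrinks relative to x.
compressed = top_k(x, k)
topk_error = sq_norm([a - b for a, b in zip(compressed, x)])
contraction_bound = (1.0 - k / d) * sq_norm(x)

# Unbiasedness of rand-k: the average of many compressed copies approaches x.
rng = random.Random(0)
n_samples = 20000
avg = [0.0] * d
for _ in range(n_samples):
    for i, v in enumerate(rand_k(x, k, rng)):
        avg[i] += v / n_samples

print("top-k error:", topk_error, "bound:", contraction_bound)
print("rand-k average:", [round(v, 2) for v in avg])
```

    Top-k is biased but always reduces the error deterministically, while rand-k is correct only on average and has higher variance; this is the distinction between the contractive and the unbiased property.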
    Classification of Stochastic Processes with Topological Data Analysis. (arXiv:2206.03973v1 [stat.ML])
    In this study, we examine if engineered topological features can distinguish time series sampled from different stochastic processes with different noise characteristics, in both balanced and unbalanced sampling schemes. We compare our classification results against the results of the same classification tasks built on statistical and raw features. We conclude that in classification tasks of time series, different machine learning models built on engineered topological features perform consistently better than those built on standard statistical and raw features.
    Click Prediction Boosting via Ensemble Learning Pipelines. (arXiv:2206.03592v1 [cs.LG])
    Online travel agencies (OTAs) advertise their website offers on meta-search bidding engines. The problem of predicting the number of clicks a hotel would receive for a given bid amount is an important step in the management of an OTA's advertisement campaign on a meta-search engine because bid times number of clicks defines the cost to be generated. Various regressors are ensembled in this work to improve click prediction performance. Following the preprocessing procedures, the feature set is divided into train and test groups depending on the samples' logging dates. The data collection is then subjected to XGBoost-based dimension reduction, which significantly reduces the dimension of features. The optimum hyper-parameters are then found by applying Bayesian Hyper-parameter optimization to the XGBoost, LightGBM, and SGD models. Individually, ten distinct machine learning models are tested, as well as combining them to create ensemble models. Three alternative ensemble solutions have been suggested. The same test set is used to test both individual and ensemble models, and the results of 46 model combinations demonstrate that stack ensemble models yield the desired R2 score of all. In conclusion, the ensemble model improves the prediction performance by about 10%.
    Decoupled Self-supervised Learning for Non-Homophilous Graphs. (arXiv:2206.03601v1 [cs.LG])
    In this paper, we study the problem of conducting self-supervised learning for node representation learning on non-homophilous graphs. Existing self-supervised learning methods typically assume the graph is homophilous where linked nodes often belong to the same class or have similar features. However, such assumptions of homophily do not always hold true in real-world graphs. We address this problem by developing a decoupled self-supervised learning (DSSL) framework for graph neural networks. DSSL imitates a generative process of nodes and links from latent variable modeling of the semantic structure, which decouples different underlying semantics between different neighborhoods into the self-supervised node learning process. Our DSSL framework is agnostic to the encoders and does not need prefabricated augmentations, and is thus flexible to different graphs. To effectively optimize the framework with latent variables, we derive the evidence lower-bound of the self-supervised objective and develop a scalable training algorithm with variational inference. We provide a theoretical analysis to justify that DSSL enjoys better downstream performance. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework achieves significantly better performance compared with competitive self-supervised learning baselines.
    Certifying Data-Bias Robustness in Linear Regression. (arXiv:2206.03575v1 [cs.LG])
    Datasets typically contain inaccuracies due to human error and societal biases, and these inaccuracies can affect the outcomes of models trained on such datasets. We present a technique for certifying whether linear regression models are pointwise-robust to label bias in the training dataset, i.e., whether bounded perturbations to the labels of a training dataset result in models that change the prediction of test points. We show how to solve this problem exactly for individual test points, and provide an approximate but more scalable method that does not require advance knowledge of the test point. We extensively evaluate both techniques and find that linear models -- both regression- and classification-based -- often display high levels of bias-robustness. However, we also unearth gaps in bias-robustness, such as high levels of non-robustness for certain bias assumptions on some datasets. Overall, our approach can serve as a guide for when to trust, or question, a model's output.
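    The pointwise certification idea above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes ordinary least squares, assumes every training label may shift by at most `eps`, and uses hypothetical names (`prediction_interval_under_label_bias`, `is_certified`, `threshold`). Because the OLS prediction at a fixed test point is linear in the labels, the worst-case shift has a closed form.

```python
import numpy as np

def prediction_interval_under_label_bias(X, y, x_test, eps):
    """Range of OLS predictions at x_test when each label may shift by at most eps."""
    # The OLS prediction is linear in the labels: y_hat = a @ y,
    # where a = X (X^T X)^{-1} x_test.
    a = X @ np.linalg.solve(X.T @ X, x_test)
    center = float(a @ y)
    # The largest value of a @ delta subject to |delta_i| <= eps is eps * ||a||_1.
    radius = float(eps * np.abs(a).sum())
    return center - radius, center + radius

def is_certified(X, y, x_test, eps, threshold=0.0):
    """True if no admissible label perturbation moves the prediction across threshold."""
    lo, hi = prediction_interval_under_label_bias(X, y, x_test, eps)
    return lo > threshold or hi < threshold

# Toy data (hypothetical): three training points, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
x_test = np.array([1.0, 1.0])
lo, hi = prediction_interval_under_label_bias(X, y, x_test, eps=0.3)
```

    In this toy setting the test point is certified whenever the whole interval lies on one side of the decision threshold; a threshold inside the interval means some admissible label perturbation could change the outcome.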
    Hub-Pathway: Transfer Learning from A Hub of Pre-trained Models. (arXiv:2206.03726v1 [cs.LG])
    Transfer learning aims to leverage knowledge from pre-trained models to benefit the target task. Prior transfer learning work mainly transfers from a single model. However, with the emergence of deep models pre-trained from different resources, model hubs consisting of diverse models with various architectures, pre-training datasets and learning paradigms are available. Directly applying single-model transfer learning methods to each model wastes the abundant knowledge of the model hub and suffers from high computational cost. In this paper, we propose a Hub-Pathway framework to enable knowledge transfer from a model hub. The framework generates data-dependent pathway weights, based on which we assign the pathway routes at the input level to decide which pre-trained models are activated and passed through, and then set the pathway aggregation at the output level to aggregate the knowledge from different models to make predictions. The proposed framework can be trained end-to-end with the target task-specific loss, where it learns to explore better pathway configurations and exploit the knowledge in pre-trained models for each target datum. We utilize a noisy pathway generator and design an exploration loss to further explore different pathways throughout the model hub. To fully exploit the knowledge in pre-trained models, each model is further trained on the specific data that activate it, which ensures its performance and enhances knowledge transfer. Experimental results on computer vision and reinforcement learning tasks demonstrate that the proposed Hub-Pathway framework achieves state-of-the-art performance for model hub transfer learning.
    A Privacy-Preserving Subgraph-Level Federated Graph Neural Network via Differential Privacy. (arXiv:2206.03492v1 [cs.CR])
    Federated graph neural networks (GNNs) have attracted a lot of attention because they can be widely applied in practice without violating privacy regulations. Among privacy-preserving technologies, differential privacy (DP) is the most promising due to its effectiveness and light computational overhead. However, DP-based federated GNNs have not been well investigated, especially in the subgraph-level setting, such as the recommendation system scenario. The biggest challenge is how to guarantee privacy and handle non-independent and identically distributed (non-IID) data in federated GNNs simultaneously. In this paper, we propose DP-FedRec, a DP-based federated GNN to fill the gap. Private Set Intersection (PSI) is leveraged to extend the local graph for each client, and thus solve the non-IID problem. Most importantly, DP is applied not only to the weights but also to the edges of the intersection graph from PSI to fully protect the privacy of clients. The evaluation demonstrates that DP-FedRec achieves better performance with the graph extension and that DP introduces only a small computational overhead.
    Network Report: A Structured Description for Network Datasets. (arXiv:2206.03635v1 [cs.SI])
    The rapid development of network science and technologies depends on shareable datasets. Currently, there is no standard practice for reporting and sharing network datasets. Some network dataset providers only share links, while others provide some context or basic statistics. As a result, critical information may be unintentionally dropped, and network dataset consumers may misunderstand or overlook critical aspects. Inappropriately using a network dataset can lead to severe consequences (e.g., discrimination), especially when machine learning models on networks are deployed in high-stakes domains. Challenges arise as networks are often used across different domains (e.g., network science, physics, etc.) and have complex structures. To facilitate communication between network dataset providers and consumers, we propose the network report. A network report is a structured description that summarizes and contextualizes a network dataset. The network report extends the idea of dataset reports (e.g., Datasheets for Datasets) from prior work with network-specific descriptions of the non-i.i.d. nature, demographic information, network characteristics, etc. We hope network reports encourage transparency and accountability in network research and development across different fields.
    Toward Certified Robustness Against Real-World Distribution Shifts. (arXiv:2206.03669v1 [cs.LG])
    We consider the problem of certifying the robustness of deep neural networks against real-world distribution shifts. To do so, we bridge the gap between hand-crafted specifications and realistic deployment settings by proposing a novel neural-symbolic verification framework, in which we train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model. A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations, which are fundamental to many state-of-the-art generative models. To address this challenge, we propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement. The key idea is to "lazily" refine the abstraction of sigmoid functions to exclude spurious counter-examples found in the previous abstraction, thus guaranteeing progress in the verification process while keeping the state-space small. Experiments on the MNIST and CIFAR-10 datasets show that our framework significantly outperforms existing methods on a range of challenging distribution shifts.
    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. (arXiv:2206.03688v1 [cs.LG])
    A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first-order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al., 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient-based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end-to-end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
    Unsupervised Single-shot Depth Estimation using Perceptual Reconstruction. (arXiv:2201.12170v4 [cs.CV] UPDATED)
    Real-time estimation of actual object depth is an essential module for various autonomous system tasks such as 3D reconstruction, scene understanding and condition assessment. During the last decade of machine learning, extensive deployment of deep learning methods to computer vision tasks has yielded approaches that succeed in achieving realistic depth synthesis out of a simple RGB modality. Most of these models are based on paired RGB-depth data and/or the availability of video sequences and stereo images. The lack of sequences, stereo data and RGB-depth pairs makes depth estimation a fully unsupervised single-image transfer problem that has barely been explored so far. This study builds on recent advances in the field of generative neural networks in order to establish fully unsupervised single-shot depth estimation. Two generators for RGB-to-depth and depth-to-RGB transfer are implemented and simultaneously optimized using the Wasserstein-1 distance, a novel perceptual reconstruction term and hand-crafted image filters. We comprehensively evaluate the models using industrial surface depth data as well as the Texas 3D Face Recognition Database, the CelebAMask-HQ database of human portraits and the SURREAL dataset that records body depth. For each evaluation dataset the proposed method shows a significant increase in depth accuracy compared to state-of-the-art single-image transfer methods.
    High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. (arXiv:2206.04030v1 [stat.ML])
    We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on these choices. Interestingly, we find a critical scaling regime for the step-size below which the effective ballistic dynamics matches gradient flow for the population loss, but at which a new correction term appears that changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations.
    Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (arXiv:2202.01136v3 [cs.LG] UPDATED)
    Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.
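    The notion of robustness to most rather than all perturbations can be made concrete with a small Monte Carlo sketch. This is an illustration under stated assumptions, not the paper's risk-aware algorithm: it assumes a linear binary classifier with labels in {-1, +1} and perturbations drawn uniformly from an L-infinity ball, and the names `violation_rate`, `is_probabilistically_robust`, and the tolerance `rho` are hypothetical.

```python
import numpy as np

def violation_rate(w, b, x, label, radius, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the fraction of perturbations that cause an error."""
    rng = np.random.default_rng(seed)
    # Assumption: perturbations are uniform in the L-infinity ball of this radius.
    deltas = rng.uniform(-radius, radius, size=(n_samples, x.shape[0]))
    scores = (x + deltas) @ w + b
    preds = np.where(scores >= 0, 1, -1)
    return float(np.mean(preds != label))

def is_probabilistically_robust(w, b, x, label, radius, rho):
    """Robust to most perturbations: at most a fraction rho may cause an error."""
    return violation_rate(w, b, x, label, radius) <= rho

# Hypothetical linear classifier sign(w @ x + b) and two inputs with label +1.
w = np.array([1.0, 0.0])
b = 0.0
far_point = np.array([1.0, 0.0])    # margin exceeds the perturbation radius
near_point = np.array([0.4, 0.0])   # a small share of perturbations cross the boundary
rate_far = violation_rate(w, b, far_point, 1, radius=0.5)
rate_near = violation_rate(w, b, near_point, 1, radius=0.5)
```

    Here `near_point` is not robust in the worst-case sense, since some perturbation flips its prediction, yet it can still pass the probabilistic check for a moderate tolerance such as `rho=0.2`.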
    Machine-Learning the Classification of Spacetimes. (arXiv:2201.01644v2 [gr-qc] UPDATED)
    We take a novel perspective on the long-established classification problems in general relativity by adopting fruitful techniques from machine learning and modern data science. In particular, we model Petrov's classification of spacetimes, and show that a feed-forward neural network can achieve a high degree of success. We also show how data visualization techniques with dimensionality reduction can help analyze the underlying patterns in the structure of the different types of spacetimes.
    Decentralized Safe Multi-agent Stochastic Optimal Control using Deep FBSDEs and ADMM. (arXiv:2202.10658v2 [cs.MA] UPDATED)
    In this work, we propose a novel safe and scalable decentralized solution for multi-agent control in the presence of stochastic disturbances. Safety is mathematically encoded using stochastic control barrier functions and safe controls are computed by solving quadratic programs. Decentralization is achieved by augmenting each agent's optimization variables with copy variables for its neighbors. This allows us to decouple the centralized multi-agent optimization problem. However, to ensure safety, neighboring agents must agree on "what is safe for both of us" and this creates a need for consensus. To enable safe consensus solutions, we incorporate an ADMM-based approach. Specifically, we propose a Merged CADMM-OSQP implicit neural network layer that solves a mini-batch of both the local quadratic programs and the overall consensus problem as a single optimization problem. This layer is embedded within a Deep FBSDEs network architecture at every time step, to facilitate end-to-end differentiable, safe and decentralized stochastic optimal control. The efficacy of the proposed approach is demonstrated on several challenging multi-robot tasks in simulation. By imposing requirements on safety specified by collision avoidance constraints, the safe operation of all agents is ensured during the entire training process. We also demonstrate superior scalability in terms of computational and memory savings as compared to a centralized approach.
    Decision-Focused Learning without Decision-Making: Learning Locally Optimized Decision Losses. (arXiv:2203.16067v2 [cs.LG] UPDATED)
    Decision-Focused Learning (DFL) is a paradigm for tailoring a predictive model to a downstream optimization task that uses its predictions in order to perform better on that specific task. The main technical challenge associated with DFL is that it requires being able to differentiate through the optimization problem, which is difficult due to discontinuous solutions and other challenges. Past work has largely gotten around this issue by handcrafting task-specific surrogates to the original optimization problem that provide informative gradients when differentiated through. However, the need to handcraft surrogates for each new task limits the usability of DFL. In addition, there are often no guarantees about the convexity of the resulting surrogates and, as a result, training a predictive model using them can lead to inferior local optima. In this paper, we do away with surrogates altogether and instead learn loss functions that capture task-specific information. To the best of our knowledge, ours is the first approach that entirely replaces the optimization component of decision-focused learning with a loss that is automatically learned. Our approach (a) only requires access to a black-box oracle that can solve the optimization problem and is thus generalizable, and (b) can be convex by construction and so can be easily optimized over. We evaluate our approach on three resource allocation problems from the literature and find that our approach outperforms learning without taking into account task-structure in all three domains, and even hand-crafted surrogates from the literature.
    Boosting the Confidence of Generalization for $L_2$-Stable Randomized Learning Algorithms. (arXiv:2206.03834v1 [stat.ML])
    Exponential generalization bounds with near-tight rates have recently been established for uniformly stable learning algorithms. The notion of uniform stability, however, is stringent in the sense that it is invariant to the data-generating distribution. Under the weaker and distribution-dependent notions of stability such as hypothesis stability and $L_2$-stability, the literature suggests that only polynomial generalization bounds are possible in general cases. The present paper addresses this long-standing tension between these two regimes of results and makes progress towards relaxing it inside a classic framework of confidence-boosting. To this end, we first establish an in-expectation first moment generalization error bound for potentially randomized learning algorithms with $L_2$-stability, based on which we then show that a properly designed subbagging process leads to near-tight exponential generalization bounds over the randomness of both data and algorithm. We further specialize these generic results to stochastic gradient descent (SGD) to derive improved high-probability generalization bounds for convex or non-convex optimization problems with natural time-decaying learning rates, which have not been possible to prove with the existing hypothesis stability or uniform stability based results.
    Multi-channel neural networks for predicting influenza A virus hosts and antigenic types. (arXiv:2206.03823v1 [q-bio.QM])
    Influenza occurs every season and occasionally causes pandemics. Despite its low mortality rate, influenza is a major public health concern, as it can be complicated by severe diseases like pneumonia. A fast, accurate and low-cost method to predict the origin host and subtype of influenza viruses could help reduce virus transmission and benefit resource-poor areas. In this work, we propose multi-channel neural networks to predict antigenic types and hosts of influenza A viruses with hemagglutinin and neuraminidase protein sequences. An integrated data set containing complete protein sequences was used to produce a pre-trained model, and two other data sets were used for testing the model's performance. One test set contained complete protein sequences, and another test set contained incomplete protein sequences. The results suggest that multi-channel neural networks are applicable and promising for predicting influenza A virus hosts and antigenic subtypes with complete and partial protein sequences.
    NOMAD: Nonlinear Manifold Decoders for Operator Learning. (arXiv:2206.03551v1 [cs.LG])
    Supervised learning in function spaces is an emerging area of machine learning research with applications to the prediction of complex physical systems such as fluid flows, solid mechanics, and climate modeling. By directly learning maps (operators) between infinite dimensional function spaces, these models are able to learn discretization invariant representations of target functions. A common approach is to represent such target functions as linear combinations of basis elements learned from data. However, there are simple scenarios where, even though the target functions form a low dimensional submanifold, a very large number of basis elements is needed for an accurate linear representation. Here we present NOMAD, a novel operator learning framework with a nonlinear decoder map capable of learning finite dimensional representations of nonlinear submanifolds in function spaces. We show this method is able to accurately learn low dimensional representations of solution manifolds to partial differential equations while outperforming linear models of larger size. Additionally, we compare to state-of-the-art operator learning methods on a complex fluid dynamics benchmark and achieve competitive performance with a significantly smaller model size and training cost.
    Transfer learning to decode brain states reflecting the relationship between cognitive tasks. (arXiv:2206.03950v1 [q-bio.NC])
    Transfer learning improves the performance of the target task by leveraging the data of a specific source task: the closer the relationship between the source and the target tasks, the greater the performance improvement by transfer learning. In neuroscience, the relationship between cognitive tasks is usually represented by similarity of activated brain regions or neural representation. However, no study has linked transfer learning and neuroscience to reveal the relationship between cognitive tasks. In this study, we propose a transfer learning framework to reflect the relationship between cognitive tasks, and compare the task relations reflected by transfer learning and by the overlaps of brain regions (e.g., neurosynth). Our transfer learning results yield a cognitive taskonomy that reflects the relationship between cognitive tasks and is well in line with the task relations derived from neurosynth. Transfer learning performs better in task decoding with fMRI data if the source and target cognitive tasks activate similar brain regions. Our study uncovers the relationship of multiple cognitive tasks and provides guidance for source task selection in transfer learning for neural decoding based on small-sample data.
    Hybrid Physics and Deep Learning Model for Interpretable Vehicle State Prediction. (arXiv:2103.06727v3 [cs.LG] UPDATED)
    Physical motion models offer interpretable predictions for the motion of vehicles. However, some model parameters, such as those related to aero- and hydrodynamics, are expensive to measure and are often only roughly approximated, reducing prediction accuracy. Recurrent neural networks achieve high prediction accuracy at low cost, as they can use cheap measurements collected during routine operation of the vehicle, but their results are hard to interpret. To precisely predict vehicle states without expensive measurements of physical parameters, we propose a hybrid approach combining deep learning and physical motion models including a novel two-phase training procedure. We achieve interpretability by restricting the output range of the deep neural network as part of the hybrid model, which limits the uncertainty introduced by the neural network to a known quantity. We have evaluated our approach for the use case of ship and quadcopter motion. The results show that our hybrid model can improve model interpretability with no decrease in accuracy compared to existing deep learning approaches.
    Mathematical model bridges disparate timescales of lifelong learning. (arXiv:2206.03954v1 [physics.soc-ph])
    Lifelong learning occurs on timescales ranging from minutes to decades. People can lose themselves in a new skill, practicing for hours until exhausted. And they can pursue mastery over days or decades, perhaps abandoning old skills entirely to seek out new challenges. A full understanding of learning requires an account that integrates these timescales. Here, we present a minimal quantitative model that unifies the nested timescales of learning. Our dynamical model recovers classic accounts of skill acquisition, and describes how learning emerges from moment-to-moment dynamics of motivation, fatigue, and work, while also situated within longer-term dynamics of skill selection, mastery, and abandonment. We apply this model to explore the benefits and pitfalls of a variety of training regimes and to characterize individual differences in motivation and skill development. Our model connects previously disparate timescales -- and the subdisciplines that typically study each timescale in isolation -- to offer a unified account of the timecourse of skill acquisition.
    FedHPO-B: A Benchmark Suite for Federated Hyperparameter Optimization. (arXiv:2206.03966v1 [cs.LG])
    Hyperparameter optimization (HPO) is crucial for machine learning algorithms to achieve satisfactory performance, and its progress has been boosted by related benchmarks. Nonetheless, existing benchmarking efforts all focus on HPO for traditional centralized learning while ignoring federated learning (FL), a promising paradigm for collaboratively learning models from dispersed data. In this paper, we first identify several unique aspects of HPO for FL algorithms. Because of these, existing HPO benchmarks no longer satisfy the need to compare HPO methods in the FL setting. To facilitate the research of HPO in the FL setting, we propose and implement a benchmark suite FedHPO-B that incorporates comprehensive FL tasks, enables efficient function evaluations, and eases continuing extensions. We also conduct extensive experiments based on FedHPO-B to benchmark a few HPO methods. We open-source FedHPO-B at https://github.com/alibaba/FederatedScope/tree/master/benchmark/FedHPOB and will maintain it actively.
    Boundary between noise and information applied to filtering neural network weight matrices. (arXiv:2206.03927v1 [cond-mat.dis-nn])
    Deep neural networks have been successfully applied to a broad range of problems where overparametrization yields weight matrices which are partially random. A comparison of weight matrix singular vectors to the Porter-Thomas distribution suggests that there is a boundary between randomness and learned information in the singular value spectrum. Inspired by this finding, we introduce an algorithm for noise filtering, which both removes small singular values and reduces the magnitude of large singular values to counteract the effect of level repulsion between the noise and the information part of the spectrum. For networks trained in the presence of label noise, we indeed find that the generalization performance improves significantly due to noise filtering.
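    The two-part filtering step described above (dropping small singular values and shrinking the large ones) can be sketched as follows. This is only an illustration: the abstract does not give the cutoff rule or the shrinkage formula, so `noise_cutoff` and the constant subtractive `shrink` are hypothetical placeholders rather than the paper's choices.

```python
import numpy as np

def filter_weight_matrix(W, noise_cutoff, shrink=0.0):
    """Zero out singular values at or below noise_cutoff and shrink the rest."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Singular values below the assumed noise boundary are treated as random.
    keep = s > noise_cutoff
    # Placeholder shrinkage: subtract a constant from the retained values,
    # clipped at zero. The paper's actual correction may differ.
    s_filtered = np.where(keep, np.maximum(s - shrink, 0.0), 0.0)
    # Rebuild the matrix from the filtered spectrum.
    return (U * s_filtered) @ Vt

# Toy matrix with known singular values 5, 3 and 0.5.
W = np.diag([5.0, 3.0, 0.5])
W_filtered = filter_weight_matrix(W, noise_cutoff=1.0, shrink=0.2)
filtered_spectrum = np.linalg.svd(W_filtered, compute_uv=False)
```

    On the toy matrix, the smallest singular value falls below the cutoff and is removed, while the two larger ones are each reduced by the placeholder shrinkage.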
    Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners. (arXiv:2206.04046v1 [cs.CV])
    Domain generalization (DG) aims at learning generalizable models under distribution shifts to avoid redundantly overfitting massive training data. Previous works with complex loss designs and gradient constraints have not yet led to empirical success on large-scale benchmarks. In this work, we reveal the mixture-of-experts (MoE) model's generalizability on DG by leveraging it to distributively handle multiple aspects of the predictive features across domains. To this end, we propose Sparse Fusion Mixture-of-Experts (SF-MoE), which incorporates sparsity and fusion mechanisms into the MoE framework to keep the model both sparse and predictive. SF-MoE has two dedicated modules: 1) sparse block and 2) fusion block, which disentangle and aggregate the diverse learned signals of an object, respectively. Extensive experiments demonstrate that SF-MoE is a domain-generalizable learner on large-scale benchmarks. It outperforms state-of-the-art counterparts by more than 2% across 5 large-scale DG datasets (e.g., DomainNet), with the same or even lower computational costs. We further reveal the internal mechanism of SF-MoE from a distributed representation perspective (e.g., visual attributes). We hope this framework could facilitate future research to push generalizable object recognition to the real world. Code and models are released at https://github.com/Luodian/SF-MoE-DG.
    Scalable Online Disease Diagnosis via Multi-Model-Fused Actor-Critic Reinforcement Learning. (arXiv:2206.03659v1 [cs.LG])
    For those seeking healthcare advice online, AI-based dialogue agents capable of interacting with patients to perform automatic disease diagnosis are a viable option. This application necessitates efficient inquiry of relevant disease symptoms in order to make accurate diagnosis recommendations. This can be formulated as a problem of sequential feature (symptom) selection and classification for which reinforcement learning (RL) approaches have been proposed as a natural solution. They perform well when the feature space is small, that is, the number of symptoms and diagnosable disease categories is limited, but they frequently fail in tasks with a large number of features. To address this challenge, we propose a Multi-Model-Fused Actor-Critic (MMF-AC) RL framework that consists of a generative actor network and a diagnostic critic network. The actor incorporates a Variational AutoEncoder (VAE) to model the uncertainty induced by partial observations of features, thereby facilitating appropriate inquiries. In the critic network, a supervised diagnosis model for disease prediction is used to precisely estimate the state-value function. Furthermore, inspired by the medical concept of differential diagnosis, we combine the generative and diagnosis models to create a novel reward shaping mechanism to address the sparse reward problem in large search spaces. We conduct extensive experiments on both synthetic and real-world datasets for empirical evaluations. The results demonstrate that our approach outperforms state-of-the-art methods in terms of diagnostic accuracy and interaction efficiency while also scaling more effectively to large search spaces. In addition, our method is adaptable to both categorical and continuous features, making it well suited to online applications.
    Integrating Symmetry into Differentiable Planning. (arXiv:2206.03674v1 [cs.LG])
    We study how group symmetry helps improve data efficiency and generalization for end-to-end differentiable planning algorithms, specifically on 2D robotic path planning problems: navigation and manipulation. We first formalize the idea from Value Iteration Networks (VINs) of using convolutional networks for path planning, because it avoids explicitly constructing equivalence classes and enables end-to-end planning. We then show that value iteration can always be represented as some convolutional form for (2D) path planning, and name the resulting paradigm Symmetric Planner (SymPlan). In implementation, we use steerable convolution networks to incorporate symmetry. Our algorithms on navigation and manipulation, with given or learned maps, improve training efficiency and generalization performance by large margins over non-equivariant counterparts, VIN and GPPN.
    Deeper-GXX: Deepening Arbitrary GNNs. (arXiv:2110.13798v2 [cs.LG] UPDATED)
    Shallow GNNs tend to have sub-optimal performance dealing with large-scale graphs or graphs with missing features. Therefore, it is necessary to increase the depth (i.e., the number of layers) of GNNs to capture more latent knowledge of the input data. On the other hand, including more layers in GNNs typically decreases their performance due to, e.g., vanishing gradient and oversmoothing. Existing methods (e.g., PairNorm and DropEdge) mainly focus on addressing oversmoothing, but they suffer from some drawbacks such as requiring hard-to-acquire knowledge or having large training randomness. In addition, these methods simply incorporate ResNet to address vanishing gradient. They ignore an important fact: by stacking more and more layers with ResNet architecture, the information collected from faraway neighbors becomes dominant, compared with the information collected from the 1-hop and 2-hop neighbors, thus resulting in severe performance degradation. In this paper, we first go deep into the architecture of ResNet and analyze why ResNet is not best suited for deeper GNNs. Then we propose a new residual architecture to attenuate the negative impact caused by ResNet. To address the drawbacks of these existing methods, we introduce the Topology-guided Graph Contrastive Loss named TGCL. It utilizes node topological information and pulls the connected node pairs closer via contrastive learning regularization to obtain discriminative node representations. Combining the new residual architecture with TGCL, an end-to-end framework named Deeper-GXX is proposed towards deeper GNNs. The extensive experiments on real-world data sets demonstrate the effectiveness and efficiency of Deeper-GXX compared with state-of-the-art baselines.
    Mapping the Internet: Modelling Entity Interactions in Complex Heterogeneous Networks. (arXiv:2104.09650v2 [cs.LG] UPDATED)
    Even though machine learning algorithms already play a significant role in data science, many current methods make unrealistic assumptions about input data. The application of such methods is difficult due to incompatible data formats, or heterogeneous, hierarchical or entirely missing data fragments in the dataset. As a solution, we propose a versatile, unified framework called `HMill' for sample representation, model definition and training. We review in depth a multi-instance paradigm for machine learning that the framework builds on and extends. To theoretically justify the design of key components of HMill, we show an extension of the universal approximation theorem to the set of all functions realized by models implemented in the framework. The text also contains a detailed discussion on technicalities and performance improvements in our implementation, which is published for download under the MIT License. The main asset of the framework is its flexibility, which makes modelling of diverse real-world data sources with the same tool possible. In addition to the standard setting in which a set of attributes is observed for each object individually, we explain how message-passing inference in graphs that represent whole systems of objects can be implemented in the framework. To support our claims, we solve three different problems from the cybersecurity domain using the framework. The first use case concerns IoT device identification from raw network observations. In the second problem, we study how malicious binary files can be classified using a snapshot of the operating system represented as a directed graph. The last provided example is a task of domain blacklist extension through modelling interactions between entities in the network. In all three problems, the solution based on the proposed framework achieves performance comparable to specialized approaches.
    General Greedy De-bias Learning. (arXiv:2112.10572v3 [cs.LG] UPDATED)
    Neural networks often make predictions relying on the spurious correlations from the datasets rather than the intrinsic properties of the task of interest, facing sharp degradation on out-of-distribution (OOD) test data. Existing de-bias learning frameworks try to capture specific dataset bias by annotations but they fail to handle complicated OOD scenarios. Others implicitly identify the dataset bias by specially designed low-capability biased models or losses, but they degrade when the training and testing data are from the same distribution. In this paper, we propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model. The base model is encouraged to focus on examples that are hard to solve with biased models, thus remaining robust against spurious correlations in the test stage. GGD largely improves models' OOD generalization ability on various tasks, but sometimes over-estimates the bias level and degrades on the in-distribution test. We further re-analyze the ensemble process of GGD and introduce the Curriculum Regularization inspired by curriculum learning, which achieves a good trade-off between in-distribution and out-of-distribution performance. Extensive experiments on image classification, adversarial question answering, and visual question answering demonstrate the effectiveness of our method. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
    Decentralized Online Regularized Learning Over Random Time-Varying Graphs. (arXiv:2206.03861v1 [cs.LG])
    We study the decentralized online regularized linear regression algorithm over random time-varying graphs. At each time step, every node runs an online estimation algorithm consisting of an innovation term processing its own new measurement, a consensus term taking a weighted sum of estimations of its own and its neighbors with additive and multiplicative communication noises and a regularization term preventing over-fitting. It is not required that the regression matrices and graphs satisfy special statistical assumptions such as mutual independence, spatio-temporal independence or stationarity. We develop the nonnegative supermartingale inequality of the estimation error, and prove that the estimations of all nodes converge to the unknown true parameter vector almost surely if the algorithm gains, graphs and regression matrices jointly satisfy the sample path spatio-temporal persistence of excitation condition. In particular, this condition holds by choosing appropriate algorithm gains if the graphs are uniformly conditionally jointly connected and conditionally balanced, and the regression models of all nodes are uniformly conditionally spatio-temporally jointly observable, under which the algorithm converges in mean square and almost surely. In addition, we prove that the regret upper bound is $\mathcal O(T^{1-\tau}\ln T)$, where $\tau\in (0.5,1)$ is a constant depending on the algorithm gains.
    pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning. (arXiv:2206.03655v1 [cs.LG])
    Personalized Federated Learning (pFL) has gained increasing attention in recent years due to its success in handling the statistical heterogeneity of FL clients via utilizing and deploying distinct local models. However, standardized evaluation and systematic analysis of diverse pFL methods remain a challenge. Firstly, the highly varied datasets, FL simulation settings and pFL implementations impede the fast and fair pFL comparison. Secondly, the effectiveness and robustness of pFL methods are under-explored in various practical scenarios, such as generalization to new clients and participation of resource-limited clients. Finally, the current pFL literature diverges in the adopted evaluation and ablation protocols. To tackle these challenges, we propose the first comprehensive pFL benchmark, pFL-Bench, for facilitating rapid, reproducible, standardized and thorough pFL evaluation. The proposed benchmark contains 9 datasets in diverse application domains with unified data partition and realistic heterogeneous settings; a modular and easy-to-extend pFL codebase with more than 20 competitive pFL baseline implementations; and systematic evaluations under containerized environments in terms of generalization, fairness, system overhead, and convergence. We highlight the benefits and potential of SOTA pFL methods and hope pFL-Bench enables further pFL research and broad applications that would otherwise be difficult owing to the absence of a dedicated benchmark. The code is released at https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench.
    Predict better with less training data using a QNN. (arXiv:2206.03960v1 [quant-ph])
    Over the past decade, machine learning revolutionized vision-based quality assessment for which convolutional neural networks (CNNs) have now become the standard. In this paper, we consider a potential next step in this development and describe a quanvolutional neural network (QNN) algorithm that efficiently maps classical image data to quantum states and allows for reliable image analysis. We practically demonstrate how to leverage quantum devices in computer vision and how to introduce quantum convolutions into classical CNNs. Dealing with a real world use case in industrial quality control, we implement our hybrid QNN model within the PennyLane framework and empirically observe it to achieve better predictions using much less training data than classical CNNs. In other words, we empirically observe a genuine quantum advantage for an industrial application where the advantage is due to superior data encoding.
    Efficient Resource Allocation with Fairness Constraints in Restless Multi-Armed Bandits. (arXiv:2206.03883v1 [cs.LG])
    Restless Multi-Armed Bandits (RMAB) is an apt model to represent decision-making problems in public health interventions (e.g., tuberculosis, maternal, and child care), anti-poaching planning, sensor monitoring, personalized recommendations and many more. Existing research in RMAB has contributed mechanisms and theoretical results to a wide variety of settings, where the focus is on maximizing expected value. In this paper, we are interested in ensuring that RMAB decision making is also fair to different arms while maximizing expected value. In the context of public health settings, this would ensure that different people and/or communities are fairly represented while making public health intervention decisions. To achieve this goal, we formally define the fairness constraints in RMAB and provide planning and learning methods to solve RMAB in a fair manner. We demonstrate key theoretical properties of fair RMAB and experimentally demonstrate that our proposed methods handle fairness constraints without sacrificing significantly on solution quality.
    Error Rates for Kernel Classification under Source and Capacity Conditions. (arXiv:2201.12655v2 [stat.ML] UPDATED)
    We consider the problem of kernel classification. Works on kernel regression have shown that the rate of decay of the prediction error with the number of samples for a large class of data-sets is well characterized by two quantities: the capacity and source of the data-set. In this work, we compute the decay rates for the misclassification (prediction) error under the Gaussian design, for data-sets satisfying source and capacity assumptions. We derive the rates as a function of the source and capacity coefficients for two standard kernel classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two methods. As a consequence, we find that the known worst-case rates are loose for this class of data-sets. Finally, we show that the rates presented in this work are also observed on real data-sets.
    Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation. (arXiv:2206.03588v1 [cs.LG])
    Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study communication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of three point compressors (3PC) of Richtarik et al. [2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as contractive compression and lazy aggregation, at our disposal to compress prohibitively costly curvature information. Moreover, we discovered several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information.
    Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning. (arXiv:2206.03715v1 [cs.AI])
    Commonsense reasoning systems should be able to generalize to diverse reasoning cases. However, most state-of-the-art approaches depend on expensive data annotations and overfit to a specific benchmark without learning how to perform general semantic reasoning. To overcome these drawbacks, zero-shot QA systems have shown promise as a robust learning scheme by transforming a commonsense knowledge graph (KG) into synthetic QA-form samples for model training. Considering the increasing number of different types of commonsense KGs, this paper aims to extend the zero-shot transfer learning scenario to multiple-source settings, where different KGs can be utilized synergistically. Towards this goal, we propose to mitigate the loss of knowledge from the interference among the different knowledge sources, by developing a modular variant of the knowledge aggregation as a new zero-shot commonsense reasoning framework. Results on five commonsense reasoning benchmarks demonstrate the efficacy of our framework, improving the performance with multiple KGs.
    Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits. (arXiv:2206.03520v1 [stat.ML])
    We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound as well as the asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong criterion for the finite-time regret that is problem-dependent), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal, for exponential family rewards. Moreover, we propose ExpTS$^+$, by adding a greedy exploitation step in addition to the sampling distribution used in ExpTS, to avoid the over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm and achieves the minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
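    For orientation, the baseline that ExpTS modifies can be sketched as follows. This is standard Thompson sampling for Bernoulli rewards with Beta posteriors, not the ExpTS sampling distribution from the paper; the function name `thompson_bernoulli`, the uniform Beta(1, 1) prior, the arm means and the seed are all illustrative assumptions.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson sampling (illustrative baseline, not ExpTS).

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (uniform prior assumed).
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda a: samples[a])
        # Simulate a Bernoulli reward from the (hypothetical) true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.5, 0.8], horizon=2000)
print(pulls)
```

    With well-separated means, most pulls should concentrate on the best arm; the abstract's contribution concerns how the sampling distribution is chosen, which this sketch does not reproduce.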
    Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure. (arXiv:2206.03569v1 [cs.LG])
    The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\Tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs that exhibit low rank structure, where the latent features are unknown. We argue that a natural combination of value iteration and low-rank matrix estimation results in an estimation error that grows doubly exponentially in the horizon $H$. We then provide a new algorithm along with statistical guarantees that efficiently exploits low rank structure given access to a generative model, achieving a sample complexity of $\Tilde{O}\left(d^5(|S|+|A|)\mathrm{poly}(H)/\epsilon^2\right)$ for a rank $d$ setting, which is minimax optimal with respect to the scaling of $|S|, |A|$, and $\epsilon$. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
    White-box Membership Attack Against Machine Learning Based Retinopathy Classification. (arXiv:2206.03584v1 [cs.CR])
    The advances in machine learning (ML) have greatly improved AI-based diagnosis aid systems in medical imaging. However, because these systems rely on collecting medical data specific to individuals, they raise several security issues, especially in terms of privacy. Even when the owner of the images, such as a hospital, puts in place strict privacy protection provisions at the level of its information system, the model trained on its images still holds disclosure potential. The trained model may be accessible to an attacker in two ways: 1) white-box, where the attacker has access to the model architecture and parameters; 2) black-box, where the attacker can only query the model with their own inputs through an appropriate interface. Existing attack methods include feature estimation attacks (FEA), membership inference attacks (MIA), model memorization attacks (MMA) and identification attacks (IA). In this work we focus on MIA against a model that has been trained to detect diabetic retinopathy from retinal images. Diabetic retinopathy is a condition that can cause vision loss and blindness in people who have diabetes. MIA is the process of determining whether a data sample comes from the training data set of a trained ML model or not. From a privacy perspective, in our use case, where a diabetic retinopathy classification model is given to partners that have at their disposal images along with patients' identifiers, inferring the membership status of a data sample can help to determine whether or not a patient has contributed to the training of the model.
    Predictive Modeling of Charge Levels for Battery Electric Vehicles using CNN EfficientNet and IGTD Algorithm. (arXiv:2206.03612v1 [cs.CV])
    Convolutional Neural Networks (CNN) have been a good solution for understanding vast image datasets. As the number of battery-equipped electric vehicles grows globally, there has been much research on understanding which charge levels electric vehicle drivers would choose to charge their vehicles to in order to reach their destination without interruption. We implemented deep learning approaches to analyze the tabular datasets to understand their state of charge and which charge levels they would choose. In addition, we implemented the Image Generator for Tabular Dataset algorithm to utilize tabular datasets as image datasets to train convolutional neural networks. Also, we integrated other CNN architectures such as EfficientNet to show that a CNN is a strong learner for reading information from images that were converted from the tabular dataset, and is able to predict charge levels for battery-equipped electric vehicles. We also evaluated several optimization methods to enhance the learning rate of the models and conducted further analysis on improving the model architecture.
    On gradient descent training under data augmentation with on-line noisy copies. (arXiv:2206.03734v1 [stat.ML])
    In machine learning, data augmentation (DA) is a technique for improving generalization performance. In this paper, we mainly consider gradient descent for linear regression under DA using noisy copies of datasets, in which noise is injected into the inputs. We analyze the situation where random noisy copies are newly generated and used at each epoch; i.e., the case of using on-line noisy copies. The analysis can therefore be viewed as one of noise injection into the training process in the manner of DA; i.e., an on-line version of DA. We derive the averaged behavior of the training process under three settings: full-batch training under the sum of squared errors, and full-batch and mini-batch training under the mean squared error. We show that, in all cases, training for DA with on-line copies is approximately equivalent to ridge regression training whose regularization parameter corresponds to the variance of the injected noise. On the other hand, we show that the learning rate is multiplied by the number of noisy copies plus one in full-batch training under the sum of squared errors and in mini-batch training under the mean squared error; i.e., DA with on-line copies yields an apparent acceleration of training. The apparent acceleration and the regularization effect come from the original part and the noise in the copied data, respectively. These results are confirmed in a numerical experiment. In the numerical experiment, we found that our result applies approximately to the usual off-line DA in the under-parameterized scenario but not in the over-parameterized scenario. Moreover, we experimentally investigated the training process of neural networks under DA with off-line noisy copies and found that our analysis of linear regression can also be applied to neural networks.
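    The link between input-noise augmentation and ridge regression can be checked in a toy case. The sketch below is not the paper's gradient-descent analysis: it assumes 1-D inputs, no intercept, and two-point noise e in {-sigma, +sigma} (variance sigma^2) so that the average over noisy copies is exact. Under those assumptions the noise-averaged least-squares fit coincides with a ridge fit whose penalty is n * sigma^2 in the sum-of-squared-errors convention (sigma^2 per sample). The function names and data are hypothetical.

```python
def ridge_fit(xs, ys, lam):
    """Minimizer of sum_i (y_i - w * x_i)^2 + lam * w^2 for scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def noise_averaged_fit(xs, ys, sigma):
    """Least-squares fit on both noisy copies (x - sigma, y) and (x + sigma, y)."""
    num = 0.0
    den = 0.0
    for x, y in zip(xs, ys):
        for e in (-sigma, sigma):
            num += (x + e) * y      # cross terms in e cancel across the two copies
            den += (x + e) ** 2     # contributes x^2 + sigma^2 per copy on average
    return num / den

xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
sigma = 0.3

w_aug = noise_averaged_fit(xs, ys, sigma)
w_ridge = ridge_fit(xs, ys, lam=len(xs) * sigma ** 2)
w_plain = ridge_fit(xs, ys, lam=0.0)
print(w_aug, w_ridge, w_plain)
```

    The two noisy-copy terms sum to 2 * (x^2 + sigma^2) per sample, which is where the ridge penalty comes from; both fits are shrunk relative to the unregularized `w_plain`.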
    Fairness-Aware PAC Learning from Corrupted Data. (arXiv:2102.06004v3 [cs.LG] UPDATED)
    Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups frequencies in the large data limit.
    Neural Collapse: A Review on Modelling Principles and Generalization. (arXiv:2206.04041v1 [cs.LG])
    With the recent observation of the "Neural Collapse (NC)" phenomenon by Papyan et al., various efforts have been made to model it and analyse its implications. Neural collapse describes that in deep classifier networks, the class features of the final hidden layer associated with training data tend to collapse to the respective class feature means, thus simplifying the behaviour of the last-layer classifier to that of a nearest-class center decision rule. In this work, we analyse the principles which aid in modelling such a phenomenon from the ground up and show how they can build a common understanding of the recently proposed models that try to explain NC. We hope that our analysis presents a multifaceted perspective on modelling NC and aids in forming connections with the generalization capabilities of neural networks. Finally, we conclude by discussing the avenues for further research and propose potential research problems.
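    The nearest-class center decision rule mentioned in the abstract can be sketched in a few lines. This is a generic illustration, assuming Euclidean distance and toy 2-D "last-layer features"; the helper names `class_means` and `nearest_class_center` and the data are hypothetical and not taken from the review.

```python
def class_means(features, labels):
    """Mean feature vector per class."""
    sums = {}
    counts = {}
    for f, c in zip(features, labels):
        if c not in sums:
            sums[c] = [0.0] * len(f)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], f)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_class_center(feature, means):
    """Predict the class whose mean is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(means, key=lambda c: sq_dist(feature, means[c]))

# Toy features that are already tightly clustered around two class means.
features = [[1.0, 0.1], [0.9, -0.1], [-1.0, 0.0], [-1.1, 0.2]]
labels = [0, 0, 1, 1]
means = class_means(features, labels)

pred_a = nearest_class_center([0.8, 0.0], means)
pred_b = nearest_class_center([-0.7, 0.3], means)
print(pred_a, pred_b)
```

    When within-class features collapse onto their means, as NC describes, the trained last-layer classifier and this rule are expected to agree on the training data.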
    Dataset Condensation with Contrastive Signals. (arXiv:2202.02916v2 [cs.CV] UPDATED)
    Recent studies have demonstrated that gradient matching-based dataset synthesis, or dataset condensation (DC), methods can achieve state-of-the-art performance when applied to data-efficient learning tasks. However, in this study, we prove that the existing DC methods can perform worse than the random selection method when task-irrelevant information forms a significant part of the training dataset. We attribute this to the lack of participation of the contrastive signals between the classes resulting from the class-wise gradient matching strategy. To address this problem, we propose Dataset Condensation with Contrastive signals (DCC) by modifying the loss function to enable the DC methods to effectively capture the differences between classes. In addition, we analyze the new loss function in terms of training dynamics by tracking the kernel velocity. Furthermore, we introduce a bi-level warm-up strategy to stabilize the optimization. Our experimental results indicate that while the existing methods are ineffective for fine-grained image classification tasks, the proposed method can successfully generate informative synthetic datasets for the same tasks. Moreover, we demonstrate that the proposed method outperforms the baselines even on benchmark datasets such as SVHN, CIFAR-10, and CIFAR-100. Finally, we demonstrate the high applicability of the proposed method by applying it to continual learning tasks.
    Continuous LWE is as Hard as LWE & Applications to Learning Gaussian Mixtures. (arXiv:2204.02550v2 [cs.CR] UPDATED)
    We show direct and conceptually simple reductions between the classical learning with errors (LWE) problem and its continuous analog, CLWE (Bruna, Regev, Song and Tang, STOC 2021). This allows us to bring to bear the powerful machinery of LWE-based cryptography to the applications of CLWE. For example, we obtain the hardness of CLWE under the classical worst-case hardness of the gap shortest vector problem. Previously, this was known only under quantum worst-case hardness of lattice problems. More broadly, with our reductions between the two problems, any future developments to LWE will also apply to CLWE and its downstream applications. As a concrete application, we show an improved hardness result for density estimation for mixtures of Gaussians. In this computational problem, given sample access to a mixture of Gaussians, the goal is to output a function that estimates the density function of the mixture. Under the (plausible and widely believed) exponential hardness of the classical LWE problem, we show that Gaussian mixture density estimation in $\mathbb{R}^n$ with roughly $\log n$ Gaussian components given $\mathsf{poly}(n)$ samples requires time quasi-polynomial in $n$. Under the (conservative) polynomial hardness of LWE, we show hardness of density estimation for $n^{\epsilon}$ Gaussians for any constant $\epsilon > 0$, which improves on Bruna, Regev, Song and Tang (STOC 2021), who show hardness for at least $\sqrt{n}$ Gaussians under polynomial (quantum) hardness assumptions. Our key technical tool is a reduction from classical LWE to LWE with $k$-sparse secrets where the multiplicative increase in the noise is only $O(\sqrt{k})$, independent of the ambient dimension $n$.
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v12 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correct for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlapping and the presence of unmeasured confounders is desired and left for future work.
    Learning Pruned Structure and Weights Simultaneously from Scratch: an Attention based Approach. (arXiv:2111.02399v2 [cs.LG] UPDATED)
    As a deep learning model typically contains millions of trainable weights, there has been a growing demand for a more efficient network structure with reduced storage space and improved run-time efficiency. Pruning is one of the most popular network compression techniques. In this paper, we propose a novel unstructured pruning pipeline, Attention-based Simultaneous sparse structure and Weight Learning (ASWL). Unlike traditional channel-wise or weight-wise attention mechanisms, ASWL uses an efficient algorithm to calculate the pruning ratio for each layer through layer-wise attention, and both weights for the dense network and the sparse network are tracked so that the pruned structure is simultaneously learned from randomly initialized weights. Our experiments on MNIST, Cifar10, and ImageNet show that ASWL achieves superior pruning results in terms of accuracy, pruning ratio and operating efficiency when compared with state-of-the-art network pruning methods.
    Quantum continual learning of quantum data realizing knowledge backward transfer. (arXiv:2203.14032v2 [quant-ph] UPDATED)
    For the goal of strong artificial intelligence that can mimic human-level intelligence, AI systems should have the ability to adapt to ever-changing scenarios and learn new knowledge continuously without forgetting previously acquired knowledge. When a machine learning model is consecutively trained on multiple tasks that come in sequence, its performance on previously learned tasks may drop dramatically during the learning process of the newly seen task. To avoid this phenomenon, termed catastrophic forgetting, continual learning, also known as lifelong learning, has been proposed and has become one of the most active research areas of machine learning. As quantum machine learning has blossomed in recent years, it is interesting to develop quantum continual learning. This paper focuses on the case of quantum models for quantum data where the computation model and the data to be processed are both quantum. The gradient episodic memory method is incorporated to design a quantum continual learning scheme that overcomes catastrophic forgetting and realizes knowledge backward transfer. Specifically, a sequence of quantum state classification tasks is continually learned by a variational quantum classifier whose parameters are optimized by a classical gradient-based optimizer. The gradient of the current task is projected to the closest gradient, avoiding the increase of the loss at previous tasks, but allowing the decrease. Numerical simulation results show that our scheme not only overcomes catastrophic forgetting, but also realizes knowledge backward transfer, which means the classifier's performance on previous tasks is enhanced rather than compromised while learning a new task.
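    The gradient projection step described above can be illustrated classically. The sketch below assumes a single stored previous-task gradient, in which case the closest vector with a non-negative inner product has a closed form; with several previous tasks, gradient episodic memory solves a small quadratic program instead, which is not shown. The function name `project_gradient` and the example vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_gradient(g, g_prev):
    """Closest vector to g (in Euclidean norm) satisfying <result, g_prev> >= 0.

    A non-negative inner product means that, to first order, a descent step
    along the result does not increase the loss on the previous task.
    """
    inner = dot(g, g_prev)
    if inner >= 0:
        # No conflict with the previous task: keep the gradient unchanged.
        return list(g)
    # Conflict: remove the component of g that points against g_prev.
    scale = inner / dot(g_prev, g_prev)
    return [gi - scale * pi for gi, pi in zip(g, g_prev)]

conflicting = project_gradient([1.0, -2.0], [0.0, 1.0])
compatible = project_gradient([1.0, 1.0], [0.0, 1.0])
print(conflicting, compatible)
```

    In the conflicting case the projected gradient is orthogonal to the previous-task gradient, so the update neither helps nor hurts that task to first order; in the compatible case the update is allowed to decrease the previous-task loss.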
    Narrowing the Coordinate-frame Gap in Behavior Prediction Models: Distillation for Efficient and Accurate Scene-centric Motion Forecasting. (arXiv:2206.03970v1 [cs.CV])
    Behavior prediction models have proliferated in recent years, especially in the popular real-world robotics application of autonomous driving, where representing the distribution over possible futures of moving agents is essential for safe and comfortable motion planning. In these models, the choice of coordinate frames to represent inputs and outputs has crucial trade-offs which broadly fall into one of two categories. Agent-centric models transform inputs and perform inference in agent-centric coordinates. These models are intrinsically invariant to translation and rotation between scene elements, are best-performing on public leaderboards, but scale quadratically with the number of agents and scene elements. Scene-centric models use a fixed coordinate system to process all agents. This gives them the advantage of sharing representations among all agents, offering efficient amortized inference computation which scales linearly with the number of agents. However, these models have to learn invariance to translation and rotation between scene elements, and typically underperform agent-centric models. In this work, we develop knowledge distillation techniques between probabilistic motion forecasting models, and apply these techniques to close the gap in performance between agent-centric and scene-centric models. This improves scene-centric model performance by 13.2% on the public Argoverse benchmark, 7.8% on Waymo Open Dataset and up to 9.4% on a large In-House dataset. These improved scene-centric models rank highly in public leaderboards and are up to 15 times more efficient than their agent-centric teacher counterparts in busy scenes.
    Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness. (arXiv:2202.07554v2 [cs.LG] UPDATED)
    Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the expert and bandit setting. Our results extend this to the online convex optimization framework. In the fully i.i.d. case, our bounds match the rates one would expect from results in stochastic acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.  ( 2 min )
    Model-Free $\mu$ Synthesis via Adversarial Reinforcement Learning. (arXiv:2111.15537v2 [cs.LG] UPDATED)
    Motivated by the recent empirical success of policy-based reinforcement learning (RL), there has been a research trend studying the performance of policy-based RL methods on standard control benchmark problems. In this paper, we examine the effectiveness of policy-based RL methods on an important robust control problem, namely $\mu$ synthesis. We build a connection between robust adversarial RL and $\mu$ synthesis, and develop a model-free version of the well-known $DK$-iteration for solving state-feedback $\mu$ synthesis with static $D$-scaling. In the proposed algorithm, the $K$ step mimics the classical central path algorithm via incorporating a recently-developed double-loop adversarial RL method as a subroutine, and the $D$ step is based on model-free finite difference approximation. Extensive numerical study is also presented to demonstrate the utility of our proposed model-free algorithm. Our study sheds new light on the connections between adversarial RL and robust control.  ( 2 min )
    Diversity vs. Recognizability: Human-like generalization in one-shot generative models. (arXiv:2205.10370v2 [cs.AI] UPDATED)
    Robust generalization to new concepts has long remained a distinctive feature of human intelligence. However, recent progress in deep generative models has now led to neural architectures capable of synthesizing novel instances of unknown visual concepts from a single training example. Yet, a more precise comparison between these models and humans is not possible because existing performance metrics for generative models (i.e., FID, IS, likelihood) are not appropriate for the one-shot generation scenario. Here, we propose a new framework to evaluate one-shot generative models along two axes: sample recognizability vs. diversity (i.e., intra-class variability). Using this framework, we perform a systematic evaluation of representative one-shot generative models on the Omniglot handwritten dataset. We first show that GAN-like and VAE-like models fall on opposite ends of the diversity-recognizability space. Extensive analyses of the effect of key model parameters further revealed that spatial attention and context integration have a linear contribution to the diversity-recognizability trade-off. In contrast, disentanglement transports the model along a parabolic curve that could be used to maximize recognizability. Using the diversity-recognizability framework, we were able to identify models and parameters that closely approximate human data.  ( 2 min )
    Few-Shot Audio-Visual Learning of Environment Acoustics. (arXiv:2206.04006v1 [cs.SD])
    Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and--in a major departure from traditional methods--generalizing to novel environments in a few-shot manner.  ( 2 min )
    Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations. (arXiv:2008.02965v2 [cs.LG] UPDATED)
    Using weight decay to penalize the L2 norms of weights in neural networks has been a standard training practice to regularize the complexity of networks. In this paper, we show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with positively homogeneous activation functions, such as linear, ReLU and max-pooling functions. As a result of homogeneity, functions specified by the networks are invariant to the shifting of weight scales between layers. The ineffective regularizers are sensitive to such shifting and thus poorly regularize the model capacity, leading to overfitting. To address this shortcoming, we propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network. The derived regularizer is an upper bound for the input gradient of the network so minimizing the improved regularizer also benefits the adversarial robustness. Residual connections are also considered and we show that our regularizer also forms an upper bound to input gradients of such a residual network. We demonstrate the efficacy of our proposed regularizer on various datasets and neural network architectures at improving generalization and adversarial robustness.  ( 2 min )
    Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-implementation Guidelines. (arXiv:2206.03944v1 [cs.LG])
    Online reinforcement learning (RL) algorithms are increasingly used to personalize digital interventions in the fields of mobile health and online education. Common challenges in designing and testing an RL algorithm in these settings include ensuring the RL algorithm can learn and run stably under real-time constraints, and accounting for the complexity of the environment, e.g., a lack of accurate mechanistic models for the user dynamics. To guide how one can tackle these challenges, we extend the PCS (Predictability, Computability, Stability) framework, a data science framework that incorporates best practices from machine learning and statistics in supervised learning (Yu and Kumbier, 2020), to the design of RL algorithms for the digital interventions setting. Further, we provide guidelines on how to design simulation environments, a crucial tool for evaluating RL candidate algorithms using the PCS framework. We illustrate the use of the PCS framework for designing an RL algorithm for Oralytics, a mobile health study aiming to improve users' tooth-brushing behaviors through the personalized delivery of intervention messages. Oralytics will go into the field in late 2022.  ( 2 min )
    A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima. (arXiv:2203.01123v2 [math.OC] UPDATED)
    Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, and meta-learning. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such problems are applicable only to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization to constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple inner minima challenge, but also features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.  ( 2 min )
    Using Mixed-Effect Models to Learn Bayesian Networks from Related Data Sets. (arXiv:2206.03743v1 [stat.ML])
    We commonly assume that data are a homogeneous set of observations when learning the structure of Bayesian networks. However, they often comprise different data sets that are related but not homogeneous because they have been collected in different ways or from different populations. In our previous work (Azzimonti, Corani and Scutari, 2021), we proposed a closed-form Bayesian Hierarchical Dirichlet score for discrete data that pools information across related data sets to learn a single encompassing network structure, while taking into account the differences in their probabilistic structures. In this paper, we provide an analogous solution for learning a Bayesian network from continuous data using mixed-effects models to pool information across the related data sets. We study its structural, parametric, predictive and classification accuracy and we show that it outperforms both conditional Gaussian Bayesian networks (that do not perform any pooling) and classical Gaussian Bayesian networks (that disregard the heterogeneous nature of the data). The improvement is marked for low sample sizes and for unbalanced data sets.
    A Unified Convergence Theorem for Stochastic Optimization Methods. (arXiv:2206.03907v1 [math.OC])
    In this work, we provide a fundamental unified convergence theorem for deriving expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires verifying several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.
    Stabilizing Voltage in Power Distribution Networks via Multi-Agent Reinforcement Learning with Transformer. (arXiv:2206.03721v1 [cs.MA])
    The increased integration of renewable energy poses a slew of technical challenges for the operation of power distribution networks. Among them, voltage fluctuations caused by the instability of renewable energy are receiving increasing attention. Using multi-agent reinforcement learning (MARL) algorithms to coordinate multiple control units in the grid, which can handle rapid changes in power systems, has recently been widely studied for the active voltage control task. However, existing MARL-based approaches ignore the unique nature of the grid and achieve limited performance. In this paper, we introduce the transformer architecture to extract representations adapted to power network problems and propose a Transformer-based Multi-Agent Actor-Critic framework (T-MAAC) to stabilize voltage in power distribution networks. In addition, we adopt a novel auxiliary-task training process tailored to the voltage control task, which improves sample efficiency and facilitates the representation learning of the transformer-based model. We couple T-MAAC with different multi-agent actor-critic algorithms, and the consistent improvements on the active voltage control task demonstrate the effectiveness of the proposed method.
    Subject Granular Differential Privacy in Federated Learning. (arXiv:2206.03617v1 [cs.LG])
    This paper introduces subject granular privacy in the Federated Learning (FL) setting, where a subject is an individual whose private information is embodied by several data items either confined within a single federation user or distributed across multiple federation users. We formally define the notion of subject level differential privacy for FL. We propose three new algorithms that enforce subject level DP. Two of these algorithms are based on notions of user level local differential privacy (LDP) and group differential privacy respectively. The third algorithm is based on a novel idea of hierarchical gradient averaging (HiGradAvgDP) for subjects participating in a training mini-batch. We also introduce horizontal composition of privacy loss for a subject across multiple federation users. We show that horizontal composition is equivalent to sequential composition in the worst case. We prove the subject level DP guarantee for all our algorithms and empirically analyze them using the FEMNIST and Shakespeare datasets. Our evaluation shows that, of our three algorithms, HiGradAvgDP delivers the best model performance, approaching that of a model trained using a DP-SGD based algorithm that provides a weaker item level privacy guarantee.  ( 2 min )
    Metric Based Few-Shot Graph Classification. (arXiv:2206.03695v1 [cs.LG])
    Many modern deep-learning techniques do not work without enormous datasets. At the same time, several fields demand methods that work when data is scarce. This problem is even more complex when the samples have varying structures, as in the case of graphs. Graph representation learning techniques have recently proven successful in a variety of domains. Nevertheless, the employed architectures perform miserably when faced with data scarcity. On the other hand, few-shot learning allows modern deep learning models to be employed in scarce data regimes without sacrificing their effectiveness. In this work, we tackle the problem of few-shot graph classification, showing that equipping a simple distance metric learning baseline with a state-of-the-art graph embedder yields competitive results on the task. While the simplicity of the architecture is enough to outperform more complex ones, it also allows straightforward additions. To this end, we show that additional improvements may be obtained by encouraging a task-conditioned embedding space. Finally, we propose a MixUp-based online data augmentation technique acting in the latent space and show its effectiveness on the task.  ( 2 min )
    Solving the Spike Feature Information Vanishing Problem in Spiking Deep Q Network with Potential Based Normalization. (arXiv:2206.03654v1 [cs.NE])
    Brain-inspired spiking neural networks (SNNs) have been successfully applied to many pattern recognition domains. SNN-based deep structures have achieved considerable results in perceptual tasks such as image classification and target detection. However, the application of deep SNNs to reinforcement learning (RL) tasks remains an open problem. Although there have been previous studies on combining SNNs and RL, most of them focus on robotic control problems with shallow networks or use ANN-SNN conversion methods to implement a spiking deep Q network (SDQN). In this work, we mathematically analyze the problem of vanishing spiking signal features in SDQN and propose a potential-based layer normalization (pbLN) method to directly train spiking deep Q networks. Experiments show that, compared with state-of-the-art ANN-SNN conversion methods and other SDQN works, the proposed pbLN spiking deep Q networks (PL-SDQN) achieve better performance on Atari game tasks.  ( 2 min )
    EiX-GNN : Concept-level eigencentrality explainer for graph neural networks. (arXiv:2206.03491v1 [cs.AI])
    Explaining is a human knowledge transfer process regarding a phenomenon between an explainer and an explainee. Each word used to explain the phenomenon must be carefully selected by the explainer in accordance with the explainee's current knowledge of the phenomenon and with the phenomenon itself, so that the explainee understands it well. Nowadays, deep models, especially graph neural networks, play a major role in daily life, even in critical applications. In such contexts, these models need to be highly interpretable by humans, also referred to as being explainable, in order to improve trust in their use in sensitive cases. Explaining is also a human-dependent task, and methods that explain deep model behavior must take these social concerns into account to provide useful, high-quality explanations. Current explanation methods often omit this social aspect and focus only on the signal aspect of the question. In this contribution, we propose a reliable social-aware explanation method suited to graph neural networks that includes this social feature as a modular concept generator and leverages both signal and graph domain aspects through an eigencentrality concept ordering approach. Besides taking into account the human-dependent aspect underlying any explanation process, our method also achieves high scores on state-of-the-art objective metrics for assessing explanation methods for graph neural network models.  ( 2 min )
    Autoregressive Perturbations for Data Poisoning. (arXiv:2206.03693v1 [cs.LG])
    The prevalence of data scraping from social media as a means to obtain datasets has led to growing concerns regarding unauthorized use of data. Data poisoning attacks have been proposed as a bulwark against scraping, as they make data "unlearnable" by adding small, imperceptible perturbations. Unfortunately, existing methods require knowledge of both the target architecture and the complete dataset so that a surrogate network can be trained, the parameters of which are used to generate the attack. In this work, we introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset. The proposed AR perturbations are generic, can be applied across different datasets, and can poison different architectures. Compared to existing unlearnable methods, our AR poisons are more resistant against common defenses such as adversarial training and strong data augmentations. Our analysis further provides insight into what makes an effective data poison.  ( 2 min )
    Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping. (arXiv:2206.03633v1 [cs.LG])
    In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regards to the importance of various ingredients of these approaches. In this paper, we aim to address the benefits of two ingredients -- prior functions and bootstrapping -- which have come into question. We show that prior functions can significantly improve an ensemble agent's joint predictions across inputs and that bootstrapping affords additional benefits if the signal-to-noise ratio varies across inputs. Our claims are justified by both theoretical and experimental results.  ( 2 min )
    A Penny for Your (visual) Thoughts: Self-Supervised Reconstruction of Natural Movies from Brain Activity. (arXiv:2206.03544v1 [cs.CV])
    Reconstructing natural videos from fMRI brain recordings is very challenging, for two main reasons: (i) As fMRI data acquisition is difficult, we only have a limited amount of supervised samples, which is not enough to cover the huge space of natural videos; and (ii) The temporal resolution of fMRI recordings is much lower than the frame rate of natural videos. In this paper, we propose a self-supervised approach for natural movie reconstruction. By employing cycle consistency over Encoding-Decoding natural videos, we can: (i) exploit the full framerate of the training videos, and not be limited only to clips that correspond to fMRI recordings; (ii) exploit massive amounts of external natural videos which the subjects never saw inside the fMRI machine. These enable increasing the applicable training data by several orders of magnitude, introducing natural video priors to the decoding network, as well as temporal coherence. Our approach significantly outperforms competing methods, since those train only on the limited supervised data. We further introduce a new and simple temporal prior of natural videos, which, when folded into our fMRI decoder, further allows us to reconstruct videos at a higher framerate (HFR) of up to 8x the original fMRI sample rate.  ( 2 min )
    Towards Scalable Hyperbolic Neural Networks using Taylor Series Approximations. (arXiv:2206.03610v1 [cs.LG])
    Hyperbolic networks have shown prominent improvements over their Euclidean counterparts in several areas involving hierarchical datasets in various domains such as computer vision, graph analysis, and natural language processing. However, their adoption in practice remains restricted due to (i) non-scalability on accelerated deep learning hardware, (ii) vanishing gradients due to the closure of hyperbolic space, and (iii) information loss due to frequent mapping between local tangent space and fully hyperbolic space. To tackle these issues, we propose the approximation of hyperbolic operators using Taylor series expansions, which allows us to reformulate the computationally expensive tangent and cosine hyperbolic functions into their polynomial equivariants which are more efficient. This allows us to retain the benefits of preserving the hierarchical anatomy of the hyperbolic space, while maintaining the scalability over current accelerated deep learning infrastructure. The polynomial formulation also enables us to utilize the advancements in Euclidean networks such as gradient clipping and ReLU activation to avoid vanishing gradients and remove errors due to frequent switching between tangent space and hyperbolic space. Our empirical evaluation on standard benchmarks in the domain of graph analysis and computer vision shows that our polynomial formulation is as scalable as Euclidean architectures, both in terms of memory and time complexity, while providing results as effective as hyperbolic models. Moreover, our formulation also shows a considerable improvement over its baselines due to our solution to vanishing gradients and information loss.  ( 2 min )
    Asymptotic Stability in Reservoir Computing. (arXiv:2206.03854v1 [cs.NE])
    Reservoir Computing is a class of Recurrent Neural Networks with internal weights fixed at random. Stability relates to the sensitivity of the network state to perturbations. It is an important property in Reservoir Computing as it directly impacts performance. In practice, it is desirable to stay in a stable regime, where the effect of perturbations does not explode exponentially, but also close to the chaotic frontier where reservoir dynamics are rich. Open questions remain today regarding input regularization and discontinuous activation functions. In this work, we use the recurrent kernel limit to draw new insights on stability in reservoir computing. This limit corresponds to large reservoir sizes, and it already becomes relevant for reservoirs with a few hundred neurons. We obtain a quantitative characterization of the frontier between stability and chaos, which can greatly benefit hyperparameter tuning. In a broader sense, our results contribute to understanding the complex dynamics of Recurrent Neural Networks.  ( 2 min )
    DeepCAVE: An Interactive Analysis Tool for Automated Machine Learning. (arXiv:2206.03493v1 [cs.LG])
    Automated Machine Learning (AutoML) is used more than ever before to support users in determining efficient hyperparameters, neural architectures, or even full machine learning pipelines. However, users tend to mistrust the optimization process and its results due to a lack of transparency, making manual tuning still widespread. We introduce DeepCAVE, an interactive framework to analyze and monitor state-of-the-art optimization procedures for AutoML easily and ad hoc. By aiming for full and accessible transparency, DeepCAVE builds a bridge between users and AutoML and contributes to establishing trust. Our framework's modular and easy-to-extend nature provides users with automatically generated text, tables, and graphic visualizations. We show the value of DeepCAVE in an exemplary use-case of outlier detection, in which our framework makes it easy to identify problems, compare multiple runs and interpret optimization processes. The package is freely available on GitHub https://github.com/automl/DeepCAVE.  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v1 [cs.LG])
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm, where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performance on various personalised federated learning tasks.  ( 2 min )
    Towards Practical Differential Privacy in Data Analysis: Understanding the Effect of Epsilon on Utility in Private ERM. (arXiv:2206.03488v1 [cs.CR])
    In this paper, we focus our attention on private Empirical Risk Minimization (ERM), which is one of the most commonly used data analysis methods. We take the first step towards solving the above problem by theoretically exploring the effect of epsilon (the parameter of differential privacy that determines the strength of the privacy guarantee) on the utility of the learning model. We trace the change of utility with modification of epsilon and reveal an established relationship between epsilon and utility. We then formalize this relationship and propose a practical approach for estimating the utility under an arbitrary value of epsilon. Both theoretical analysis and experimental results demonstrate high estimation accuracy and broad applicability of our approach in practical applications. As providing algorithms with strong utility guarantees that also give privacy when possible becomes more and more accepted, our approach would have high practical value and may be adopted by companies and organizations that would like to preserve privacy but are unwilling to compromise on utility.  ( 2 min )
  • Open

    Error Rates for Kernel Classification under Source and Capacity Conditions. (arXiv:2201.12655v2 [stat.ML] UPDATED)
    We consider the problem of kernel classification. Works on kernel regression have shown that the rate of decay of the prediction error with the number of samples for a large class of data-sets is well characterized by two quantities: the capacity and source of the data-set. In this work, we compute the decay rates for the misclassification (prediction) error under the Gaussian design, for data-sets satisfying source and capacity assumptions. We derive the rates as a function of the source and capacity coefficients for two standard kernel classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two methods. As a consequence, we find that the known worst-case rates are loose for this class of data-sets. Finally, we show that the rates presented in this work are also observed on real data-sets.  ( 2 min )
    Neural Bandit with Arm Group Graph. (arXiv:2206.03644v1 [cs.LG])
    Contextual bandits aim to identify, among a set of arms, the optimal one with the highest reward based on their contextual information. Motivated by the fact that arms usually exhibit group behaviors and that mutual impacts exist among groups, we introduce a new model, Arm Group Graph (AGG), where the nodes represent the groups of arms and the weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, where neural networks are designed to estimate rewards, and we propose to utilize graph neural networks (GNN) to learn the representations of arm groups with correlations. To solve the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) built on neural networks (exploitation) for exploration. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and provide the convergence analysis of GNN with fully-connected layers, which may be of independent interest. Finally, we conduct extensive experiments against state-of-the-art baselines on multiple public data sets, showing the effectiveness of the proposed algorithm.  ( 2 min )
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v12 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correct for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlapping and the presence of unmeasured confounders is desirable and left for future work.
    Resolving the Human Subjects Status of Machine Learning's Crowdworkers. (arXiv:2206.04039v1 [cs.CY])
    In recent years, machine learning (ML) has come to rely more heavily on crowdworkers, both for building bigger datasets and for addressing research questions requiring human interaction or judgment. Owing to the diverse tasks performed by crowdworkers, and the myriad ways the resulting datasets are used, it can be difficult to determine when these individuals are best thought of as workers, versus as human subjects. These difficulties are compounded by conflicting policies, with some institutions and researchers treating all ML crowdwork as human subjects research, and other institutions holding that ML crowdworkers rarely constitute human subjects. Additionally, few ML papers involving crowdwork mention IRB oversight, raising the prospect that many might not be in compliance with ethical and regulatory requirements. In this paper, we focus on research in natural language processing to investigate the appropriate designation of crowdsourcing studies and the unique challenges that ML research poses for research oversight. Crucially, under the U.S. Common Rule, these judgments hinge on determinations of "aboutness", both whom (or what) the collected data is about and whom (or what) the analysis is about. We highlight two challenges posed by ML: (1) the same set of workers can serve multiple roles and provide many sorts of information; and (2) compared to the life sciences and social sciences, ML research tends to embrace a dynamic workflow, where research questions are seldom stated ex ante and data sharing opens the door for future studies to ask questions about different targets from the original study. In particular, our analysis exposes a potential loophole in the Common Rule, where researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies. We offer several policy recommendations to address these concerns.
    Fairness-Aware PAC Learning from Corrupted Data. (arXiv:2102.06004v3 [cs.LG] UPDATED)
    Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups frequencies in the large data limit.
    Estimation of Predictive Performance in High-Dimensional Data Settings using Learning Curves. (arXiv:2206.03825v1 [stat.ME])
    In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
    Decoupled Self-supervised Learning for Non-Homophilous Graphs. (arXiv:2206.03601v1 [cs.LG])
    In this paper, we study the problem of conducting self-supervised learning for node representation learning on non-homophilous graphs. Existing self-supervised learning methods typically assume the graph is homophilous, where linked nodes often belong to the same class or have similar features. However, such assumptions of homophily do not always hold true in real-world graphs. We address this problem by developing a decoupled self-supervised learning (DSSL) framework for graph neural networks. DSSL imitates a generative process of nodes and links from latent variable modeling of the semantic structure, which decouples different underlying semantics between different neighborhoods into the self-supervised node learning process. Our DSSL framework is agnostic to the encoders and does not need prefabricated augmentations, and is thus flexible across different graphs. To effectively optimize the framework with latent variables, we derive the evidence lower-bound of the self-supervised objective and develop a scalable training algorithm with variational inference. We provide a theoretical analysis to justify that DSSL enjoys better downstream performance. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework achieves significantly better performance than competitive self-supervised learning baselines.
    Classification of Stochastic Processes with Topological Data Analysis. (arXiv:2206.03973v1 [stat.ML])
    In this study, we examine if engineered topological features can distinguish time series sampled from different stochastic processes with different noise characteristics, in both balanced and unbalanced sampling schemes. We compare our classification results against the results of the same classification tasks built on statistical and raw features. We conclude that in classification tasks of time series, different machine learning models built on engineered topological features perform consistently better than those built on standard statistical and raw features.
    Out-of-Distribution Detection with Class Ratio Estimation. (arXiv:2206.03955v1 [stat.ML])
    Density-based out-of-distribution (OOD) detection has recently been shown to be unreliable for the task of detecting OOD images. Various density ratio based approaches achieve good empirical performance; however, these methods typically lack a principled probabilistic modelling explanation. In this work, we propose to unify density ratio based methods under a novel framework that builds energy-based models and employs differing base distributions. Under our framework, the density ratio can be viewed as the unnormalized density of an implicit semantic distribution. Further, we propose to directly estimate the density ratio of a data sample through class ratio estimation. We report competitive results on OOD image problems in comparison with recent work that alternatively requires training of deep generative models for the task. Our approach enables a simple yet effective path towards solving the OOD detection problem.
    Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations. (arXiv:2008.02965v2 [cs.LG] UPDATED)
    Using weight decay to penalize the L2 norms of weights in neural networks has been a standard training practice to regularize the complexity of networks. In this paper, we show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with positively homogeneous activation functions, such as linear, ReLU and max-pooling functions. As a result of homogeneity, functions specified by the networks are invariant to the shifting of weight scales between layers. The ineffective regularizers are sensitive to such shifting and thus poorly regularize the model capacity, leading to overfitting. To address this shortcoming, we propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network. The derived regularizer is an upper bound for the input gradient of the network so minimizing the improved regularizer also benefits the adversarial robustness. Residual connections are also considered and we show that our regularizer also forms an upper bound to input gradients of such a residual network. We demonstrate the efficacy of our proposed regularizer on various datasets and neural network architectures at improving generalization and adversarial robustness.
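    The scale-shifting invariance described in this abstract can be checked numerically. The following is a minimal sketch, not the paper's regularizer: it assumes a toy two-layer ReLU network without biases, and the helper names (`forward`, `l2_penalty`, `scale`) and the weights are hypothetical choices for illustration. Multiplying the first layer by c and dividing the second by c leaves the function unchanged while the weight-decay penalty changes.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x):
    # Toy two-layer ReLU network without biases (an assumption of this sketch).
    return matvec(W2, relu(matvec(W1, x)))

def l2_penalty(*weights):
    # Sum of squared weights, i.e. the quantity weight decay penalizes.
    return sum(w * w for W in weights for row in W for w in row)

def scale(W, c):
    return [[c * w for w in row] for row in W]

W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
x = [1.0, 2.0]
c = 10.0

# Shift scale between layers: ReLU is positively homogeneous, so the
# network output is unchanged, but the L2 penalty is not.
W1s, W2s = scale(W1, c), scale(W2, 1.0 / c)
out_a, out_b = forward(W1, W2, x), forward(W1s, W2s, x)
pen_a, pen_b = l2_penalty(W1, W2), l2_penalty(W1s, W2s)
print(out_a, out_b, pen_a, pen_b)
```

    The two outputs agree up to floating-point error while the penalties differ, which is the sensitivity to weight-scale shifting that the abstract attributes to weight decay.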
    Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits. (arXiv:2206.03520v1 [stat.ML])
    We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound as well as the asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong criterion for the finite-time regret that is problem-dependent), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal, for exponential family rewards. Moreover, we propose ExpTS$^+$, by adding a greedy exploitation step in addition to the sampling distribution used in ExpTS, to avoid the over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm and achieves the minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
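    For readers unfamiliar with the baseline, standard Thompson sampling for Bernoulli rewards can be sketched as follows. This is the textbook Beta-Bernoulli variant, not ExpTS or ExpTS$^+$, whose sampling distributions are not given in the abstract; the function name, arm means, horizon and seed are assumptions made for illustration.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson sampling (not the paper's ExpTS)."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one sample per arm from its Beta(1 + s, 1 + f) posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        # Play the arm whose posterior sample is largest.
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical three-armed problem.
pulls = thompson_bernoulli([0.2, 0.5, 0.8], horizon=2000)
print(pulls)
```

    Sampling from the posterior rather than using its mean is what drives exploration here; the abstract's contribution concerns replacing that sampling distribution for general one-dimensional exponential families.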
    Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping. (arXiv:2206.03633v1 [cs.LG])
    In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regards to the importance of various ingredients of these approaches. In this paper, we aim to address the benefits of two ingredients -- prior functions and bootstrapping -- which have come into question. We show that prior functions can significantly improve an ensemble agent's joint predictions across inputs and that bootstrapping affords additional benefits if the signal-to-noise ratio varies across inputs. Our claims are justified by both theoretical and experimental results.
    Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (arXiv:2202.01136v3 [cs.LG] UPDATED)
    Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.
    Using Mixed-Effect Models to Learn Bayesian Networks from Related Data Sets. (arXiv:2206.03743v1 [stat.ML])
    We commonly assume that data are a homogeneous set of observations when learning the structure of Bayesian networks. However, they often comprise different data sets that are related but not homogeneous because they have been collected in different ways or from different populations. In our previous work (Azzimonti, Corani and Scutari, 2021), we proposed a closed-form Bayesian Hierarchical Dirichlet score for discrete data that pools information across related data sets to learn a single encompassing network structure, while taking into account the differences in their probabilistic structures. In this paper, we provide an analogous solution for learning a Bayesian network from continuous data using mixed-effects models to pool information across the related data sets. We study its structural, parametric, predictive and classification accuracy and we show that it outperforms both conditional Gaussian Bayesian networks (that do not perform any pooling) and classical Gaussian Bayesian networks (that disregard the heterogeneous nature of the data). The improvement is marked for low sample sizes and for unbalanced data sets.
    Neural Diffusion Processes. (arXiv:2206.03992v1 [stat.ML])
    Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models, that learn to sample from distributions over functions. Using a novel attention block, we can incorporate properties of stochastic processes, such as exchangeability, directly into the NDP's architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior of a Gaussian process. This enables a variety of downstream tasks, including hyperparameter marginalisation and Bayesian optimisation.
    Inverse Contextual Bandits: Learning How Behavior Evolves over Time. (arXiv:2107.06317v3 [cs.LG] UPDATED)
    Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantations has progressed over the years, a pertinent question is: How have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, as well as operating in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior. Finally, using both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, as well as benchmarking and validating its accuracy.
    $p$-Sparsified Sketches for Fast Multiple Output Kernel Methods. (arXiv:2206.03827v1 [stat.ML])
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists of looking for solutions in a subspace of reduced dimension, is a widely studied approach to alleviate this numerical burden. However, fast sketching strategies, such as non-adaptive subsampling, significantly degrade the guarantees of the algorithms, while theoretically accurate sketches, such as the Gaussian one, remain relatively slow in practice. In this paper, we introduce the $p$-sparsified sketches, which combine the benefits of both approaches to achieve a good tradeoff between statistical accuracy and computational efficiency. To support our method, we derive excess risk bounds for both single and multiple output problems, with generic Lipschitz losses, providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. We also provide empirical evidence of the superiority of our sketches over recent SOTA approaches.
    An Information-Theoretic Framework for Supervised Learning. (arXiv:2203.00246v5 [cs.LG] UPDATED)
    Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learning under a different lens. In this paper, we propose a novel information-theoretic framework with its own notions of regret and sample complexity for analyzing the data requirements of machine learning. With our framework, we first work through some classical examples such as scalar estimation and linear regression to build intuition and introduce general techniques. Then, we use the framework to study the sample complexity of learning from data generated by deep sign neural networks, deep ReLU neural networks, and deep networks that are infinitely wide but have a bounded sum of weights. For sign neural networks, we recover sample-complexity bounds that follow from VC-dimension based arguments. For the latter two neural network environments, we establish new results that suggest that the sample complexity of learning under these data generating processes is at most linear and quadratic, respectively, in network depth.
    Modeling Disagreement in Automatic Data Labelling for Semi-Supervised Learning in Clinical Natural Language Processing. (arXiv:2205.14761v2 [cs.LG] UPDATED)
    Computational models providing accurate estimates of their uncertainty are crucial for risk management associated with decision making in healthcare contexts. This is especially true since many state-of-the-art systems are trained using data that has been labelled automatically (self-supervised mode) and tend to overfit. In this work, we investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports. This problem remains understudied for Natural Language Processing in the healthcare domain. We demonstrate that Gaussian Processes (GPs) provide superior performance in quantifying the risks of 3 uncertainty labels based on the negative log predictive probability (NLPP) evaluation metric and mean maximum predicted confidence levels (MMPCL), whilst retaining strong predictive performance.
    Boosting the Confidence of Generalization for $L_2$-Stable Randomized Learning Algorithms. (arXiv:2206.03834v1 [stat.ML])
    Exponential generalization bounds with near-tight rates have recently been established for uniformly stable learning algorithms. The notion of uniform stability, however, is stringent in the sense that it is invariant to the data-generating distribution. Under the weaker and distribution dependent notions of stability such as hypothesis stability and $L_2$-stability, the literature suggests that only polynomial generalization bounds are possible in general cases. The present paper addresses this long standing tension between these two regimes of results and makes progress towards relaxing it inside a classic framework of confidence-boosting. To this end, we first establish an in-expectation first moment generalization error bound for potentially randomized learning algorithms with $L_2$-stability, based on which we then show that a properly designed subbagging process leads to near-tight exponential generalization bounds over the randomness of both data and algorithm. We further substantialize these generic results to stochastic gradient descent (SGD) to derive improved high-probability generalization bounds for convex or non-convex optimization problems with natural time decaying learning rates, which have not been possible to prove with the existing hypothesis stability or uniform stability based results.
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v1 [cs.LG])
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio $1-\mu$ and single-view samples of ratio $\mu$, where multi/single-view samples have multiple/single discriminative semantics. Then for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data; and 2) a convolution kernel captures at most one semantic. Accordingly, in the downstream supervised fine-tuning, most semantics would be captured and different semantics would not be fused together. This helps the downstream fine-tuned network to easily establish the relation between kernels and semantic class labels. In this way, the fine-tuned encoder in MRP provably achieves zero test error with high probability for both multi-view and single-view test data. In contrast, as proved by [3], conventional SL can only obtain a test accuracy of around $0.5\mu$ for single-view test data. These results together explain the benefits of MRP in downstream tasks. Experimental results testify to multi-view data assumptions and our theoretical implications.
    How unfair is private learning? (arXiv:2206.03985v1 [cs.LG])
    As machine learning algorithms are deployed on sensitive data in critical decision making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and result in higher accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR and CelebA), and tabular (Law School) datasets and learning algorithms.
    Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness. (arXiv:2202.07554v2 [cs.LG] UPDATED)
    Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the expert and bandit setting. Our results extend this to the online convex optimization framework. In the fully i.i.d. case, our bounds match the rates one would expect from results in stochastic acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.
    Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions. (arXiv:2108.11328v2 [stat.ML] UPDATED)
    Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.
    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. (arXiv:2206.03688v1 [cs.LG])
    A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end to end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
    Federated Learning Algorithms for Generalized Mixed-effects Model (GLMM) on Horizontally Partitioned Data from Distributed Sources. (arXiv:2109.14046v2 [stat.ML] UPDATED)
    Objectives: This paper develops two algorithms to achieve federated generalized linear mixed effect models (GLMM), and compares the outcomes of the developed models with each other and with those from the standard R package ('lme4'). Methods: The log-likelihood function of GLMM is approximated by two numerical methods (Laplace approximation and Gaussian Hermite approximation), which supports federated decomposition of GLMM to bring computation to data. Results: Our developed method can handle GLMM to accommodate hierarchical data with multiple non-independent levels of observations in a federated setting. The experiment results demonstrate comparable (Laplace) and superior (Gaussian-Hermite) performances with simulated and real-world data. Conclusion: We developed and compared federated GLMMs with different approximations, which can support researchers in analyzing biomedical data to accommodate mixed effects and address non-independence due to hierarchical structures (e.g., institutes, region, country, etc.).
    Predictions of Electromotive Force of Magnetic Shape Memory Alloy (MSMA) Using Constitutive Model and Generalized Regression Neural Network. (arXiv:2206.03701v1 [cond-mat.mtrl-sci])
    Ferromagnetic shape memory alloys (MSMAs), such as Ni-Mn-Ga single crystals, can exhibit the shape memory effect due to an applied magnetic field at room temperature. Under a variable magnetic field and a constant bias stress loading, MSMAs have been used for actuation applications. This work introduced a new feature to the existing macroscale magneto-mechanical model for Ni-Mn-Ga single crystal. This model includes the fact that the magnetic easy axis in the two variants is not exactly perpendicular as observed by D silva et al. This offset helps explain some of the power harvesting capabilities of MSMAs. Model predictions are compared to experimental data collected on a Ni-Mn-Ga single crystal. The experiments include both stress-controlled loading with constant bias magnetic field load (which mimics power harvesting or sensing) and field-controlled loading with constant bias compressive stress (which mimics actuation). Each type of test was performed at several different load levels, and the applied field was measured without the MSMA specimen present so that demagnetization does not affect the experimentally measured field as suggested by Eberle et al. Results show decent agreement between model predictions and experimental data, although the model does not capture all the features of the experimental data. To capture these remaining features, a generalized regression neural network (GRNN) was trained on the experimental data (stress, strain, magnetic field, and emf) so that it can make a better prediction.
    Inferring Lexicographically-Ordered Rewards from Preferences. (arXiv:2202.10153v2 [cs.LG] UPDATED)
    Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.
    A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima. (arXiv:2203.01123v2 [math.OC] UPDATED)
    Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, and meta-learning. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such problems apply only to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization as constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple inner minima challenge, but also features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.
    High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. (arXiv:2206.04030v1 [stat.ML])
    We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. Interestingly, we find a critical scaling regime for the step-size below which the effective ballistic dynamics matches gradient flow for the population loss, but at which, a new correction term appears which changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations.
    Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games. (arXiv:2206.04044v1 [cs.LG])
    This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v1 [cs.LG])
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performance on various personalised federated learning tasks.
    Attribution of Predictive Uncertainties in Classification Models. (arXiv:2107.08756v3 [cs.LG] UPDATED)
    Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.
    Decentralized Online Regularized Learning Over Random Time-Varying Graphs. (arXiv:2206.03861v1 [cs.LG])
    We study the decentralized online regularized linear regression algorithm over random time-varying graphs. At each time step, every node runs an online estimation algorithm consisting of an innovation term processing its own new measurement, a consensus term taking a weighted sum of its own estimate and those of its neighbors with additive and multiplicative communication noises, and a regularization term preventing over-fitting. It is not required that the regression matrices and graphs satisfy special statistical assumptions such as mutual independence, spatio-temporal independence or stationarity. We develop the nonnegative supermartingale inequality of the estimation error, and prove that the estimates of all nodes converge to the unknown true parameter vector almost surely if the algorithm gains, graphs and regression matrices jointly satisfy the sample path spatio-temporal persistence of excitation condition. In particular, this condition holds by choosing appropriate algorithm gains if the graphs are uniformly conditionally jointly connected and conditionally balanced, and the regression models of all nodes are uniformly conditionally spatio-temporally jointly observable, under which the algorithm converges in mean square and almost surely. In addition, we prove a regret upper bound of $\mathcal O(T^{1-\tau}\ln T)$, where $\tau\in (0.5,1)$ is a constant depending on the algorithm gains.
    Asymptotic Stability in Reservoir Computing. (arXiv:2206.03854v1 [cs.NE])
    Reservoir Computing is a class of Recurrent Neural Networks with internal weights fixed at random. Stability relates to the sensitivity of the network state to perturbations. It is an important property in Reservoir Computing as it directly impacts performance. In practice, it is desirable to stay in a stable regime, where the effect of perturbations does not explode exponentially, but also close to the chaotic frontier where reservoir dynamics are rich. Open questions remain today regarding input regularization and discontinuous activation functions. In this work, we use the recurrent kernel limit to draw new insights on stability in reservoir computing. This limit corresponds to large reservoir sizes, and it already becomes relevant for reservoirs with a few hundred neurons. We obtain a quantitative characterization of the frontier between stability and chaos, which can greatly benefit hyperparameter tuning. In a broader sense, our results contribute to understanding the complex dynamics of Recurrent Neural Networks.
    On gradient descent training under data augmentation with on-line noisy copies. (arXiv:2206.03734v1 [stat.ML])
    In machine learning, data augmentation (DA) is a technique for improving generalization performance. In this paper, we mainly consider gradient descent for linear regression under DA using noisy copies of datasets, in which noise is injected into the inputs. We analyze the situation where random noisy copies are newly generated and used at each epoch, i.e., the case of on-line noisy copies. This can be viewed as an analysis of noise injection into the training process carried out in a DA manner, i.e., an on-line version of DA. We derive the averaged behavior of the training process in three situations: full-batch training under the sum of squared errors, and full-batch and mini-batch training under the mean squared error. We show that, in all cases, training under DA with on-line copies is approximately equivalent to ridge regression training whose regularization parameter corresponds to the variance of the injected noise. We also show that the learning rate is multiplied by the number of noisy copies plus one for full-batch training under the sum of squared errors and for mini-batch training under the mean squared error; i.e., DA with on-line copies yields an apparent acceleration of training. The apparent acceleration and the regularization effect come from the original part and the noise in the copied data, respectively. These results are confirmed in a numerical experiment. In the numerical experiment, we found that our result applies approximately to the usual off-line DA in the under-parameterized scenario but not in the over-parameterized scenario. Moreover, we experimentally investigated the training process of neural networks under DA with off-line noisy copies and found that our analysis of linear regression can be applied to neural networks.
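    The link between input-noise injection and ridge regression mentioned in the abstract above can be illustrated with a small sketch. It checks the standard expected-loss identity for a linear model, E[(y - w·(x + e))^2] = (y - w·x)^2 + sigma^2 ||w||^2 for zero-mean noise e with variance sigma^2 per coordinate. This is not the paper's gradient-descent analysis; the data, weights, and function names below are made up for illustration.

```python
import random

def noisy_copy_loss(w, xs, ys, sigma, n_draws, rng):
    """Monte Carlo estimate of the MSE when fresh Gaussian noise is added to the inputs."""
    total = 0.0
    for _ in range(n_draws):
        for x, y in zip(xs, ys):
            x_noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
            pred = sum(wi * xi for wi, xi in zip(w, x_noisy))
            total += (y - pred) ** 2
    return total / (n_draws * len(xs))

def ridge_loss(w, xs, ys, sigma):
    """Clean MSE plus an L2 penalty whose weight is the noise variance."""
    mse = sum(
        (y - sum(wi * xi for wi, xi in zip(w, x))) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)
    return mse + sigma ** 2 * sum(wi ** 2 for wi in w)

# Hypothetical toy data and weights (assumptions, not from the paper).
rng = random.Random(0)
xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
ys = [1.0, -0.5, 0.2, 1.5]
w = [0.7, -0.4]
sigma = 0.5

noisy = noisy_copy_loss(w, xs, ys, sigma, n_draws=20000, rng=rng)
ridge = ridge_loss(w, xs, ys, sigma)
print(f"noisy-copy loss ~ {noisy:.3f}, ridge loss = {ridge:.3f}")
```

    The two numbers should agree up to Monte Carlo error, which is the sense in which the noise variance plays the role of the ridge regularization parameter.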
    Structure-Aware Transformer for Graph Representation Learning. (arXiv:2202.03036v2 [stat.ML] UPDATED)
    The Transformer architecture has gained growing attention in graph representation learning recently, as it naturally overcomes several limitations of graph neural networks (GNNs) by avoiding their strict structural inductive biases and instead only encoding the graph structure via positional encoding. Here, we show that the node representations generated by the Transformer with positional encoding do not necessarily capture structural similarity between them. To address this issue, we propose the Structure-Aware Transformer, a class of simple and flexible graph Transformers built upon a new self-attention mechanism. This new self-attention incorporates structural information into the original self-attention by extracting a subgraph representation rooted at each node before computing the attention. We propose several methods for automatically generating the subgraph representation and show theoretically that the resulting representations are at least as expressive as the subgraph representations. Empirically, our method achieves state-of-the-art performance on five graph prediction benchmarks. Our structure-aware framework can leverage any existing GNN to extract the subgraph representation, and we show that it systematically improves performance relative to the base GNN model, successfully combining the advantages of GNNs and Transformers. Our code is available at https://github.com/BorgwardtLab/SAT .
    Learning Interpretable Decision Rule Sets: A Submodular Optimization Approach. (arXiv:2206.03718v1 [cs.LG])
    Rule sets are highly interpretable logical models in which the predicates for decision are expressed in disjunctive normal form (DNF, OR-of-ANDs), or, equivalently, the overall model comprises an unordered collection of if-then decision rules. In this paper, we consider a submodular optimization based approach for learning rule sets. The learning problem is framed as a subset selection task in which a subset of all possible rules needs to be selected to form an accurate and interpretable rule set. We employ an objective function that exhibits submodularity and thus is amenable to submodular optimization techniques. To overcome the difficulty arising from the exponential-sized ground set of rules, the subproblem of searching for a rule is cast as another subset selection task that asks for a subset of features. We show it is possible to write the induced objective function for the subproblem as a difference of two submodular (DS) functions to make it approximately solvable by DS optimization algorithms. Overall, the proposed approach is simple, scalable, and likely to benefit from further research on submodular optimization. Experiments on real datasets demonstrate the effectiveness of our method.
    Logistic Regression Through the Veil of Imprecise Data. (arXiv:2106.00492v2 [stat.ME] UPDATED)
    Logistic regression is an important statistical tool for assessing the probability of an outcome based upon some predictive variables. Standard methods can only deal with precisely known data; however, many datasets have uncertainties which traditional methods either reduce to a single point or completely disregard. In this paper we show that it is possible to include these uncertainties by considering an imprecise logistic regression model using the set of possible models that can be obtained from values within the intervals. This has the advantage of clearly expressing the epistemic uncertainty removed by traditional methods.
    An Analysis of Selection Bias Issue for Online Advertising. (arXiv:2206.03853v1 [cs.IR])
    In online advertising, a set of potential advertisements can be ranked by a certain auction system in which usually the top-1 advertisement is selected and displayed in an advertising space. In this paper, we show a selection bias issue that is present in an auction system. We show that the selection bias destroys the truthfulness of the auction, which implies that the buyers (advertisers) in the auction cannot maximize their profits. Although selection bias is well known in the field of statistics and has been studied extensively, our main contribution is to combine the theoretical analysis of the bias with the auction mechanism. In our experiment using online A/B testing, we evaluate the selection bias on an auction system whose ranking score is a function of the predicted CTR (click-through rate) of an advertisement. The experiment showed that the selection bias is drastically reduced by using multi-task learning that learns from the data for all advertisements.
    Data fission: splitting a single data point. (arXiv:2112.11079v3 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2021) offers an alternative route of accomplishing this task through randomization of $X$ with additive Gaussian noise which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
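    The abstract above mentions randomizing $X$ with additive Gaussian noise but does not spell out a construction. As a hedged illustration, here is a minimal sketch of one standard Gaussian case, assuming the noise standard deviation $\sigma$ is known: for $X \sim N(\mu, \sigma^2)$ and independent $Z \sim N(0, \sigma^2)$, the parts $f(X) = X + \tau Z$ and $g(X) = X - Z/\tau$ are independent, while $X = (f + \tau^2 g)/(1 + \tau^2)$ is recovered exactly from both. The function names and parameter values are assumptions made for this example.

```python
import random

def fission_gaussian(x, tau, sigma, rng):
    """Split x into two parts using fresh Gaussian noise with known std sigma."""
    z = rng.gauss(0.0, sigma)
    return x + tau * z, x - z / tau

def sample_correlation(a, b):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# Hypothetical settings (assumptions): mean 2, unit variance, tau = 0.5.
rng = random.Random(0)
mu, sigma, tau, n = 2.0, 1.0, 0.5, 50000

xs = [rng.gauss(mu, sigma) for _ in range(n)]
parts = [fission_gaussian(x, tau, sigma, rng) for x in xs]
fs = [p[0] for p in parts]
gs = [p[1] for p in parts]

corr = sample_correlation(fs, gs)
max_recon_error = max(
    abs((f + tau ** 2 * g) / (1 + tau ** 2) - x) for x, f, g in zip(xs, fs, gs)
)
print(f"corr(f, g) ~ {corr:.4f}, max reconstruction error = {max_recon_error:.2e}")
```

    The sample correlation should be close to zero, and because the two parts are jointly Gaussian, zero correlation means independence. Neither part alone determines $X$, but together they do.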

  • Open

    [P] WebtoonMe Project: Selfie to Webtoon style (you can try the demo app for free)
    https://www.reddit.com/r/MachineLearning/comments/sfbtds/p_webtoonme_project_selfie_to_webtoon_style/?utm_source=share&utm_medium=web2x&context=3 project page: https://github.com/webtoon/WebtoonMe demo page: https://webtoon.github.io/WebtoonMe/app.html submitted by /u/jis478
    [P] GPT3 generation of news stories about AI
    Here's a fun little project I did today on a whim. I happen to have access to the OpenAI API, so I used their playground feature to generate AI headlines with their taglines. I fed it this prompt (sourced from the latest edition of Last Week in AI; I co-run it, apologies for the plug): Last week's top AI news: * Caltech unit creates AI helping drones to withstand violent winds - "Caltech researchers are developing a drone with rapidly reacting artificial intelligence (AI) capacities that allow it to adapt in flight to extreme wind similar to tornado or hurricane conditions." * How Deep Squeak, an AI program with a weird name, is detecting whales - "Artificial Intelligence is booming. And now an AI program is being used to search for whales." * Ex-golf pro links with Seattle-area AI e…
    [P] Lorcan Mini robot running fast with AOgmaNeo reinforcement learning
    Hi everyone, I decided to finally write another blog post. This one is about an RL demo we gave at a local conference, involving a tiny quadruped robot that learns to scramble across the floor very quickly. It learns by first mimicking a hand-made policy, and is then trained further in the real world. Our technology is called Sparse Predictive Hierarchies (SPH), and the library that implements it is called AOgmaNeo. It's a biologically-inspired low-compute sparse online learning system. We are also working on a GPU version of SPH again, so I included that in the post as well. Enjoy! https://ogma.ai/2022/06/aogmaneo-lorcan-mini-robot-demo-clogmaneo/ submitted by /u/CireNeikual
    [R] Reading list of #ImplicitRepresentations and #NeRF papers relating to #Robotics
    Interested in a reading list of #ImplicitRepresentations and #NeRF papers relating to #Robotics? Check out this list of papers inspired by awesome-computer-vision. https://github.com/zubair-irshad/Awesome-Implicit-NeRF-Robotics… Feel free to share with others! Contributions/Suggestions are welcome. submitted by /u/KaleidoscopeBest1569
    [D][P] Grounding language to visual observation
    Hi, in my current project, I have a language observation and a visual observation that I would like to encode into the same context embedding. The language observation is a description of the visual observation. The goal is to ground the language in the observation. Ultimately, I need one Observation Encoder and one Language Encoder that take different inputs but output similar context vectors. What technique would make that possible? My first idea was to train the Observation Encoder on another task, and then teach the Language Encoder to predict the same context vector as the Observation Encoder (minimizing cross-entropy). But there may be a better approach, maybe using techniques I'm not aware of. I looked briefly into Shared Latent Spaces, but was not sure that it would fit my problem statement. Was I wrong? Do you know of any other method I could look into? Thanks! submitted by /u/Maxtoq
    [D] Looking for paper on infinite stacking of hyperparam optimizers
    A few years ago, I remember seeing a paper on using optimizers to optimize optimizers. The initial premise was that if you have a model and an optimizer, you need to optimize the hyperparams of the optimizer so you can add a sort of hyperoptimizer on top. But this hyperoptimizer also has hyperparams so they then explore what happens when you start stacking more and more of these hyperoptimizers on top of each other. I believe one of the conclusions was that in the limit, model behaviour ends up being independent of the top-level choice of hyperparameters. I've been trying to find this paper again recently but haven't been able to. Would greatly appreciate any help finding it! submitted by /u/ilia10000
    [P] Real-time AR for jewelry virtual try on that looks real, done with joliGAN, based on a few 2D videos and no 3D model
    Some of our work with GANs recently emerged from stealth: https://www.linkedin.com/feed/update/urn:li:activity:6939837590304899072/ The hands are real, but the rings are rendered with a GAN in real-time. A first network detects where to render the ring, a second network does the rendering. There's no 3D model, it's purely 2D to 2D. https://preview.redd.it/9qlgbkyeue491.png?width=1936&format=png&auto=webp&s=ceadda604db236dd3f7d8b665843e786512128b8 We thought we'd share some technical details since the underlying code, JoliGAN, is Open Source, https://github.com/jolibrain/joliGAN - The GAN uses a combination of mobile ResNets with attention as a Generator, along with a projected Discriminator [1]. Depending on the stone, we sometimes use transformers as well (customized Segformers and ViT mostly). A series of additional neural networks act as semantic constraints on the space of GAN transforms. - Real-time is achieved through our full C++ Open Source backend DeepDetect, https://github.com/jolibrain/deepdetect. We use CUDA along with OpenCV and TensorRT to chain multiple models (ring detection and generator mostly), and we make sure the data remain within CUDA memory at all times. This allows us to reach ~60 FPS on a 1080Ti and 20% more on average on an RTX3090. JoliGAN is a powerful tool for domain to domain adaptation, with applications to AR, dataset augmentation, and sim2real transformation mostly. Documentation is scarce as the software is essentially used by us for solving our customers' problems. But hey, it's open :) [1] https://arxiv.org/abs/2111.01007 submitted by /u/pilooch
    [Discussion] Should we still fly to conferences?
    Now that COVID appears to be less of a problem in many parts of the world, conferences are gradually returning to a physical format. But something has changed: we now know that online conferences are possible. Many here have probably had a mixed (very negative?) experience with the virtual conferences. It probably hasn't yet reached its best form to foster collaboration for the worldwide research community. But what would have been almost unimaginable before 2020 has now been tested repeatedly! This brings me to my question: Should we still burn insane amounts of plane fuel to fly to the other end of the planet to present a paper/poster and come back home 3 days later? Also, as a Ph.D. student, should I refuse to attend a conference because it is too far from where I work, knowing that th…
    [Discussion] Why is the Competing Conventions Problem in Neuroevolution a problem?
    The Competing Conventions problem, or Permutation problem, is a problem that occurs in neuroevolution. It arises when there is more than one way to represent a network as a genotype. Figure: The competing conventions problem [Evolving neural networks through augmenting topologies; Stanley, Miikulainen; 2002]. When two different genotypes that represent the same neural network are recombined during crossover, the emerging offspring is likely to be damaged and missing information. The figure above visualizes the problem for a small neural network. Since the order of the three hidden nodes A, B and C has no influence on the resulting function, the network can be represented by 3! = 6 different permutations. When two of these permutations are recombined during crossover, the resulting offspring is missing information. As depicted in the figure, the combination of [A, B, C] and [C, B, A] will result in either [A, B, A] or [C, B, C], both of which lack 1/3 of the main components that both their parents had. There is also the problem that the search space is enormously enlarged by all the permutations, but my question refers to the first part of the problem. Why is it a problem that the children of two genotype permutations of the same underlying neural network miss information from their parents? From my understanding, the point of crossover is also exploration, so why are these networks considered damaged, while in other situations this is considered innovation? Offspring is supposed to be different from its parents, otherwise change would only happen through mutations and be completely random. I have tried to find an explanation, but every paper just seems to take it as a given that the offspring is damaged. submitted by /u/loeffner
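    The crossover example in the post above can be reproduced with a short sketch. It assumes single-point crossover on a list of hidden-node labels, with the cut placed after the second gene; the function name and the cut point are choices made for this illustration, not taken from the NEAT paper.

```python
from itertools import permutations

def single_point_crossover(parent1, parent2, point):
    """Swap the tails of two genotypes after the given cut point."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

hidden_nodes = ["A", "B", "C"]

# Every ordering of the hidden nodes encodes the same network function.
encodings = list(permutations(hidden_nodes))

# Two genotypes that represent the same network under different conventions.
parent1 = ["A", "B", "C"]
parent2 = ["C", "B", "A"]
child1, child2 = single_point_crossover(parent1, parent2, point=2)

# Hidden nodes that each child lost compared with its parents.
missing1 = set(hidden_nodes) - set(child1)
missing2 = set(hidden_nodes) - set(child2)

print(len(encodings), child1, child2, missing1, missing2)
```

    Each child duplicates one hidden node and drops another, even though both parents compute the same function. Crossover between two identical phenotypes therefore yields a different, smaller network rather than a recombination of distinct traits.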
    [D] Extracting next action from conversation
    Hello people, I have an NLP problem and I would like some pointers about how to approach it. The problem is the following: I want to extract an action from a conversation transcript. Let's say we have a transcript of a conversation that ends in a certain decision (meet again, do this thing or send a message/email, etc.). I want to extract a sentence that summarizes the final intent of the conversation, for example, "Meet again tomorrow". I have considered different approaches for now: - Intent extraction models such as https://github.com/thuiar/textoir. My problem with this approach is that they are multi-label classifiers and usually focused on single-sentence classification: "Can you get me a table?" would be assigned to the "Reservation" label. I feel that I would lose information such as "Meet at 10PM at this address." - Question answering models that answer a question such as "What will they do after the conversation?". I have the feeling that QA models are not designed for this kind of task. I would really appreciate some pointers, such as the name of this task in the NLP field. Thanks a lot for reading my post! submitted by /u/LanverYT
    [P] Featureform: Open-Source Virtual Feature Store
    Hey everyone! We’re excited to announce the open-source version of Featureform, an extensible feature store. We’ve found that existing feature stores are either too heavy and replace your existing infrastructure, or don’t handle transformations at all and simply store features. We built a feature store that’s a happy medium between the two: it orchestrates your existing infrastructure to work like a feature store. We wrote more about this in our blog post. Check out the repo: https://github.com/featureform/featureform https://preview.redd.it/vwpe0uypje491.png?width=2084&format=png&auto=webp&s=f81f7447f2c35081b2ae63e885506b9187a73d7b What Is Featureform Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features. Featuref…
    Measuring distances from known objects [P]
    I am a member of a Formula Student team that is building its first autonomous race car. Our track limits are defined by cones of known size placed on each side of the road, yellow on the right-hand side and blue on the left (see Images). Naturally, we are interested in measuring our distance from them so that we can map the circuit. I want your opinion on which method would yield the most accurate results. What we are currently doing is running Yolo(v5) to extract bounding boxes; each box then goes through an additional neural network that outputs 7 keypoints of the cone (see Images). Because we know the exact positions of these keypoints relative to each other, we can then turn this into a Perspective-n-Point problem. https://preview.redd.it/nzbv4m1byd491.png?width=1218&format=png&auto=webp&s=f3f605767109cfc2a8e41d9b279779c542e27b10 https://preview.redd.it/n0y28oi9yd491.png?width=200&format=png&auto=webp&s=a90057af43b3f185e6fdd4755037505eee96c0b9 submitted by /u/Commercial_Put577
    [P][N] Just launched - nebulgym, a new open-source that accelerates AI training (~1.5-2x as of now) in a few lines of code without requiring you to change your training setup
    Training always takes too long. If it takes an hour, it would be better if it took 30 minutes, or maybe 15 minutes... or just 1 minute, why not? And if you want to speed up training, the available techniques usually require increasing the complexity of the training process, whether by trading off accuracy or by spending developer time learning a new framework. Often it's trial and error, playing with parameters, training recipes, or switching framework/model. That's definitely not ideal. “Fast & easy-to-use” These were keywords that motivated me to work on a new way of doing training, the library nebulgym, which now is open-source (github link). Fast Training should be fast, period. Wouldn't it be great if in the near future you could train a GPT3 from scratch on your l…
    [D] What object detectors have the capability to harness relationship between its detected boxes?
    Typical object detectors do not exploit relationships between the detected boxes; no context is involved. In my problem's case, there are two requirements that would lead to drastically better results if some form of context were shared across detected boxes. Requirement #1: It is a multi-class, but single-label problem. There are N classes, but each class can appear at most once (0 or 1 instances). Hence, each detection kind of needs to know whether the other detections have already predicted that class. Requirement #2: There is some form of ordering between the predictions based on their proximity to each other. For example, Class 4 should only appear near Class 5-6 and Class 2-3, but should not be anywhere near Class 32. Is there any architecture that is optimized for this kind of object detection? submitted by /u/sarmientoj24
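    Requirement #1 in the post above (at most one instance per class) can at least be enforced after the fact. The following is a minimal post-processing sketch, not a detector architecture, and it does not address Requirement #2. The detection dictionary layout (`label`, `score`, `box`) is a hypothetical format chosen for this example.

```python
def keep_best_per_class(detections):
    """Keep only the highest-scoring detection for each class label."""
    best = {}
    for det in detections:
        label = det["label"]
        if label not in best or det["score"] > best[label]["score"]:
            best[label] = det
    # Sort by label so the output order is deterministic.
    return sorted(best.values(), key=lambda d: d["label"])

# Hypothetical detector output: class 4 was predicted twice.
detections = [
    {"label": 4, "score": 0.91, "box": (10, 10, 50, 50)},
    {"label": 4, "score": 0.62, "box": (200, 15, 240, 60)},
    {"label": 5, "score": 0.80, "box": (55, 12, 95, 52)},
]

filtered = keep_best_per_class(detections)
for det in filtered:
    print(det["label"], det["score"], det["box"])
```

    This greedy filter discards information the model could have used jointly, which is exactly the limitation the post is asking about. It is only a baseline to compare context-aware approaches against.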
    [D] ML/DL computer build with PCIe 5.0 x8 lanes for RTX 3090
    I'm building my first ML/DL computer around the ASUS ProArt Z690 motherboard, which has 2 PCIe 5.0 slots (x8 each) and a PCIe 3.0 x16 slot. The CPU is an i9-12900K, which comes with 20 PCIe lanes. Since 4 lanes will go to a single NVMe, I think the motherboard will split the two PCIe 5.0 slots into x8 lanes each. My current build is with a single RTX 3080 12GB, but I want to be able to upgrade to 2x3090Ti (or 2x4090) in the future, if needed. This article from 2018 seems to imply that DL is unaffected even when PCIe 4.0 x4 is used for up to 2 GPUs. I just want to confirm that this is still considered sound advice. In other words, 2x3090 GPUs won't be throttled by PCIe 5.0 x8 lanes, which is equivalent to PCIe 4.0 x16 (see matrix below). In fact, I'm also wondering if running PCIe 5.0 even at x2 lanes each won't throttle the GPUs, since the equivalent transfer rate is still PCIe 4.0 x4, as mentioned in that article. Or does the fact that the motherboard interface is PCIe 5.0 not matter, since the GPU can support only up to PCIe 4.0 speeds? Any other comments on my build would be welcome: PCPartpicker. This will be an everyday computer as well, used to edit photos/videos and for other analyses, hence the more powerful CPU and the NVMe, which might not matter as much for ML. PCIe Lane vs Speed matrix submitted by /u/Scapius [link] [comments]  ( 7 min )
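The lane arithmetic in the question can be checked with a few lines. This is a rough sketch using the nominal per-lane transfer rates (8, 16 and 32 GT/s for PCIe 3.0, 4.0 and 5.0) and 128b/130b encoding; the function names are illustrative, and real-world throughput will be lower because of protocol overhead.

```python
# Nominal per-lane transfer rates in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130  # 128b/130b line encoding used by PCIe 3.0 and later

def bandwidth_gb_s(gen, lanes):
    """Approximate one-direction bandwidth in GB/s for a PCIe link."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes

def negotiated_gen(slot_gen, device_gen):
    """A link runs at the highest generation that both ends support."""
    return min(slot_gen, device_gen)

gen5_x8 = bandwidth_gb_s(5, 8)
gen4_x16 = bandwidth_gb_s(4, 16)
# An RTX 3090 is a PCIe 4.0 device, so in a 5.0 x8 slot it links at 4.0 x8.
rtx3090_in_gen5_x8 = bandwidth_gb_s(negotiated_gen(5, 4), 8)
```

On paper, 5.0 x8 and 4.0 x16 carry the same bandwidth, but a PCIe 4.0 card in a 5.0 x8 slot gets 4.0 x8, which is half of that figure.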
    [R] What are some interesting and mysterious open problems of generalization in ML?
    I find the generalization problems of machine learning, especially in deep learning, very attractive, and I wonder which open problems are attracting attention nowadays. I know about the double descent problem, which I believe is quite interesting and does not have a settled answer at this moment. I also know about the implicit inductive bias introduced by SGD, but it seems to have been studied widely recently, especially with the tool of NTK. What are some other interesting phenomena like these mysteries? submitted by /u/pizzaUnderSea [link] [comments]  ( 2 min )
    [R] Differentiable Finite State Machines (Blog Post)
    submitted by /u/hardmaru [link] [comments]
    [R] Intra-agent speech permits zero-shot task acquisition
    submitted by /u/hardmaru [link] [comments]
    [R] From data to functa: Your data point is a function and you can treat it like one
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
  • Open

    Avoid PyBullet collision between gripper and object
    Hello, I am developing an environment in PyBullet for RL policies and I am trying to simplify some stuff. Basically, I have a Sawyer robot that needs to grip something. Let me show you a video so I can explain the issue: https://i.redd.it/pcp2lqf67h491.gif As you can see, when the gripper collides with the 'table' it closes due to the collision forces (I am assuming). However, I would like to 'disable' this and make sure that the gripper doesn't move further due to external forces. How could I do this? Is there a PyBullet method to do so? Would I need to change the URDF of the robot? Thanks for the help submitted by /u/gabrigoo [link] [comments]  ( 1 min )
    [P] Lorcan Mini robot running fast with AOgmaNeo reinforcement learning
    submitted by /u/CireNeikual [link] [comments]  ( 1 min )
    How to run parallel for-loop with reinforcement learning inside? Parallelized version gives incorrect output.
    I cannot for the life of me figure out what I'm doing wrong. I'm using StableBaselines3 in Google Colab. I am trying to basically do some cross validation to search for hyperparams for a reinforcement learning model. I know that SB3 has some functions to allow parallelization of agents (multiple agents, multiprocessing), but I cannot use them because I am using a wrapper called ActionMasker, which doesn't work with the multiprocessing of SB3. To be clear: My RL agent's environment is determined by a data table (not a 'simulation' environment like a game). Basically, the code is running an outer for-loop which is supposed to shift a window along a data table, where models are trained with different parameters (inner for loop), the best parameters are determined, and then one model is trained on …  ( 2 min )
    should different actions have their own output slot even if not a valid action based on state?
    For my problem at every state of an episode the agent will always have two actions to choose from. One action is actually a "non-action" and the other is the "action", but depending on the state the action can mean two very different things, such that they are really two separate actions. My current line of thinking is that I do not want my model to spend any effort on trying to predict the reward for an invalid action, so I just have two outputs and try to let my model decide what action it is actually taking based on the input state. (To be clear I have about 300 inputs, and one input is a binary that defines which action is actually taken). I think that theoretically the model should be able to figure this out. Does this method have any merit? Or should I really have 3 outputs, let my model waste efforts modelling the reward for an invalid output (and just do an argmax for the valid actions), for the tradeoff of us clearly delineating the different types of actions that can be taken so that my model doesn't have to try and figure it out based on the input state. submitted by /u/Yogi_DMT [link] [comments]  ( 1 min )
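The three-output alternative described above amounts to action masking: the network scores every action slot, and the argmax is restricted to the slots that are valid in the current state. A minimal sketch, assuming a hypothetical layout where slot 0 is the non-action and slots 1 and 2 are the two state-dependent actions:

```python
import math

def masked_argmax(q_values, valid_mask):
    """Return the index of the best valid action; invalid slots are ignored."""
    best_index, best_value = None, -math.inf
    for index, (q, valid) in enumerate(zip(q_values, valid_mask)):
        if valid and q > best_value:
            best_index, best_value = index, q
    if best_index is None:
        raise ValueError("no valid action in this state")
    return best_index

# Hypothetical Q-values for one state; slot 1 is invalid here.
q_values = [0.2, 1.5, 0.9]
choice = masked_argmax(q_values, valid_mask=[True, False, True])
```

With masking, the value predicted for an invalid slot never influences the chosen action, although the network still spends capacity producing it.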
    Theoretical Research in RL?
    Hello! I am currently doing a course in reinforcement learning and am planning to do my master's thesis in the fall term, so I am starting to think about a topic. I would definitely like to do theoretical research without much coding. Coding for experiments is cool and fine, but the main part of the work shouldn't be coding. As far as I can see now, the whole field is covered by 'hands-on coding research'. Therefore, I am asking here: Are there research topics in reinforcement learning which target theoretical aspects? (Convergence, Analysis of algorithms, Approximation guarantees, ...) If any of you has an idea or a starting paper for me, I would really appreciate it! submitted by /u/Insighteous [link] [comments]  ( 1 min )
    Let’s learn about Deep Q-Learning by training our agent to play Space Invaders (Deep Reinforcement Learning Free Class by Hugging Face 🤗)
    Hey there! We just published the third Unit of Deep Reinforcement Learning Class 🥳. In this Unit, you'll learn about Deep Q-Learning and train a DQN agent to play Atari games using RL-Baselines3-Zoo. You’ll be able to compare the results of your Q-Learning agent using the leaderboard. The Deep Q-Learning chapter 👉 https://huggingface.co/blog/deep-rl-dqn The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit3/unit3.ipynb The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard https://i.redd.it/mq8fqnmkxe491.gif Deep RL Class is a free, self-paced course, from beginner to expert, where you’ll get solid foundations of Deep Reinforcement Learning in theory and practice, with hands-on work using famous RL libraries such as SB3, RL-Baselines3-Zoo, RLlib, CleanRL… You can sign up here 👉 http://eepurl.com/h1pElX And if you have questions and feedback I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 1 min )
    Performance of RL vs supervised learning
    I was wondering if there were any studies directly comparing the two. I want to predict the next state in an environment and can either use RL to do so or generate a dataset and do supervised learning on that. Which do you hypothesise to be better and why? submitted by /u/SuperDuperDooken [link] [comments]  ( 1 min )
    Looking for implementation of normalised percentiles for evaluating RL agents
    I was wondering where I could find a software implementation of the technique used in the work "Open-Ended Learning Leads to Generally Capable Agents" (https://arxiv.org/abs/2107.12808) for evaluating agents. It requires computing normalised percentiles and pareto dominance and is described in Section 4.1. submitted by /u/dr_cosmicomical [link] [comments]  ( 1 min )
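The two ingredients named in the question can be sketched generically. This is not the paper's exact procedure: the per-task baseline used for normalisation, the nearest-rank percentile definition, and the chosen percentile levels are all assumptions made for illustration.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def percentile_profile(scores, baselines, percentiles=(10, 20, 50)):
    """Normalise each task score by a per-task baseline, then take percentiles."""
    normalised = [s / b for s, b in zip(scores, baselines)]
    return [percentile(normalised, p) for p in percentiles]

def pareto_dominates(a, b):
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

baselines = [10, 10, 10, 10, 10]            # hypothetical per-task reference scores
profile_a = percentile_profile([5, 6, 7, 8, 9], baselines)
profile_b = percentile_profile([4, 6, 6, 8, 9], baselines)
a_beats_b = pareto_dominates(profile_a, profile_b)
```

Comparing whole percentile profiles rather than a single mean means one agent only counts as better when it is no worse at every percentile level considered.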
    Inference with Rainbow
    Hi guys! I am using Rainbow for an environment, and I see progress in the training logs. However, when I test my model checkpoints, the agent commits to only one action and, of course, does not achieve the performance shown in training. What do you think the causes could be? Or is there something specific that has to be done with Rainbow when doing inference? Thank you! submitted by /u/xWh0am1 [link] [comments]  ( 1 min )
    Have you used any good DRL library?
    Hey, friends, have you used some useful DRL libraries? I hope you can recommend some useful DRL libraries to me! Or what should I pay attention to when choosing a library? I found this summary on github, and it looks pretty complete: https://github.com/wwxFromTju/awesome-reinforcement-learning-lib submitted by /u/AnnualGas3585 [link] [comments]  ( 1 min )
  • Open

    Stanford AI Researchers Propose ‘LinkBERT’: A New Pretraining Method That Improves Language Model Training with Document Links
    👉 LinkBERT consists of three steps: (1) obtaining links between documents to build a document graph from the text corpus, (2) creating link-aware training instances from the graph by placing linked documents together, and finally (3) pretraining the LM with link-aware self-supervised tasks: masked language modeling (MLM) and document relation prediction (DRP). 👉 LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA) Continue reading | Check out the paper, github and blog post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    DISCO DIFFUSION 3D AI ART ANIMATION | TRANQUIL BLISS
    submitted by /u/Available_Tadpole829 [link] [comments]
    New Artificial Skin Lets Bionic Arm Or AI Robot Touch & Feel With Extreme Sensitivity | Photonic Chip Processes & Classifies 2 Billion Images Per Second Without Memory Device
    submitted by /u/getrich_or_diemining [link] [comments]
    DISCO DIFFUSION 3D AI ART ANIMATION | VANAHEIM HOME OF THE VANIR GODS
    submitted by /u/Available_Tadpole829 [link] [comments]
    Lamp Vase.
    submitted by /u/cookingandcraft [link] [comments]
    in this article, we showcase how to build an NLP project from zero to hero
    submitted by /u/UBIAI [link] [comments]
    Aquaman - Neural-Art Parody / [4K] Creative Experiment w/ GPT-3, VQGAN+CLIP
    submitted by /u/MLInsights [link] [comments]
    Is it possible
    Is it possible to make an AI that plays games with you, in any game that allows two players or split screen? submitted by /u/OrdinarySlight6992 [link] [comments]  ( 1 min )
    Self study plan for AI?
    I am a recent high school graduate. I have been very eager to begin dabbling with AI this summer. So far, I have been following "Artificial Intelligence: A Modern Approach" and I have reached the second chapter over the past few weeks, but I do not yet have a solid learning plan. I just study bit by bit every other day. I would like to form a solid plan for this summer and I was wondering if anyone has any advice for me. I've completed Calculus 1 in school and I am considering studying Linear Algebra along with AI, but I would like to have some advice on that as well. Is going through AIMA over summer a good plan? Should I start linear algebra along with it? How do I make a study plan that will make me end up actually learning something by the end of summer? If AIMA is not the best resource for my case, what do you recommend for me to follow and what kind of plan should I build? Thank you so much in advance! submitted by /u/obvslynot [link] [comments]  ( 2 min )
    ML is way more fun when you learn/work with someone. A Discord server where anyone learning/working in ML can come and share their projects, learn together, find jobs, and much more now with 25'000+ members.
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Open AI...asking for a phone number => not Open then (personal data)
    submitted by /u/the_anonymizer [link] [comments]
    Awesome AI R&D content (with code!) on Computer Vision News of June 2022
    Dear all, Here is awesome AI R&D content (with code!) on Computer Vision News of June 2022. Many great articles (with videos) about AI, Deep Learning, Computer Vision and more... Review of award-winning CRAS2022 and ICLR2022 papers. HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 66. Enjoy! https://preview.redd.it/892fxez70d491.jpg?width=400&format=pjpg&auto=webp&s=92fd215861578e2c8082ba1c60d2643749eb36a5 submitted by /u/Gletta [link] [comments]
    Doctor Strange in the Multiverse of Madness - Neural-Art Parody [4K 60 FPS]
    submitted by /u/MLInsights [link] [comments]
    DALL-E Mini nailed it
    submitted by /u/OneFinding1429 [link] [comments]  ( 1 min )
    Love: A Powerful Force! - [4K 60 FPS] Computer Generated Art
    submitted by /u/MLInsights [link] [comments]  ( 1 min )
  • Open

    Integrate Amazon Lex and Uneeq’s digital human platform
    In today’s digital landscape, customers are expecting a high-quality experience that is responsive and delightful. Chatbots and virtual assistants have transformed the customer experience from a point-and-click or a drag-and-drop experience to one that is driven by voice or text. You can create a more engaging experience by further augmenting the interaction with a visual […]  ( 6 min )
    Easily create and store features in Amazon SageMaker without code
    Data scientists and machine learning (ML) engineers often prepare their data before building ML models. Data preparation typically includes data preprocessing and feature engineering. You preprocess data by transforming data into the right shape and quality for training, and you engineer features by selecting, transforming, and creating variables when building a predictive model. Amazon SageMaker […]  ( 9 min )
  • Open

    New Photonics AI Chip Processes & Classifies 2 Billion Images Per Second Without Using Memory Device
    submitted by /u/tohelpyou88 [link] [comments]
    This cheat sheet provides you with six steps that you can go through to make neural networks in Python with the Keras library.
    submitted by /u/joanna58 [link] [comments]
  • Open

    Infinite periodic table
    All the chemical elements discovered or created so far follow a regular pattern in how their electrons are arranged: the nth shell contains up to 2n – 1 suborbitals that each contain up to two electrons. For a given atomic number, you can determine how its electrons are distributed into shells and suborbitals using the […] Infinite periodic table first appeared on John D. Cook.  ( 2 min )
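The counting behind this pattern can be written out with standard quantum numbers, which is an assumption about how to read the truncated excerpt: a subshell with angular quantum number l has 2l + 1 orbitals, each holding two electrons, and shell n contains the subshells l = 0, ..., n − 1. The function names below are illustrative.

```python
def subshell_capacity(l):
    """A subshell with angular quantum number l has 2l + 1 orbitals, two electrons each."""
    return 2 * (2 * l + 1)

def shell_capacity(n):
    """Shell n contains subshells l = 0 .. n-1, for a total of 2 * n**2 electrons."""
    return sum(subshell_capacity(l) for l in range(n))

capacities = [shell_capacity(n) for n in range(1, 5)]
```

Nothing in these formulas stops at the elements known today, which is what makes an "infinite" extension of the table well defined on paper.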
  • Open

    Stunning Insights from James Webb Space Telescope Are Coming, Thanks to GPU-Powered Deep Learning
    NVIDIA GPUs will play a key role interpreting data streaming in from the James Webb Space Telescope, with NASA preparing to release next month the first full-color images from the $10 billion scientific instrument. The telescope’s iconic array of 18 interlocking hexagonal mirrors, which span a total of 21 feet 4 inches, will be able […] The post Stunning Insights from James Webb Space Telescope Are Coming, Thanks to GPU-Powered Deep Learning appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    DSC Weekly 7 June 2022
    Announcements Building a successful data architecture strategy continues to challenge businesses as data management growth and innovation continues through 2022. Discover the blueprint for managing data by joining the Data Architecture & Engineering summit and get ahead with the latest technologies to remain competitive. Companies must effectively manage hybrid cloud operations to manage risk and leverage its… The post DSC Weekly 7 June 2022 appeared first on Data Science Central.  ( 7 min )
  • Open

    Parotid Gland MRI Segmentation Based on Swin-Unet and Multimodal Images. (arXiv:2206.03336v1 [eess.IV])
    Parotid gland tumors account for approximately 2% to 10% of head and neck tumors. Preoperative tumor localization, differential diagnosis, and subsequent selection of appropriate treatment for parotid gland tumors are critical. However, the relative rarity of these tumors and the highly dispersed tissue types have left an unmet need for a subtle differential diagnosis of such neoplastic lesions based on preoperative radiomics. Recently, deep learning methods have developed rapidly; in particular, the Transformer has outperformed the traditional convolutional neural network in computer vision, and many new Transformer-based networks have been proposed for computer vision tasks. In this study, multicenter multimodal parotid gland MRI images were collected, and the Transformer-based Swin-Unet was used. MRI images of the STIR, T1 and T2 modalities were combined into three-channel data to train the network. We achieved segmentation of the region of interest for the parotid gland and tumor. The DSC of the model on the test set was 88.63%, MPA was 99.31%, MIoU was 83.99%, and HD was 3.04. A series of comparison experiments was then designed to further validate the segmentation performance of the algorithm.  ( 2 min )
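For readers unfamiliar with the overlap metrics, here is a minimal sketch of how two of them are typically computed on binary masks, assuming DSC denotes the Dice similarity coefficient and IoU the intersection over union (MIoU would average IoU over classes). The flat-list mask representation is a simplification for illustration and is not taken from the paper.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    intersection = sum(p and t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:  # both masks empty: treat as a perfect match
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    return dice, intersection / union

pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

Dice is never lower than IoU for the same pair of masks, which is consistent with the reported DSC exceeding the reported MIoU.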
    Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse. (arXiv:2206.03126v1 [cs.LG])
    Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of if and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.  ( 2 min )
    Deep Learning-based FEA surrogate for sub-sea pressure vessel. (arXiv:2206.03322v1 [cs.LG])
    During the design process of an autonomous underwater vehicle (AUV), the pressure vessel has a critical role. The pressure vessel contains dry electronics, power sources, and other sensors that cannot be flooded. A traditional design approach for a pressure vessel involves running multiple Finite Element Analysis (FEA) based simulations and optimizing the design to find the most suitable one that meets the requirements. Running these FEAs is computationally very costly for any optimization process, and it becomes difficult to run even hundreds of evaluations. In such a case, a better approach is surrogate design, with the goal of replacing FEA-based prediction with a learning-based regressor. Once the surrogate is trained for a class of problems, the learned response surface can be used to analyze the stress effect without running the FEA for that class of problems. The challenge of creating a surrogate for a class of problems is data generation. Since the process is computationally costly, it is not possible to densely sample the design space, and learning the response surface on a sparse data set becomes difficult. During experimentation, we observed that a Deep Learning-based surrogate outperforms other regression models on such sparse data. In the present work, we use the Deep Learning-based model to replace the costly finite element analysis-based simulation process. With the surrogate, prediction for other designs is much faster than direct Finite Element Analysis. We also compared our DL-based surrogate with classical Machine Learning (ML) based regression models (random forest and gradient boosting regressors). We observed that on the sparser data, the DL-based surrogate performs much better than the other regression models.  ( 2 min )
    On Recoverability of Graph Neural Network Representations. (arXiv:2201.12843v2 [cs.LG] UPDATED)
    Despite their growing popularity, graph neural networks (GNNs) still have multiple unsolved problems, including lack of embedding expressiveness, propagation of information to distant nodes, and training on large-scale graphs. Understanding the roots of and providing solutions for such problems requires developing analytic tools and techniques. In this work, we propose the notion of recoverability, which measures how much information one random variable contains for recovering another one from it. We provide a method for an efficient empirical estimation of recoverability, demonstrate its tight relationship to information aggregation in GNNs, and show how this new concept can be used in unsupervised graph representation learning. We demonstrate, through extensive experimental results on various datasets and different GNN architectures, that estimated recoverability correlates with aggregation method expressivity and graph sparsification quality, that GNN representations can be learned using our unsupervised approach, and that recoverability regularization can mitigate the accuracy drop caused by increasing GNN depth. The code to reproduce our experiments is available at https://github.com/Anonymous1252022/Recoverability  ( 2 min )
    Accurate Virus Identification with Interpretable Raman Signatures by Machine Learning. (arXiv:2206.02788v1 [q-bio.QM])
    Rapid identification of newly emerging or circulating viruses is an important first step toward managing the public health response to potential outbreaks. A portable virus capture device coupled with label-free Raman Spectroscopy holds the promise of fast detection by rapidly obtaining the Raman signature of a virus followed by a machine learning approach applied to recognize the virus based on its Raman spectrum, which is used as a fingerprint. We present such a machine learning approach for analyzing Raman spectra of human and avian viruses. A Convolutional Neural Network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks. In particular, it achieves 99% accuracy for classifying influenza virus type A vs. type B, 96% accuracy for classifying four subtypes of influenza A, 95% accuracy for differentiating enveloped and non-enveloped viruses, and 99% accuracy for differentiating avian coronavirus (infectious bronchitis virus, IBV) from other avian viruses. Furthermore, interpretation of neural net responses in the trained CNN model using a full-gradient algorithm highlights Raman spectral ranges that are most important to virus identification. By correlating ML-selected salient Raman ranges with the signature ranges of known biomolecules and chemical functional groups (for example, amide, amino acid, carboxylic acid), we verify that our ML model effectively recognizes the Raman signatures of proteins, lipids and other vital functional groups present in different viruses and uses a weighted combination of these signatures to identify viruses.  ( 3 min )
    Look Back When Surprised: Stabilizing Reverse Experience Replay for Neural Approximation. (arXiv:2206.03171v1 [cs.LG])
    Experience replay methods, which are an essential part of reinforcement learning (RL) algorithms, are designed to mitigate spurious correlations and biases while learning from temporally dependent data. Roughly speaking, these methods allow us to draw batched data from a large buffer such that these temporal correlations do not hinder the performance of descent algorithms. In this experimental work, we consider the recently developed and theoretically rigorous reverse experience replay (RER), which has been shown to remove such spurious biases in simplified theoretical settings. We combine RER with optimistic experience replay (OER) to obtain RER++, which is stable under neural function approximation. We show via experiments that this performs better than techniques like prioritized experience replay (PER) on various tasks, with a significantly smaller computational complexity. It is well known in the RL literature that choosing examples greedily with the largest TD error (as in OER) or forming mini-batches with consecutive data points (as in RER) leads to poor performance. However, our method, which combines these techniques, works very well.  ( 2 min )
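The abstract does not spell out how RER++ combines the two ideas. One plausible reading, offered only as a sketch, is to pick the transition with the largest absolute TD error as a pivot (the OER ingredient) and then return it together with its immediate predecessors in reverse temporal order (the RER ingredient). The `lookback_batch` name and the list-based buffer are hypothetical.

```python
def lookback_batch(buffer, td_errors, batch_size):
    """Pick the most surprising transition and return it with its predecessors,
    newest first (reverse temporal order)."""
    pivot = max(range(len(buffer)), key=lambda i: abs(td_errors[i]))
    start = max(0, pivot - batch_size + 1)
    return buffer[start:pivot + 1][::-1]

buffer = ["t0", "t1", "t2", "t3", "t4", "t5"]   # transitions in time order
td_errors = [0.1, 0.3, 0.2, 0.1, 2.5, 0.4]      # hypothetical TD errors
batch = lookback_batch(buffer, td_errors, batch_size=3)
```

Under this reading, neither greedy selection nor consecutive sampling is used alone: the TD error only chooses where to start, and the batch itself is a short reversed window.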
    Cycle-Consistent World Models for Domain Independent Latent Imagination. (arXiv:2110.00808v2 [cs.LG] UPDATED)
    End-to-end autonomous driving seeks to solve the perception, decision, and control problems in an integrated way, which can generalize more easily at scale and adapt better to new scenarios. However, high costs and risks make it very hard to train autonomous cars in the real world. Simulations can therefore be a powerful tool to enable training. Due to slightly different observations, agents trained and evaluated solely in simulation often perform well there but have difficulties in real-world environments. To tackle this problem, we propose a novel model-based reinforcement learning approach called Cycle-consistent World Models (CCWM). Contrary to related approaches, our model can embed two modalities in a shared latent space and thereby learn from samples in one modality (e.g., simulated data) and be used for inference in a different domain (e.g., real-world data). Our experiments using different modalities in the CARLA simulator showed that this enables CCWM to outperform state-of-the-art domain adaptation approaches. Furthermore, we show that CCWM can decode a given latent representation into semantically coherent observations in both modalities.  ( 2 min )
    Mean Estimation in High-Dimensional Binary Markov Gaussian Mixture Models. (arXiv:2206.02455v2 [math.ST] UPDATED)
    We consider a high-dimensional mean estimation problem over a binary hidden Markov model, which illuminates the interplay between memory in data, sample size, dimension, and signal strength in statistical inference. In this model, an estimator observes $n$ samples of a $d$-dimensional parameter vector $\theta_{*}\in\mathbb{R}^{d}$, multiplied by a random sign $S_i$ ($1\le i\le n$), and corrupted by isotropic standard Gaussian noise. The sequence of signs $\{S_{i}\}_{i\in[n]}\in\{-1,1\}^{n}$ is drawn from a stationary homogeneous Markov chain with flip probability $\delta\in[0,1/2]$. As $\delta$ varies, this model smoothly interpolates between two well-studied models: the Gaussian Location Model, for which $\delta=0$, and the Gaussian Mixture Model, for which $\delta=1/2$. Assuming that the estimator knows $\delta$, we establish a nearly minimax optimal (up to logarithmic factors) estimation error rate, as a function of $\|\theta_{*}\|,\delta,d,n$. We then provide an upper bound for the case of estimating $\delta$, assuming (possibly inaccurate) knowledge of $\theta_{*}$. The bound is proved to be tight when $\theta_{*}$ is an accurately known constant. These results are then combined into an algorithm which estimates $\theta_{*}$ with $\delta$ unknown a priori, and theoretical guarantees on its error are stated.  ( 2 min )
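The observation model in the abstract, $X_i = S_i\theta_{*} + Z_i$ with Markov signs $S_i$ and standard Gaussian noise $Z_i$, is easy to simulate. The sketch below is a data generator only, not the paper's estimator; the function names and the fixed seed are illustrative choices.

```python
import random

def sample_markov_signs(n, delta, rng):
    """Signs S_1..S_n in {-1, +1}: uniform start, then flip with probability delta."""
    signs = [rng.choice((-1, 1))]
    for _ in range(n - 1):
        flip = rng.random() < delta
        signs.append(-signs[-1] if flip else signs[-1])
    return signs

def sample_observations(theta, n, delta, seed=0):
    """X_i = S_i * theta + Z_i with Z_i standard Gaussian in each coordinate."""
    rng = random.Random(seed)
    signs = sample_markov_signs(n, delta, rng)
    xs = [[s * t + rng.gauss(0.0, 1.0) for t in theta] for s in signs]
    return signs, xs

# delta = 0: the sign never flips, so all samples share one sign.
signs, xs = sample_observations(theta=[1.0, -2.0, 0.5], n=8, delta=0.0)
```

Setting `delta=0.5` instead makes each sign independent of the previous one, which gives the symmetric two-component mixture mentioned in the abstract.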
    A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning. (arXiv:2204.07492v2 [physics.ao-ph] UPDATED)
    Recently, the use of machine learning in meteorology has increased greatly. While many machine learning methods are not new, university classes on machine learning are largely unavailable to meteorology students and are not required to become a meteorologist. The lack of formal instruction has contributed to the perception that machine learning methods are 'black boxes', and thus end-users are hesitant to apply machine learning methods in their everyday workflow. To reduce the opaqueness of machine learning methods and lower hesitancy towards machine learning in meteorology, this paper provides a survey of some of the most common machine learning methods. A familiar meteorological example is used to contextualize the machine learning methods while also discussing machine learning topics using plain language. The following machine learning methods are demonstrated: linear regression; logistic regression; decision trees; random forest; gradient boosted decision trees; naive Bayes; and support vector machines. Beyond discussing the different methods, the paper also contains discussions on the general machine learning process as well as best practices to enable readers to apply machine learning to their own datasets. Furthermore, all code (in the form of Jupyter notebooks and Google Colaboratory notebooks) used to make the examples in the paper is provided in an effort to catalyse the use of machine learning in meteorology.  ( 2 min )
    Forecasting COVID-19 cases using Statistical Models and Ontology-based Semantic Modelling: A real time data analytics approach. (arXiv:2206.02795v1 [q-bio.PE])
    SARS-CoV-2 is the most prominent issue many countries face today. The frequent changes in infections, recoveries and deaths reflect the dynamic nature of this pandemic. It is crucial to predict the spreading rate of this virus in order to make accurate decisions about preventing infection and about tracking and controlling virus transmission in the community. We develop a prediction model using statistical time series models such as SARIMA and FBProphet to monitor the daily active, recovered and death cases of COVID-19 accurately. Then, using various details of each individual patient (such as height, weight and gender), we design a set of rules using the Semantic Web Rule Language and some mathematical models for dealing with COVID-19 infected cases on an individual basis. After combining all the models, a COVID-19 ontology is developed, and various SPARQL queries are run on it to accumulate the risk factors and provide appropriate diagnosis, precautions and preventive suggestions for COVID patients. Comparing the performance of SARIMA and FBProphet, we observe that the SARIMA model performs better in forecasting COVID cases. For individual case prediction, approximately 497 individual samples were tested and classified into five COVID classes: Having COVID, No COVID, High Risk COVID case, Medium to High Risk case, and Control needed case.  ( 2 min )
    Future Artificial Intelligence tools and perspectives in medicine. (arXiv:2206.03289v1 [cs.LG])
    Purpose of review: Artificial intelligence (AI) has become popular in medical applications, specifically as a clinical support tool for computer-aided diagnosis. These tools are typically employed on medical data (e.g., images, molecular data, clinical variables) and use statistical and machine learning methods to measure the model performance. In this review, we summarize and discuss the most recent radiomic pipelines used for clinical analysis. Recent findings: Currently, limited management of cancers benefits from artificial intelligence, mostly related to computer-aided diagnosis that avoids a biopsy analysis, which presents additional risks and costs. Most AI tools are based on imaging features, known as radiomic analysis, that can be refined into predictive models from non-invasively acquired imaging data. This review explores the progress of AI-based radiomic tools for clinical applications with a brief description of the necessary technical steps. Describing new radiomic approaches based on deep learning techniques shows how the new radiomic models (deep radiomic analysis) can benefit from deep convolutional neural networks and be applied to limited data sets. Summary: To consider the radiomic algorithms, further investigations are recommended to involve deep learning in radiomic models with additional validation steps on various cancer types.
    Beyond Faithfulness: A Framework to Characterize and Compare Saliency Methods. (arXiv:2206.02958v1 [cs.LG])
    Saliency methods calculate how important each input feature is to a machine learning model's prediction, and are commonly used to understand model reasoning. "Faithfulness", or how fully and accurately the saliency output reflects the underlying model, is an oft-cited desideratum for these methods. However, explanation methods must necessarily sacrifice certain information in service of user-oriented goals such as simplicity. To that end, and akin to performance metrics, we frame saliency methods as abstractions: individual tools that provide insight into specific aspects of model behavior and entail tradeoffs. Using this framing, we describe a framework of nine dimensions to characterize and compare the properties of saliency methods. We group these dimensions into three categories that map to different phases of the interpretation process: methodology, or how the saliency is calculated; sensitivity, or relationships between the saliency result and the underlying model or input; and, perceptibility, or how a user interprets the result. As we show, these dimensions give us a granular vocabulary for describing and comparing saliency methods -- for instance, allowing us to develop "saliency cards" as a form of documentation, or helping downstream users understand tradeoffs and choose a method for a particular use case. Moreover, by situating existing saliency methods within this framework, we identify opportunities for future work, including filling gaps in the landscape and developing new evaluation metrics.
    UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder. (arXiv:2206.02512v2 [eess.AS] UPDATED)
    In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for the system development. Specifically, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts the FA to the unsupervised alignment (UA). Finally, a Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE), serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
    On the balance between the training time and interpretability of neural ODE for time series modelling. (arXiv:2206.03304v1 [cs.LG])
    Most machine learning methods are used as a black box for modelling. We may try to extract some knowledge from physics-based training methods, such as neural ODE (ordinary differential equation). Neural ODE has advantages such as a possibly higher class of represented functions, extended interpretability compared to black-box machine learning models, and the ability to describe both trend and local behaviour. Such advantages are especially critical for time series with complicated trends. However, the known drawback is the high training time compared to the autoregressive models and long short-term memory (LSTM) networks widely used for data-driven time series modelling. Therefore, we should be able to balance interpretability and training time to apply neural ODE in practice. The paper shows that modern neural ODE cannot be reduced to simpler models for time-series modelling applications. The complexity of neural ODE is comparable to or exceeds that of conventional time-series modelling tools. The only interpretation that could be extracted is the eigenspace of the operator, which is an ill-posed problem for a large system. Spectra could be extracted using different classical analysis methods that do not have the drawback of extended time. Consequently, we reduce the neural ODE to a simpler linear form and propose a new view on time-series modelling using combined neural networks and an ODE system approach.
    Hierarchical Graph-Convolutional Variational AutoEncoding for Generative Modelling of Human Motion. (arXiv:2111.12602v4 [cs.CV] UPDATED)
    Models of human motion commonly focus either on trajectory prediction or action classification but rarely both. The marked heterogeneity and intricate compositionality of human motion render each task vulnerable to the data degradation and distributional shift common to real-world scenarios. A sufficiently expressive generative model of action could in theory enable data conditioning and distributional resilience within a unified framework applicable to both tasks. Here we propose a novel architecture based on hierarchical variational autoencoders and deep graph convolutional neural networks for generating a holistic model of action over multiple time-scales. We show this Hierarchical Graph-convolutional Variational Autoencoder (HG-VAE) to be capable of generating coherent actions, detecting out-of-distribution data, and imputing missing data by gradient ascent on the model's posterior. Trained and evaluated on H3.6M and the largest collection of open source human motion data, AMASS, we show HG-VAE can facilitate downstream discriminative learning better than baseline models.
    Time-series image denoising of pressure-sensitive paint data by projected multivariate singular spectrum analysis. (arXiv:2203.07574v2 [eess.IV] UPDATED)
    Time-series data, such as unsteady pressure-sensitive paint (PSP) measurement data, may contain a significant amount of random noise. Thus, in this study, we investigated a noise-reduction method that combines multivariate singular spectrum analysis (MSSA) with low-dimensional data representation. MSSA is a state-space reconstruction technique that utilizes time-delay embedding, and the low-dimensional representation is achieved by projecting data onto the singular value decomposition (SVD) basis. The noise-reduction performance of the proposed method for unsteady PSP data, i.e., the projected MSSA, is compared with that of the truncated SVD method, one of the most widely used noise-reduction methods. The result shows that the projected MSSA exhibits better performance in reducing random noise than the truncated SVD method. Additionally, in contrast to that of the truncated SVD method, the performance of the projected MSSA is less sensitive to the truncation rank. Furthermore, the projected MSSA achieves denoising effectively by extracting smooth trajectories in a state space from noisy input data. The projected MSSA is expected to be effective for reducing random noise not only in PSP measurement data, but also in various high-dimensional time-series data.
    Combining physics-based and data-driven techniques for reliable hybrid analysis and modeling using the corrective source term approach. (arXiv:2206.03451v1 [cs.LG])
    Upcoming technologies like digital twins, autonomous, and artificial intelligent systems involving safety-critical applications require models which are accurate, interpretable, computationally efficient, and generalizable. Unfortunately, the two most commonly used modeling approaches, physics-based modeling (PBM) and data-driven modeling (DDM), fail to satisfy all these requirements. In the current work, we demonstrate how a hybrid approach combining the best of PBM and DDM can result in models which can outperform them both. We do so by combining partial differential equations based on first principles describing partially known physics with a black box DDM, in this case, a deep neural network model compensating for the unknown physics. First, we present a mathematical argument for why this approach should work and then apply the hybrid approach to model a two-dimensional heat diffusion problem with an unknown source term. The result demonstrates the method's superior performance in terms of accuracy and generalizability. Additionally, it is shown how the DDM part can be interpreted within the hybrid framework to make the overall approach reliable.
    Robust Adversarial Attacks Detection based on Explainable Deep Reinforcement Learning For UAV Guidance and Planning. (arXiv:2206.02670v2 [cs.LG] UPDATED)
    The danger of adversarial attacks to unprotected Uncrewed Aerial Vehicle (UAV) agents operating in public is growing. Adopting AI-based techniques, and more specifically Deep Learning (DL) approaches, to control and guide these UAVs can be beneficial in terms of performance but raises concerns regarding the safety of those techniques and their vulnerability to adversarial attacks, which increase the chance of collisions as the agent becomes confused. This paper proposes an innovative approach based on the explainability of DL methods to build an efficient detector that will protect these DL schemes, and thus the UAVs adopting them, from potential attacks. The agent adopts a Deep Reinforcement Learning (DRL) scheme for guidance and planning. It is formed and trained with a Deep Deterministic Policy Gradient (DDPG) with Prioritised Experience Replay (PER) DRL scheme that utilises Artificial Potential Field (APF) to improve training times and obstacle avoidance performance. The adversarial attacks are generated by the Fast Gradient Sign Method (FGSM) and Basic Iterative Method (BIM) algorithms and reduce obstacle course completion rates from 80% to 35%. A Realistic Synthetic environment for UAV explainable DRL based planning and guidance including obstacles and adversarial attacks is built. Two adversarial attack detectors are proposed. The first one adopts a Convolutional Neural Network (CNN) architecture and achieves a detection accuracy of 80%. The second detector is developed based on a Long Short Term Memory (LSTM) network and achieves an accuracy of 91% with much faster computing times when compared to the CNN based detector.
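    As a rough illustration of the Fast Gradient Sign Method named in the abstract above, the following minimal sketch perturbs the input of a toy logistic-regression classifier by one signed-gradient step. The model, its weights, the sample, and the `eps` value are hypothetical and chosen only for illustration; the paper attacks a DRL agent, not this classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y):
    # Binary cross-entropy of a logistic-regression model on one sample.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, b, x, y, eps):
    # For logistic regression, d(loss)/dx = (p - y) * w.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    # One step of size eps in the direction that increases the loss.
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and sample (assumptions, not from the paper).
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, -0.3, 0.8], 1
x_adv = fgsm(w, b, x, y, eps=0.2)
print(bce_loss(w, b, x, y), bce_loss(w, b, x_adv, y))
```

    The Basic Iterative Method can be viewed as repeating this step several times with a smaller step size while keeping the total perturbation within the `eps` bound.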
    An Embedding of ReLU Networks and an Analysis of their Identifiability. (arXiv:2107.09370v5 [cs.LG] UPDATED)
    Neural networks with the Rectified Linear Unit (ReLU) nonlinearity are described by a vector of parameters $\theta$, and realized as a piecewise linear continuous function $R_{\theta}: x \in \mathbb R^{d} \mapsto R_{\theta}(x) \in \mathbb R^{k}$. Natural scaling and permutation operations on the parameters $\theta$ leave the realization unchanged, leading to equivalence classes of parameters that yield the same realization. These considerations in turn lead to the notion of identifiability -- the ability to recover (the equivalence class of) $\theta$ from the sole knowledge of its realization $R_{\theta}$. The overall objective of this paper is to introduce an embedding for ReLU neural networks of any depth, $\Phi(\theta)$, that is invariant to scalings and that provides a locally linear parameterization of the realization of the network. Leveraging these two key properties, we derive some conditions under which a deep ReLU network is indeed locally identifiable from the knowledge of the realization on a finite set of samples $x_{i} \in \mathbb R^{d}$. We study the shallow case in more depth, establishing necessary and sufficient conditions for the network to be identifiable from a bounded subset $\mathcal X \subseteq \mathbb R^{d}$.
    Explaining the physics of transfer learning a data-driven subgrid-scale closure to a different turbulent flow. (arXiv:2206.03198v1 [physics.flu-dyn])
    Transfer learning (TL) is becoming a powerful tool in scientific applications of neural networks (NNs), such as weather/climate prediction and turbulence modeling. TL enables out-of-distribution generalization (e.g., extrapolation in parameters) and effective blending of disparate training sets (e.g., simulations and observations). In TL, selected layers of a NN, already trained for a base system, are re-trained using a small dataset from a target system. For effective TL, we need to know 1) what are the best layers to re-train? and 2) what physics are learned during TL? Here, we present novel analyses and a new framework to address (1)-(2) for a broad range of multi-scale, nonlinear systems. Our approach combines spectral analyses of the systems' data with spectral analyses of convolutional NN's activations and kernels, explaining the inner-workings of TL in terms of the system's nonlinear physics. Using subgrid-scale modeling of several setups of 2D turbulence as test cases, we show that the learned kernels are combinations of low-, band-, and high-pass filters, and that TL learns new filters whose nature is consistent with the spectral differences of base and target systems. We also find the shallowest layers are the best to re-train in these cases, which is against the common wisdom guiding TL in machine learning literature. Our framework identifies the best layer(s) to re-train beforehand, based on physics and NN theory. Together, these analyses explain the physics learned in TL and provide a framework to guide TL for wide-ranging applications in science and engineering, such as climate change modeling.
    Neural Network Decoders for Permutation Codes Correcting Different Errors. (arXiv:2206.03315v1 [cs.IT])
    Permutation codes were extensively studied in order to correct different types of errors for the applications on power line communication and rank modulation for flash memory. In this paper, we introduce the neural network decoders for permutation codes to correct these errors with one-shot decoding, which treat the decoding as $n$ classification tasks for non-binary symbols for a code of length $n$. These are actually the first general decoders introduced to deal with any error type for these two applications. The performance of the decoders is evaluated by simulations with different error models.
    Utility of Equivariant Message Passing in Cortical Mesh Segmentation. (arXiv:2206.03164v1 [cs.CV])
    The automated segmentation of cortical areas has been a long-standing challenge in medical image analysis. The complex geometry of the cortex is commonly represented as a polygon mesh, whose segmentation can be addressed by graph-based learning methods. When cortical meshes are misaligned across subjects, current methods produce significantly worse segmentation results, limiting their ability to handle multi-domain data. In this paper, we investigate the utility of E(n)-equivariant graph neural networks (EGNNs), comparing their performance against plain graph neural networks (GNNs). Our evaluation shows that GNNs outperform EGNNs on aligned meshes, due to their ability to leverage the presence of a global coordinate system. On misaligned meshes, the performance of plain GNNs drops considerably, while E(n)-equivariant message passing maintains the same segmentation results. The best results can also be obtained by using plain GNNs on realigned data (co-registered meshes in a global coordinate system).
    Unstructured Handwashing Recognition using Smartwatch to Reduce Contact Transmission of Pathogens. (arXiv:2107.13405v4 [cs.LG] UPDATED)
    Current guidelines from the World Health Organization indicate that the SARS-CoV-2 coronavirus, which results in the novel coronavirus disease (COVID-19), is transmitted through respiratory droplets or by contact. Contact transmission occurs when contaminated hands touch the mucous membrane of the mouth, nose, or eyes, so hand hygiene is extremely important to prevent the spread of SARS-CoV-2 as well as of other pathogens. The vast proliferation of wearable devices, such as smartwatches, containing acceleration, rotation, and magnetic field sensors, together with modern artificial intelligence technologies, such as machine learning and more recently deep learning, allows the development of accurate applications for recognition and classification of human activities such as walking, climbing stairs, running, clapping, sitting, and sleeping. In this work, we evaluate the feasibility of a machine learning based system which, starting from inertial signals collected from wearable devices such as current smartwatches, recognizes when a subject is washing or rubbing their hands. Preliminary results, obtained over two different datasets, show a classification accuracy of about 95% and of about 94% for deep and standard learning techniques, respectively.
    ByteComp: Revisiting Gradient Compression in Distributed Training. (arXiv:2205.14465v2 [cs.LG] UPDATED)
    Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose ByteComp to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models to timeline tensor computation, communication, and compression to enable ByteComp to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and optimally offloads compression to CPUs. Experimental evaluations show that ByteComp can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the computational time needed to select the compression strategy is measured in milliseconds, and the selected strategy is only a few percent from optimal.
    Neuro-Symbolic Causal Language Planning with Commonsense Prompting. (arXiv:2206.02928v1 [cs.CL])
    Language planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Such procedural reasoning ability is essential for applications such as household robots and virtual assistants. Although language planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack deep-level commonsense knowledge in the real world. Previous methods require either manual exemplars or annotated programs to acquire such ability from LLMs. In contrast, this paper proposes Neuro-Symbolic Causal Language Planner (CLAP) that elicits procedural knowledge from the LLMs with commonsense-infused prompting. Pre-trained knowledge in LLMs is essentially an unobserved confounder that causes spurious correlations between tasks and action plans. Through the lens of a Structural Causal Model (SCM), we propose an effective strategy in CLAP to construct prompts as a causal intervention toward our SCM. Using graph sampling techniques and symbolic program executors, our strategy formalizes the structured causal prompts from commonsense knowledge bases. CLAP obtains state-of-the-art performance on WikiHow and RobotHow, achieving a relative improvement of 5.28% in human evaluations under the counterfactual setting. This indicates the superiority of CLAP in causal language planning semantically and sequentially.
    Building Robust Ensembles via Margin Boosting. (arXiv:2206.03362v1 [cs.LG])
    In the context of adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks, and as a result, has sub-optimal robustness. Consequently, an emerging line of work has focused on learning an ensemble of neural networks to defend against adversarial attacks. In this work, we take a principled approach towards building robust ensembles. We view this problem from the perspective of margin-boosting and develop an algorithm for learning an ensemble with maximum margin. Through extensive empirical evaluation on benchmark datasets, we show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion. An important byproduct of our work is a margin-maximizing cross-entropy (MCE) loss, which is a better alternative to the standard cross-entropy (CE) loss. Empirically, we show that replacing the CE loss in state-of-the-art adversarial training techniques with our MCE loss leads to significant performance improvement.
    Learning Backward Compatible Embeddings. (arXiv:2206.03040v1 [stat.ML])
    Embeddings, low-dimensional vector representation of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice. Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves the intended task performance similar to the embedding model that is solely optimized for the intended task.
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v2 [cs.LG] UPDATED)
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)
    Computational Doob's $h$-transforms for Online Filtering of Discretely Observed Diffusions. (arXiv:2206.03369v1 [stat.ML])
    This paper is concerned with online filtering of discretely observed nonlinear diffusion processes. Our approach is based on the fully adapted auxiliary particle filter, which involves Doob's $h$-transforms that are typically intractable. We propose a computational framework to approximate these $h$-transforms by solving the underlying backward Kolmogorov equations using nonlinear Feynman-Kac formulas and neural networks. The methodology allows one to train a locally optimal particle filter prior to the data-assimilation procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than the bootstrap particle filter in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large.
    Inferring Unfairness and Error from Population Statistics in Binary and Multiclass Classification. (arXiv:2206.03234v1 [cs.LG])
    We propose methods for making inferences on the fairness and accuracy of a given classifier, using only aggregate population statistics. This is necessary when it is impossible to obtain individual classification data, for instance when there is no access to the classifier or to a representative individual-level validation set. We study fairness with respect to the equalized odds criterion, which we generalize to multiclass classification. We propose a measure of unfairness with respect to this criterion, which quantifies the fraction of the population that is treated unfairly. We then show how inferences on the unfairness and error of a given classifier can be obtained using only aggregate label statistics such as the rate of prediction of each label in each sub-population, as well as the true rate of each label. We derive inference procedures for binary classifiers and for multiclass classifiers, for the case where confusion matrices in each sub-population are known, and for the significantly more challenging case where they are unknown. We report experiments on data sets representing diverse applications, which demonstrate the effectiveness and the wide range of possible uses of the proposed methodology.
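    To make the equalized odds criterion mentioned above concrete, here is a minimal sketch for the binary case where per-group confusion matrices are known. It reports the largest gap in true-positive or false-positive rate between groups, which is a common summary and is not the paper's own unfairness measure. The group names and counts are hypothetical.

```python
def rates(tp, fp, fn, tn):
    # True-positive rate and false-positive rate from confusion counts.
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

def equalized_odds_gap(confusions):
    # Equalized odds holds exactly when TPR and FPR match across groups,
    # so the largest between-group difference measures the violation.
    tprs, fprs = zip(*(rates(**c) for c in confusions.values()))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Hypothetical per-group confusion counts (assumptions for illustration).
confusions = {
    "group_a": dict(tp=40, fp=10, fn=10, tn=40),
    "group_b": dict(tp=30, fp=5, fn=20, tn=45),
}
gap = equalized_odds_gap(confusions)
print(round(gap, 3))
```

    A gap of zero would mean the classifier satisfies equalized odds on these groups; the harder setting the abstract describes, where the confusion matrices are unknown, is not covered by this sketch.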
    On the Convergence of Clustered Federated Learning. (arXiv:2202.06187v2 [cs.LG] UPDATED)
    Knowledge sharing and model personalization are essential components to tackle the non-IID challenge in federated learning (FL). Most existing FL methods focus on two extremes: 1) to learn a shared model to serve all clients with non-IID data, and 2) to learn personalized models for each client, namely personalized FL. There is a trade-off solution, namely clustered FL or cluster-wise personalized FL, which aims to cluster similar clients into one cluster, and then learn a shared model for all clients within a cluster. This paper revisits research on clustered FL by formulating it as a bi-level optimization framework that could unify existing methods. We propose a new theoretical analysis framework to prove the convergence by considering the clusterability among clients. In addition, we embody this framework in an algorithm, named Weighted Clustered Federated Learning (WeCFL). Empirical analysis verifies the theoretical results and demonstrates the effectiveness of the proposed WeCFL under the proposed cluster-wise non-IID settings.
    Assessing Project-Level Fine-Tuning of ML4SE Models. (arXiv:2206.03333v1 [cs.SE])
    Machine Learning for Software Engineering (ML4SE) is an actively growing research area that focuses on methods that help programmers in their work. In order to apply the developed methods in practice, they need to achieve reasonable quality in order to help rather than distract developers. While the development of new approaches to code representation and data collection improves the overall quality of the models, it does not take into account the information that we can get from the project at hand. In this work, we investigate how the model's quality can be improved if we target a specific project. We develop a framework to assess quality improvements that models can get after fine-tuning for the method name prediction task on a particular project. We evaluate three models of different complexity and compare their quality in three settings: trained on a large dataset of Java projects, further fine-tuned on the data from a particular project, and trained from scratch on this data. We show that per-project fine-tuning can greatly improve the models' quality as they capture the project's domain and naming conventions. We open-source the tool we used for data collection, as well as the code to run the experiments: https://zenodo.org/record/6040745.
    Lottery Tickets with Nonzero Biases. (arXiv:2110.11150v2 [cs.LG] UPDATED)
    The strong lottery ticket hypothesis holds the promise that pruning randomly initialized deep neural networks could offer a computationally efficient alternative to deep learning with stochastic gradient descent. Common parameter initialization schemes and existence proofs, however, are focused on networks with zero biases, thus foregoing the potential universal approximation property of pruning. To fill this gap, we extend multiple initialization schemes and existence proofs to nonzero biases, including explicit 'looks-linear' approaches for ReLU activation functions. These do not only enable truly orthogonal parameter initialization but also reduce potential pruning errors. In experiments on standard benchmark data, we further highlight the practical benefits of nonzero bias initialization schemes, and present theoretically inspired extensions for state-of-the-art strong lottery ticket pruning.
    GAAF: Searching Activation Functions for Binary Neural Networks through Genetic Algorithm. (arXiv:2206.03291v1 [cs.NE])
    Binary neural networks (BNNs) show promising utilization in cost and power-restricted domains such as edge devices and mobile systems. This is due to their significantly lower computation and storage demands, which come at the cost of degraded performance. To close the accuracy gap, in this paper we propose to add a complementary activation function (AF) ahead of the sign based binarization, and rely on the genetic algorithm (GA) to automatically search for the ideal AFs. These AFs can help extract extra information from the input data in the forward pass, while allowing improved gradient approximation in the backward pass. Fifteen novel AFs are identified through our GA-based search, and most of them show improved performance (up to 2.54% on ImageNet) when tested on different datasets and network models. Our method offers a novel approach for designing general and application-specific BNN architecture. Our code is available at this http URL
    Adversarial Reprogramming Revisited. (arXiv:2206.03466v1 [cs.LG])
    Adversarial reprogramming, introduced by Elsayed, Goodfellow, and Sohl-Dickstein, seeks to repurpose a neural network to perform a different task, by manipulating its input without modifying its weights. We prove that two-layer ReLU neural networks with random weights can be adversarially reprogrammed to achieve arbitrarily high accuracy on Bernoulli data models over hypercube vertices, provided the network width is no greater than its input dimension. We also substantially strengthen a recent result of Phuong and Lampert on directional convergence of gradient flow, and obtain as a corollary that training two-layer ReLU neural networks on orthogonally separable datasets can cause their adversarial reprogramming to fail. We support these theoretical results by experiments that demonstrate that, as long as batch normalisation layers are suitably initialised, even untrained networks with random weights are susceptible to adversarial reprogramming. This is in contrast to observations in several recent works that suggested that adversarial reprogramming is not possible for untrained networks to any degree of reliability.
    Neural Lagrangian Schrödinger Bridge. (arXiv:2204.04853v3 [cs.LG] UPDATED)
    Population dynamics is the study of temporal and spatial variation in the size of populations of organisms and is a major part of population ecology. One of the main difficulties in analyzing population dynamics is that we can only obtain observation data with coarse time intervals from fixed-point observations due to experimental costs or measurement constraints. Recently, modeling population dynamics by using continuous normalizing flows (CNFs) and dynamic optimal transport has been proposed to infer the sample trajectories from a fixed-point observed population. While the sample behavior in CNFs is deterministic, the actual sample in biological systems moves in an essentially random yet directional manner. Moreover, when a sample moves from point A to point B in dynamical systems, its trajectory typically follows the principle of least action in which the corresponding action has the smallest possible value. To satisfy these requirements of the sample trajectories, we formulate the Lagrangian Schrödinger bridge (LSB) problem and propose to solve it approximately using neural SDE with regularization. We also develop a model architecture that enables faster computation. Experimental results show that the proposed method can efficiently approximate the population-level dynamics even for high-dimensional data and that using the prior knowledge introduced by the Lagrangian enables us to estimate the trajectories of individual samples with stochastic behavior.
    Improving the Diagnosis of Psychiatric Disorders with Self-Supervised Graph State Space Models. (arXiv:2206.03331v1 [cs.LG])
    Single subject prediction of brain disorders from neuroimaging data has gained increasing attention in recent years. Yet, for some heterogeneous disorders such as major depressive disorder (MDD) and autism spectrum disorder (ASD), the performance of prediction models on large-scale multi-site datasets remains poor. We present a two-stage framework to improve the diagnosis of heterogeneous psychiatric disorders from resting-state functional magnetic resonance imaging (rs-fMRI). First, we propose a self-supervised mask prediction task on data from healthy individuals that can exploit differences between healthy controls and patients in clinical datasets. Next, we train a supervised classifier on the learned discriminative representations. To model rs-fMRI data, we develop Graph-S4, an extension of the recently proposed state-space model S4 to graph settings where the underlying graph structure is not known in advance. We show that combining the framework and Graph-S4 can significantly improve the diagnostic performance of neuroimaging-based single subject prediction models of MDD and ASD on three open-source multi-center rs-fMRI clinical datasets.
    Learning in Observable POMDPs, without Computationally Intractable Oracles. (arXiv:2206.03446v1 [cs.LG])
    Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.
    On Efficient Approximate Queries over Machine Learning Models. (arXiv:2206.02845v1 [cs.DB])
    The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.
    Patch-based image Super Resolution using generalized Gaussian mixture model. (arXiv:2206.03069v1 [eess.IV])
    Single Image Super Resolution (SISR) methods aim to recover the clean images in high resolution from low resolution observations. A family of patch-based approaches has received considerable attention and development. The minimum mean square error (MMSE) method is a powerful image restoration method that uses a probability model on the patches of images. This paper proposes an algorithm to learn a joint generalized Gaussian mixture model (GGMM) from a pair of the low resolution patches and the corresponding high resolution patches from the reference data. We then reconstruct the high resolution image based on the MMSE method. Our numerical evaluations indicate that the MMSE-GGMM method competes with other state of the art methods.
    Improving Fairness in Graph Neural Networks via Mitigating Sensitive Attribute Leakage. (arXiv:2206.03426v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown great power in learning node representations on graphs. However, they may inherit historical prejudices from training data, leading to discriminatory bias in predictions. Although some work has developed fair GNNs, most of them directly borrow fair representation learning techniques from non-graph domains without considering the potential problem of sensitive attribute leakage caused by feature propagation in GNNs. We empirically observe that feature propagation could vary the correlation of previously innocuous non-sensitive features to the sensitive ones. This can be viewed as a leakage of sensitive information which could further exacerbate discrimination in predictions. Thus, we design two feature masking strategies according to feature correlations to highlight the importance of considering feature propagation and correlation variation in alleviating discrimination. Motivated by our analysis, we propose Fair View Graph Neural Network (FairVGNN) to generate fair views of features by automatically identifying and masking sensitive-correlated features considering correlation variation after feature propagation. Given the learned fair views, we adaptively clamp weights of the encoder to avoid using sensitive-related features. Experiments on real-world datasets demonstrate that FairVGNN enjoys a better trade-off between model utility and fairness. Our code is publicly available at https://github.com/YuWVandy/FairVGNN.
    Yet Another Representation of Binary Decision Trees: A Mathematical Demonstration. (arXiv:2101.07077v5 [cs.LG] UPDATED)
    A decision tree looks like a simple computational graph without cycles, where only the leaf nodes specify the output values and the non-terminals specify their tests or split conditions. From the numerical perspective, we express decision trees in the language of computational graphs. We explicitly parameterize the test phase, traversal phase and prediction phase of decision trees based on the bitvectors of non-terminal nodes. As shown later, the decision tree is a shallow binary network in some sense. In particular, we introduce the bitvector matrix to implement the tree traversal numerically, where the core is to convert the logical 'AND' operation to arithmetic operations. We apply this numerical representation to extend and unify diverse decision trees conceptually.
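    The idea of replacing a logical AND over node bitvectors with arithmetic can be illustrated with a small sketch. It uses one standard bitvector traversal scheme and is not necessarily the construction in the paper: each internal node stores a 0/1 vector over the leaves, with zeros at the leaves that become unreachable when its test fails, and the AND is computed as an elementwise product. The tree, thresholds, leaf labels, and the names `NODES`, `LEAF_VALUES`, and `predict` are all hypothetical.

```python
# Hypothetical depth-2 tree with 4 leaves (left to right: A, B, C, D).
# Each internal node: (feature index, threshold, bitvector over leaves).
# The bitvector has 0 at every leaf that is unreachable when the test
# "x[feature] <= threshold" is FALSE (i.e. the leaves of its left subtree).
NODES = [
    (0, 5.0, [0, 0, 1, 1]),  # root: a false test rules out leaves A and B
    (1, 2.0, [0, 1, 1, 1]),  # left child: a false test rules out leaf A
    (1, 7.0, [1, 1, 0, 1]),  # right child: a false test rules out leaf C
]
LEAF_VALUES = ["A", "B", "C", "D"]


def predict(x):
    """Evaluate all tests, combine bitvectors arithmetically, return the exit leaf."""
    mask = [1] * len(LEAF_VALUES)
    for feature, threshold, bits in NODES:
        test_true = 1 if x[feature] <= threshold else 0
        # Arithmetic AND: a true test multiplies by 1 (no change),
        # a false test multiplies by the node's bitvector.
        mask = [m * (test_true + (1 - test_true) * b) for m, b in zip(mask, bits)]
    # The exit leaf is the leftmost leaf that no false test has ruled out.
    return LEAF_VALUES[mask.index(1)]
```

    Every node is evaluated regardless of the path, so the traversal contains no data-dependent branching; that is the sense in which the tree behaves like a shallow network in this sketch.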
    Efficient entity-based reinforcement learning. (arXiv:2206.02855v1 [cs.LG])
    Recent deep reinforcement learning (DRL) successes rely on end-to-end learning from fixed-size observational inputs (e.g. image, state-variables). However, many challenging and interesting problems in decision making involve observations or intermediary representations which are best described as a set of entities: either the image-based approach would miss small but important details in the observations (e.g. objects on a radar, vehicles on satellite images, etc.), the number of sensed objects is not fixed (e.g. robotic manipulation), or the problem simply cannot be represented in a meaningful way as an image (e.g. power grid control, or logistics). This type of structured representation is not directly compatible with current DRL architectures; however, there has been an increase in machine learning techniques directly targeting structured information, potentially addressing this issue. We propose to combine recent advances in set representations with slot attention and graph neural networks to process structured data, broadening the range of applications of DRL algorithms. This approach makes it possible to address entity-based problems in an efficient and scalable way. We show that it can improve training time and robustness significantly, and demonstrate its potential to handle structured as well as purely visual domains, on multiple environments from the Atari Learning Environment and Simple Playgrounds.
    Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering. (arXiv:2206.03112v1 [cs.LG])
    The ecological validity of soundscape studies usually rests on a choice of soundscapes that are representative of the perceptual space under investigation. For example, a soundscape pleasantness study might investigate locations with soundscapes ranging from "pleasant" to "annoying". The choice of soundscapes is typically researcher-led, but a participant-led process can reduce selection bias and improve result reliability. Hence, we propose a robust participant-led method to pinpoint characteristic soundscapes possessing arbitrary perceptual attributes. We validate our method by identifying Singaporean soundscapes spanning the perceptual quadrants generated from the "Pleasantness" and "Eventfulness" axes of the ISO 12913-2 circumplex model of soundscape perception, as perceived by local experts. From memory and experience, 67 participants first selected locations corresponding to each perceptual quadrant in each major planning region of Singapore. We then performed weighted k-means clustering on the selected locations, with weights for each location derived from previous frequencies and durations spent in each location by each participant. Weights hence acted as proxies for participant confidence. In total, 62 locations were thereby identified as suitable locations with characteristic soundscapes for further research utilizing the ISO 12913-2 perceptual quadrants. Audio-visual recordings and acoustic characterization of the soundscapes will be made in a future study.
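    As a rough illustration of weighted k-means, the sketch below runs Lloyd-style iterations in which each centroid is the weighted mean of its assigned points, so heavily weighted locations pull the centroid harder. It is a minimal sketch under stated assumptions, not the study's pipeline: the function name `weighted_kmeans`, the naive initialization from the first k points, and the toy coordinates and weights are all hypothetical.

```python
def weighted_kmeans(points, weights, k, iters=50):
    """Lloyd-style k-means where centroids are weighted means of their members."""
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = [tuple(float(c) for c in p) for p in points[:k]]
    dim = len(points[0])
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: weighted mean of the points assigned to each cluster.
        new_centroids = []
        for j in range(k):
            members = [(p, w) for p, w, a in zip(points, weights, assign) if a == j]
            total = sum(w for _, w in members)
            if total == 0:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
            else:
                new_centroids.append(
                    tuple(sum(p[d] * w for p, w in members) / total for d in range(dim))
                )
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    return centroids, assign


# Toy usage: the first point carries three times the weight of the others.
centroids, assign = weighted_kmeans(
    points=[(0, 0), (10, 10), (2, 0), (10, 12)],
    weights=[3, 1, 1, 1],
    k=2,
)
```

    In the toy run the first cluster's centroid lands closer to the heavily weighted point than an unweighted mean would, which is the effect the weights are meant to have.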
    Efficient Machine Learning, Compilers, and Optimizations for Embedded Systems. (arXiv:2206.03326v1 [cs.LG])
    Deep Neural Networks (DNNs) have achieved great success in a massive number of artificial intelligence (AI) applications by delivering high-quality computer vision, natural language processing, and virtual reality applications. However, these emerging AI applications also come with increasing computation and memory demands, which are challenging to handle especially for the embedded systems where limited computation/memory resources, tight power budgets, and small form factors are demanded. Challenges also come from the diverse application-specific requirements, including real-time responses, high-throughput performance, and reliable inference accuracy. To address these challenges, we will introduce a series of effective design methods in this book chapter to enable efficient algorithms, compilers, and various optimizations for embedded systems.
    Plant 'n' Seek: Can You Find the Winning Ticket?. (arXiv:2111.11153v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis has sparked the rapid development of pruning algorithms that aim to reduce the computational costs associated with deep learning during training and model deployment. Currently, such algorithms are primarily evaluated on imaging data, for which we lack ground truth information and thus the understanding of how sparse lottery tickets could be. To fill this gap, we develop a framework that allows us to plant and hide winning tickets with desirable properties in randomly initialized neural networks. To analyze the ability of state-of-the-art pruning to identify tickets of extreme sparsity, we design and hide such tickets solving four challenging tasks. In extensive experiments, we observe similar trends as in imaging studies, indicating that our framework can provide transferable insights into realistic problems. Additionally, we can now see beyond such relative trends and highlight limitations of current pruning methods. Based on our results, we conclude that the current limitations in ticket sparsity are likely of algorithmic rather than fundamental nature. We anticipate that comparisons to planted tickets will facilitate future developments of efficient pruning algorithms.
    SubStrat: A Subset-Based Strategy for Faster AutoML. (arXiv:2206.03070v1 [cs.LG])
    Automated machine learning (AutoML) frameworks have become important tools in the data scientists' arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines - typically containing feature engineering, model selection and hyperparameter tuning steps - and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, and therefore the overall AutoML running times become increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size, rather than configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset which preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally, it refines the resulting pipeline by executing a restricted, much shorter, AutoML process on the large dataset. Our experimental results, performed on two popular AutoML frameworks, Auto-Sklearn and TPOT, show that SubStrat reduces their running times by 79% (on average), with less than 2% average loss in the accuracy of the resulting ML pipeline.
    Molecular Representation Learning via Heterogeneous Motif Graph Neural Networks. (arXiv:2202.00529v2 [cs.LG] UPDATED)
    We consider the feature representation learning problem for molecular graphs. Graph Neural Networks have been widely used in feature representation learning of molecular graphs. However, most existing methods deal with molecular graphs individually while neglecting their connections, such as motif-level relationships. We propose a novel molecular graph representation learning method by constructing a heterogeneous motif graph to address this issue. In particular, we build a heterogeneous motif graph that contains motif nodes and molecular nodes. Each motif node corresponds to a motif extracted from molecules. Then, we propose a Heterogeneous Motif Graph Neural Network (HM-GNN) to learn feature representations for each node in the heterogeneous motif graph. Our heterogeneous motif graph also enables effective multi-task learning, especially for small molecular datasets. To address the potential efficiency issue, we propose to use an edge sampler, which can significantly reduce computational resource usage. The experimental results show that our model consistently outperforms previous state-of-the-art models. Under multi-task settings, the promising performance of our methods on combined datasets sheds light on a new learning paradigm for small molecular datasets. Finally, we show that our model achieves similar performance with significantly fewer computational resources by using our edge sampler.
    Variable-rate hierarchical CPC leads to acoustic unit discovery in speech. (arXiv:2206.02211v2 [cs.SD] UPDATED)
    The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.
    FairVFL: A Fair Vertical Federated Learning Framework with Contrastive Adversarial Learning. (arXiv:2206.03200v1 [cs.LG])
    Vertical federated learning (VFL) is a privacy-preserving machine learning paradigm that can learn models from features distributed on different platforms in a privacy-preserving way. Since in real-world applications the data may contain bias on fairness-sensitive features (e.g., gender), VFL models may inherit bias from training data and become unfair for some user groups. However, existing fair ML methods usually rely on the centralized storage of fairness-sensitive features to achieve model fairness, which are usually inapplicable in federated scenarios. In this paper, we propose a fair vertical federated learning framework (FairVFL), which can improve the fairness of VFL models. The core idea of FairVFL is to learn unified and fair representations of samples based on the decentralized feature fields in a privacy-preserving way. Specifically, each platform with fairness-insensitive features first learns local data representations from local features. Then, these local representations are uploaded to a server and aggregated into a unified representation for the target task. In order to learn fair unified representations, we send them to each platform storing fairness-sensitive features and apply adversarial learning to remove bias from the unified representations inherited from the biased data. Moreover, for protecting user privacy, we further propose a contrastive adversarial learning method to remove privacy information from the unified representations in server before sending them to the platforms keeping fairness-sensitive features. Experiments on two real-world datasets validate that our method can effectively improve model fairness with user privacy well-protected.
    TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining. (arXiv:2110.13492v5 [cs.LG] UPDATED)
    We introduce a block-online variant of the temporal feature-wise linear modulation (TFiLM) model to achieve bandwidth extension. The proposed architecture simplifies the UNet backbone of the TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate performance degradation. We also utilize self-supervised pretraining and data augmentation to enhance the quality of bandwidth extended signals and reduce the sensitivity with respect to downsampling methods. Experiment results on the VCTK dataset show that the proposed method outperforms several recent baselines in both intrusive and non-intrusive metrics. Pretraining and filter augmentation also help stabilize and enhance the overall performance.
    Federated Spatial Reuse Optimization in Next-Generation Decentralized IEEE 802.11 WLANs. (arXiv:2203.10472v2 [cs.NI] UPDATED)
    As wireless standards evolve, more complex functionalities are introduced to address the increasing requirements in terms of throughput, latency, security, and efficiency. To unleash the potential of such new features, artificial intelligence (AI) and machine learning (ML) are currently being exploited for deriving models and protocols from data, rather than by hand-programming. In this paper, we explore the feasibility of applying ML in next-generation wireless local area networks (WLANs). More specifically, we focus on the IEEE 802.11ax spatial reuse (SR) problem and predict its performance through federated learning (FL) models. The set of FL solutions overviewed in this work is part of the 2021 International Telecommunication Union (ITU) AI for 5G Challenge.
    Reachability Constrained Reinforcement Learning. (arXiv:2205.07536v2 [cs.LG] UPDATED)
    Constrained reinforcement learning (CRL) has gained significant interest recently, since safety constraints satisfaction is critical for real-world problems. However, existing CRL methods constraining discounted cumulative costs generally lack rigorous definition and guarantee of safety. In contrast, in the safe control research, safety is defined as persistently satisfying certain state constraints. Such persistent safety is possible only on a subset of the state space, called feasible set, where an optimal largest feasible set exists for a given environment. Recent studies incorporate feasible sets into CRL with energy-based methods such as control barrier function (CBF), safety index (SI), and leverage prior conservative estimations of feasible sets, which harms the performance of the learned policy. To deal with this problem, this paper proposes the reachability CRL (RCRL) method by using reachability analysis to establish the novel self-consistency condition and characterize the feasible sets. The feasible sets are represented by the safety value function, which is used as the constraint in CRL. We use the multi-time scale stochastic approximation theory to prove that the proposed algorithm converges to a local optimum, where the largest feasible set can be guaranteed. Empirical results on different benchmarks validate the learned feasible set, the policy performance, and constraint satisfaction of RCRL, compared to CRL and safe control baselines.
    Harnessing spectral representations for subgraph alignment. (arXiv:2205.14938v2 [cs.LG] UPDATED)
    With the rise and advent of graph learning techniques, graph data has become ubiquitous. However, while several efforts are being devoted to the design of new convolutional architectures, pooling or positional encoding schemes, less effort is being spent on problems involving maps between (possibly very large) graphs, such as signal transfer, graph isomorphism and subgraph correspondence. With this paper, we anticipate the need for a convenient framework to deal with such problems, and focus in particular on the challenging subgraph alignment scenario. We claim that, first and foremost, the representation of a map plays a central role on how these problems should be modeled. Taking the hint from recent work in geometry processing, we propose the adoption of a spectral representation for maps that is compact, easy to compute, robust to topological changes, easy to plug into existing pipelines, and is especially effective for subgraph alignment problems. We report for the first time a surprising phenomenon where the partiality arising in the subgraph alignment task is manifested as a special structure of the map coefficients, even in the absence of exact subgraph isomorphism, and which is consistently observed over different families of graphs up to several thousand nodes.
    GradMax: Growing Neural Networks using Gradient Information. (arXiv:2201.05125v3 [cs.LG] UPDATED)
    The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
    Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next. (arXiv:2201.05624v4 [cs.LG] UPDATED)
    Physics-Informed Neural Networks (PINN) are neural networks (NNs) that encode model equations, like Partial Differential Equations (PDE), as a component of the neural network itself. PINNs are nowadays used to solve PDEs, fractional equations, integral-differential equations, and stochastic PDEs. This novel methodology has arisen as a multi-task learning framework in which a NN must fit observed data while reducing a PDE residual. This article provides a comprehensive review of the literature on PINNs. The primary goal of the study was to characterize these networks and their related advantages and disadvantages; the review also attempts to incorporate publications on a broader range of collocation-based physics informed neural networks, starting from the vanilla PINN and extending to many other variants, such as physics-constrained neural networks (PCNN), variational hp-VPINN, and conservative PINN (CPINN). The study indicates that most research has focused on customizing the PINN through different activation functions, gradient optimization techniques, neural network structures, and loss function structures. Although PINNs have been used in a wide range of applications and have proven more feasible in some contexts than classical numerical techniques like the Finite Element Method (FEM), advancements are still possible, most notably on theoretical issues that remain unresolved.
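    The "fit observed data while reducing a PDE residual" objective is commonly written as a two-term loss. The form below is a generic sketch rather than a formula taken from the review: the symbols $u_\theta$ (the network), $\mathcal{F}$ (the differential operator of a PDE written as $\mathcal{F}[u] = 0$), the data points $(x_i, u_i)$, the collocation points $\tilde{x}_j$, and the weight $\lambda$ are all notation assumed here for illustration.

```latex
\mathcal{L}(\theta)
  = \underbrace{\frac{1}{N_d} \sum_{i=1}^{N_d} \bigl| u_\theta(x_i) - u_i \bigr|^2}_{\text{data misfit}}
  \;+\; \lambda \,
    \underbrace{\frac{1}{N_r} \sum_{j=1}^{N_r} \bigl| \mathcal{F}[u_\theta](\tilde{x}_j) \bigr|^2}_{\text{PDE residual}}
```

    Boundary and initial conditions are often added as further penalty terms of the same shape; they are omitted here for brevity.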
    Consistency Regularization for Variational Auto-Encoders. (arXiv:2105.14859v2 [cs.LG] UPDATED)
    Variational auto-encoders (VAEs) are a powerful approach to unsupervised learning. They enable scalable approximate posterior inference in latent-variable models using variational inference (VI). A VAE posits a variational family parameterized by a deep neural network called an encoder that takes data as input. This encoder is shared across all the observations, which amortizes the cost of inference. However, the encoder of a VAE has the undesirable property that it maps a given observation and a semantics-preserving transformation of it to different latent representations. This "inconsistency" of the encoder lowers the quality of the learned representations, especially for downstream tasks, and also negatively affects generalization. In this paper, we propose a regularization method to enforce consistency in VAEs. The idea is to minimize the Kullback-Leibler (KL) divergence between the variational distribution when conditioning on the observation and the variational distribution when conditioning on a random semantics-preserving transformation of this observation. This regularization is applicable to any VAE. In our experiments we apply it to four different VAE variants on several benchmark datasets and find that it not only improves the quality of the learned representations but also leads to better generalization. In particular, when applied to the Nouveau Variational Auto-Encoder (NVAE), our regularization method yields state-of-the-art performance on MNIST and CIFAR-10. We also applied our method to 3D data and found it learns representations of superior quality as measured by accuracy on a downstream classification task.
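    To make the KL-based consistency term concrete, the sketch below evaluates the closed-form KL divergence between two diagonal Gaussians, standing in for the encoder's outputs on an observation and on its transformed copy. Several things here are assumptions rather than details from the abstract: that the variational family is a diagonal Gaussian, the direction of the KL, and the function name `kl_diag_gaussians` with its hypothetical inputs.

```python
import math


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ), summed over dimensions."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    )


# Hypothetical encoder outputs for an observation x and a transformed copy t(x).
# A consistency penalty of this kind would be added to the usual VAE objective,
# scaled by a weight chosen by the practitioner (an assumption of this sketch).
penalty_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
penalty_shifted = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
```

    The penalty vanishes when both inputs map to the same distribution and grows as the two encodings drift apart, which is the behavior a consistency regularizer needs.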
    CANShield: Signal-based Intrusion Detection for Controller Area Networks. (arXiv:2205.01306v3 [cs.CR] UPDATED)
    Modern vehicles rely on a fleet of electronic control units (ECUs) connected through controller area network (CAN) buses for critical vehicular control. However, with the expansion of advanced connectivity features in automobiles and the elevated risks of internal system exposure, the CAN bus is increasingly prone to intrusions and injection attacks. Ordinary injection attacks disrupt the typical timing properties of the CAN data stream, and rule-based intrusion detection systems (IDS) can easily detect them. However, advanced attackers can inject false data into the time-series sensory data (signal) while appearing innocuous in the pattern/frequency of the CAN messages. Such attacks can bypass the rule-based IDS or any anomaly-based IDS built on binary payload data. To make vehicles robust against such intelligent attacks, we propose CANShield, a signal-based intrusion detection framework for the CAN bus. CANShield consists of three modules: a data preprocessing module that handles the high-dimensional CAN data stream at the signal level and makes it suitable for a deep learning model; a data analyzer module consisting of multiple deep autoencoder (AE) networks, each analyzing the time-series data from a different temporal perspective; and finally an attack detection module that uses an ensemble method to make the final decision. Evaluation results on two high-fidelity signal-based CAN attack datasets show the high accuracy and responsiveness of CANShield in detecting a wide range of advanced intrusion attacks.
    DeepMTS: Deep Multi-task Learning for Survival Prediction in Patients with Advanced Nasopharyngeal Carcinoma using Pretreatment PET/CT. (arXiv:2109.07711v2 [eess.IV] UPDATED)
    Nasopharyngeal Carcinoma (NPC) is a malignant epithelial cancer arising from the nasopharynx. Survival prediction is a major concern for NPC patients, as it provides early prognostic information to plan treatments. Recently, deep survival models based on deep learning have demonstrated the potential to outperform traditional radiomics-based survival prediction models. Deep survival models usually use image patches covering the whole target regions (e.g., nasopharynx for NPC) or containing only segmented tumor regions as the input. However, the models using the whole target regions will also include non-relevant background information, while the models using segmented tumor regions will disregard potentially prognostic information existing out of primary tumors (e.g., local lymph node metastasis and adjacent tissue invasion). In this study, we propose a 3D end-to-end Deep Multi-Task Survival model (DeepMTS) for joint survival prediction and tumor segmentation in advanced NPC from pretreatment PET/CT. Our novelty is the introduction of a hard-sharing segmentation backbone to guide the extraction of local features related to the primary tumors, which reduces the interference from non-relevant background information. In addition, we also introduce a cascaded survival network to capture the prognostic information existing out of primary tumors and further leverage the global tumor information (e.g., tumor size, shape, and locations) derived from the segmentation backbone. Our experiments with two clinical datasets demonstrate that our DeepMTS can consistently outperform traditional radiomics-based survival prediction models and existing deep survival models.
    Learning in High-Dimensional Feature Spaces Using ANOVA-Based Fast Matrix-Vector Multiplication. (arXiv:2111.10140v2 [cs.LG] UPDATED)
    Kernel matrices are crucial in many learning tasks such as support vector machines or kernel ridge regression. The kernel matrix is typically dense and large-scale. Depending on the dimension of the feature space even the computation of all of its entries in reasonable time becomes a challenging task. For such dense matrices the cost of a matrix-vector product scales quadratically with the dimensionality N , if no customized methods are applied. We propose the use of an ANOVA kernel, where we construct several kernels based on lower-dimensional feature spaces for which we provide fast algorithms realizing the matrix-vector products. We employ the non-equispaced fast Fourier transform (NFFT), which is of linear complexity for fixed accuracy. Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method choosing kernel ridge regression and the conjugate gradient solver. We illustrate the performance of our approach on several data sets.
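    To make the feature-grouping idea concrete, here is a minimal sketch of a kernel built as a sum of low-dimensional kernels, together with the naive matrix-vector product whose quadratic cost the paper aims to avoid. The Gaussian sub-kernel, the `groups` layout, and all function names are assumptions for illustration; the NFFT-based fast product itself is not reproduced here.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel on two equally long feature lists."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def anova_kernel(x, y, groups, sigma=1.0):
    """Sum of Gaussian kernels, one per low-dimensional feature group.

    groups is a list of index lists, e.g. [[0, 1], [2]] (hypothetical layout).
    """
    return sum(
        gaussian_kernel([x[i] for i in g], [y[i] for i in g], sigma)
        for g in groups
    )


def kernel_matvec(points, v, groups, sigma=1.0):
    """Naive reference product K @ v; cost is quadratic in len(points)."""
    return [
        sum(anova_kernel(p, q, groups, sigma) * vj for q, vj in zip(points, v))
        for p in points
    ]
```

    A fast method would replace `kernel_matvec` while keeping the same input and output, so this naive version can serve as a correctness check on small data.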
    Identifiability of Causal-based Fairness Notions: A State of the Art. (arXiv:2203.05900v2 [cs.LG] UPDATED)
    Machine learning algorithms can produce biased outcomes/predictions, typically against minorities and under-represented sub-populations. Therefore, fairness is emerging as an important requirement for the large-scale application of machine learning based technologies. The most commonly used fairness notions (e.g. statistical parity, equalized odds, predictive parity, etc.) are observational and rely on mere correlation between variables. These notions fail to identify bias in case of statistical anomalies such as Simpson's or Berkson's paradoxes. Causality-based fairness notions (e.g. counterfactual fairness, no-proxy discrimination, etc.) are immune to such anomalies and hence more reliable to assess fairness. The problem of causality-based fairness notions, however, is that they are defined in terms of quantities (e.g. causal, counterfactual, and path-specific effects) that are not always measurable. This is known as the identifiability problem and is the topic of a large body of work in the causal inference literature. This paper is a compilation of the major identifiability results which are of particular relevance for machine learning fairness. The results are illustrated using a large number of examples and causal graphs. The paper would be of particular interest to fairness researchers, practitioners, and policy makers who are considering the use of causality-based fairness notions, as it summarizes and illustrates the major identifiability results.
    Improved Cardiac Arrhythmia Prediction Based on Heart Rate Variability Analysis. (arXiv:2206.03222v1 [cs.LG])
    Many types of ventricular and atrial cardiac arrhythmias have been discovered in clinical practice in the past 100 years, and these arrhythmias are a major contributor to sudden cardiac death. Ventricular tachycardia, ventricular fibrillation, and paroxysmal atrial fibrillation are the most commonly-occurring and dangerous arrhythmias, therefore early detection is crucial to prevent any further complications and reduce fatalities. Implantable devices such as pacemakers are commonly used in patients at high risk of sudden cardiac death. While great advances have been made in medical technology, there remain significant challenges in effective management of common arrhythmias. This thesis proposes novel arrhythmia detection and prediction methods to differentiate cardiac arrhythmias from non-life-threatening cardiac events, to increase the likelihood of detecting events that may lead to mortality, as well as reduce the incidence of unnecessary therapeutic intervention. The methods are based on detailed analysis of Heart Rate Variability (HRV) information. The results of the work show good performance of the proposed methods and support the potential for their deployment in resource-constrained devices for ventricular and atrial arrhythmia prediction, such as implantable pacemakers and defibrillators.
    Debiased Self-Training for Semi-Supervised Learning. (arXiv:2202.07136v3 [cs.LG] UPDATED)
    Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets. Yet these datasets are time-consuming and labor-exhaustive to obtain on realistic tasks. To mitigate the requirement for labeled data, self-training is widely used in semi-supervised learning by iteratively assigning pseudo labels to unlabeled samples. Despite its popularity, self-training is widely believed to be unreliable and often leads to training instability. Our experimental studies further reveal that the bias in semi-supervised learning arises from both the problem itself and the inappropriate training with potentially incorrect pseudo labels, which accumulates the error in the iterative self-training process. To reduce the above bias, we propose Debiased Self-Training (DST). First, the generation and utilization of pseudo labels are decoupled by two parameter-independent classifier heads to avoid direct error accumulation. Second, we estimate the worst case of self-training bias, where the pseudo labeling function is accurate on labeled samples, yet makes as many mistakes as possible on unlabeled samples. We then adversarially optimize the representations to improve the quality of pseudo labels by avoiding the worst case. Extensive experiments justify that DST achieves an average improvement of 6.3% against state-of-the-art methods on standard semi-supervised learning benchmark datasets and 18.9% against FixMatch on 13 diverse tasks. Furthermore, DST can be seamlessly adapted to other self-training methods and help stabilize their training and balance performance across classes in both cases of training from scratch and finetuning from pre-trained models.
    Adaptive Weighted Nonnegative Matrix Factorization for Robust Feature Representation. (arXiv:2206.03020v1 [cs.LG])
    Nonnegative matrix factorization (NMF) has been widely used for dimensionality reduction in machine learning. However, the traditional NMF does not properly handle outliers, so it is sensitive to noise. In order to improve the robustness of NMF, this paper proposes an adaptive weighted NMF, which introduces weights to emphasize the different importance of each data point, so that the algorithmic sensitivity to noisy data is decreased. It is very different from the existing robust NMFs that use a slow growth similarity measure. Specifically, two strategies are proposed to achieve this: a fuzzier weighted technique and an entropy weighted regularized technique, both of which lead to an iterative solution with a simple form. Experimental results show that the new methods yield more robust feature representations on several real datasets with noise than existing methods.
    Joint Manifold Learning and Density Estimation Using Normalizing Flows. (arXiv:2206.03293v1 [cs.LG])
    Based on the manifold hypothesis, real-world data often lie on a low-dimensional manifold, while normalizing flows as a likelihood-based generative model are incapable of finding this manifold due to their structural constraints. So, one interesting question arises: $\textit{"Can we find sub-manifold(s) of data in normalizing flows and estimate the density of the data on the sub-manifold(s)?"}$. In this paper, we introduce two approaches, namely per-pixel penalized log-likelihood and hierarchical training, to answer this question. We propose a single-step method for joint manifold learning and density estimation by disentangling the transformed space obtained by normalizing flows into manifold and off-manifold parts. This is done by a per-pixel penalized likelihood function for learning a sub-manifold of the data. Normalizing flows assume the transformed data is Gaussianized, but this imposed assumption is not necessarily true, especially in high dimensions. To tackle this problem, a hierarchical training approach is employed to improve the density estimation on the sub-manifold. The results validate the superiority of the proposed methods in simultaneous manifold learning and density estimation using normalizing flows in terms of generated image quality and likelihood.
    Specification-Guided Learning of Nash Equilibria with High Social Welfare. (arXiv:2206.03348v1 [cs.GT])
    Reinforcement learning has been shown to be an effective strategy for automatically training policies for challenging control problems. Focusing on non-cooperative multi-agent systems, we propose a novel reinforcement learning framework for training joint policies that form a Nash equilibrium. In our approach, rather than providing low-level reward functions, the user provides high-level specifications that encode the objective of each agent. Then, guided by the structure of the specifications, our algorithm searches over policies to identify one that provably forms an $\epsilon$-Nash equilibrium (with high probability). Importantly, it prioritizes policies in a way that maximizes social welfare across all agents. Our empirical evaluation demonstrates that our algorithm computes equilibrium policies with high social welfare, whereas state-of-the-art baselines either fail to compute Nash equilibria or compute ones with comparatively lower social welfare.
    Data Stealing Attack on Medical Images: Is it Safe to Export Networks from Data Lakes?. (arXiv:2206.03391v1 [cs.CR])
    In privacy-preserving machine learning, it is common that the owner of the learned model does not have any physical access to the data. Instead, only a secured remote access to a data lake is granted to the model owner without any ability to retrieve data from the data lake. Yet, the model owner may want to export the trained model periodically from the remote repository, and a question arises whether this may cause a risk of data leakage. In this paper, we introduce the concept of a data stealing attack during the export of neural networks. It consists in hiding some information in the exported network that allows the reconstruction outside the data lake of images initially stored in that data lake. More precisely, we show that it is possible to train a network that can perform lossy image compression and at the same time solve some utility tasks such as image segmentation. The attack then proceeds by exporting the compression decoder network together with some image codes that lead to the image reconstruction outside the data lake. We explore the feasibility of such attacks on databases of CT and MR images, showing that it is possible to obtain perceptually meaningful reconstructions of the target dataset, and that the stolen dataset can be used in turn to solve a broad range of tasks. Comprehensive experiments and analyses show that data stealing attacks should be considered as a threat for sensitive imaging data sources.
    Plug & Play Attacks: Towards Robust and Flexible Model Inversion Attacks. (arXiv:2201.12179v3 [cs.LG] UPDATED)
    Model inversion attacks (MIAs) aim to create synthetic images that reflect the class-wise characteristics from a target classifier's private training data by exploiting the model's learned knowledge. Previous research has developed generative MIAs that use generative adversarial networks (GANs) as image priors tailored to a specific target model. This makes the attacks time- and resource-consuming, inflexible, and susceptible to distributional shifts between datasets. To overcome these drawbacks, we present Plug & Play Attacks, which relax the dependency between the target model and image prior, and enable the use of a single GAN to attack a wide range of targets, requiring only minor adjustments to the attack. Moreover, we show that powerful MIAs are possible even with publicly available pre-trained GANs and under strong distributional shifts, for which previous approaches fail to produce meaningful results. Our extensive evaluation confirms the improved robustness and flexibility of Plug & Play Attacks and their ability to create high-quality images revealing sensitive class characteristics.
    Towards Meta-learned Algorithm Selection using Implicit Fidelity Information. (arXiv:2206.03130v1 [cs.LG])
    Automatically selecting the best performing algorithm for a given dataset or ranking multiple of them by their expected performance supports users in developing new machine learning applications. Most approaches for this problem rely on dataset meta-features and landmarking performances to capture the salient topology of the datasets and those topologies that the algorithms attend to. Landmarking usually exploits cheap algorithms not necessarily in the pool of candidate algorithms to get inexpensive approximations of the topology. While somewhat indicative, handcrafted dataset meta-features and landmarks are likely insufficient descriptors, strongly depending on the alignment of the geometries the landmarks and candidates search for. We propose IMFAS, a method to exploit multi-fidelity landmarking information directly from the candidate algorithms in the form of non-parametrically non-myopic meta-learned learning curves via LSTM networks in a few-shot setting during testing. Using this mechanism, IMFAS jointly learns the topology of the datasets and the inductive biases of algorithms without expensively training them to convergence. IMFAS produces informative landmarks, easily enriched by arbitrary meta-features at a low computational cost, capable of producing the desired ranking using cheaper fidelities. We additionally show that it is able to beat Successive Halving with at most half the fidelity sequence during test time.
    Efficient and Accurate Physics-aware Multiplex Graph Neural Networks for 3D Small Molecules and Macromolecule Complexes. (arXiv:2206.02789v1 [q-bio.BM])
    Recent advances in applying Graph Neural Networks (GNNs) to molecular science have showcased the power of learning three-dimensional (3D) structure representations with GNNs. However, most existing GNNs suffer from insufficient modeling of diverse interactions, computationally expensive operations, and ignorance of vectorial values. Here, we tackle these limitations by proposing a novel GNN model, Physics-aware Multiplex Graph Neural Network (PaxNet), to efficiently and accurately learn the representations of 3D molecules for both small organic compounds and macromolecule complexes. PaxNet separates the modeling of local and non-local interactions, inspired by molecular mechanics, and reduces the expensive angle-related computations. Besides scalar properties, PaxNet can also predict vectorial properties by learning an associated vector for each atom. To evaluate the performance of PaxNet, we compare it with state-of-the-art baselines in two tasks. On a small-molecule dataset for predicting quantum chemical properties, PaxNet reduces the prediction error by 15% and uses 73% less memory than the best baseline. On a macromolecule dataset for predicting protein-ligand binding affinities, PaxNet outperforms the best baseline while reducing the memory consumption by 33% and the inference time by 85%. Thus, PaxNet provides a universal, robust and accurate method for large-scale machine learning of molecules.
    On Outer Bi-Lipschitz Extensions of Linear Johnson-Lindenstrauss Embeddings of Low-Dimensional Submanifolds of $\mathbb{R}^N$. (arXiv:2206.03376v1 [math.NA])
    Let $\mathcal{M}$ be a compact $d$-dimensional submanifold of $\mathbb{R}^N$ with reach $\tau$ and volume $V_{\mathcal M}$. Fix $\epsilon \in (0,1)$. In this paper we prove that a nonlinear function $f: \mathbb{R}^N \rightarrow \mathbb{R}^{m}$ exists with $m \leq C \left(d / \epsilon^2 \right) \log \left(\frac{\sqrt[d]{V_{\mathcal M}}}{\tau} \right)$ such that $$(1 - \epsilon) \| {\bf x} - {\bf y} \|_2 \leq \left\| f({\bf x}) - f({\bf y}) \right\|_2 \leq (1 + \epsilon) \| {\bf x} - {\bf y} \|_2$$ holds for all ${\bf x} \in \mathcal{M}$ and ${\bf y} \in \mathbb{R}^N$. In effect, $f$ not only serves as a bi-Lipschitz function from $\mathcal{M}$ into $\mathbb{R}^{m}$ with bi-Lipschitz constants close to one, but also approximately preserves all distances from points not in $\mathcal{M}$ to all points in $\mathcal{M}$ in its image. Furthermore, the proof is constructive and yields an algorithm which works well in practice. In particular, it is empirically demonstrated herein that such nonlinear functions allow for more accurate compressive nearest neighbor classification than standard linear Johnson-Lindenstrauss embeddings do in practice.
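    For comparison, the standard linear Johnson-Lindenstrauss baseline mentioned at the end of the abstract can be sketched as a scaled random Gaussian matrix. This is only the linear baseline, not the nonlinear extension $f$ constructed in the paper; the dimensions, seed, and function names below are assumptions chosen for illustration.

```python
import math
import random


def random_projection(n_in, n_out, rng):
    """Return an n_out x n_in matrix with i.i.d. N(0, 1/n_out) entries."""
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]


def project(matrix, x):
    """Apply the linear map to a vector given as a list of floats."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def distance(x, y):
    """Euclidean distance between two equally long lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distortion(matrix, x, y):
    """Ratio of the embedded distance to the original distance."""
    return distance(project(matrix, x), project(matrix, y)) / distance(x, y)
```

    A distortion close to one for a pair of points corresponds to the two-sided inequality in the abstract holding with a small $\epsilon$ for that pair; the paper's contribution is to extend such guarantees to pairs where only one point lies on the manifold.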
    Adaptive Regularization for Adversarial Training. (arXiv:2206.03353v1 [stat.ML])
    Adversarial training, which is to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is to use a data-adaptive regularization for robustifying a prediction model. We apply more regularization to data which are more vulnerable to adversarial attacks and vice versa. Even though the idea of data-adaptive regularization is not new, our data-adaptive regularization has a firm theoretical base of reducing an upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean samples) and robustness (accuracy on adversarial attacks) simultaneously to achieve the state-of-the-art performance.
    Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks. (arXiv:2206.02887v1 [cs.LG])
    We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' {\it mismatch in the function space}, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
    Schema-Guided Event Graph Completion. (arXiv:2206.02921v1 [cs.LG])
    We tackle a new task, event graph completion, which aims to predict missing event nodes for event graphs. Existing link prediction or graph completion methods have difficulty dealing with event graphs because they are usually designed for a single large graph such as a social network or a knowledge graph, rather than multiple small dynamic event graphs. Moreover, they can only predict missing edges rather than missing nodes. In this work, we propose to utilize event schema, a template that describes the stereotypical structure of event graphs, to address the above issues. Our schema-guided event graph completion approach first maps an instance event graph to a subgraph of the schema graph by a heuristic subgraph matching algorithm. Then it predicts whether a candidate event node in the schema graph should be added to the instantiated schema subgraph by characterizing two types of local topology of the schema graph: neighbors of the candidate node and the subgraph, and paths that connect the candidate node and the subgraph. These two modules are later combined together for the final prediction. We also propose a self-supervised strategy to construct training samples, as well as an inference algorithm that is specifically designed to complete event graphs. Extensive experimental results on four datasets demonstrate that our proposed method achieves state-of-the-art performance, with 4.3% to 19.4% absolute F1 gains over the best baseline method on the four datasets.
    The Survival Bandit Problem. (arXiv:2206.03019v1 [cs.LG])
    We study the survival bandit problem, a variant of the multi-armed bandit problem introduced in an open problem by Perotto et al. (2019), with a constraint on the cumulative reward; at each time step, the agent receives a (possibly negative) reward and if the cumulative reward becomes lower than a prespecified threshold, the procedure stops, and this phenomenon is called ruin. This is the first paper studying a framework where the ruin might occur but not always. We first discuss that a sublinear regret is unachievable under a naive definition of the regret. Next, we provide tight lower bounds on the probability of ruin (as well as matching policies). Based on this lower bound, we define the survival regret as an objective to minimize and provide a policy achieving a sublinear survival regret (at least in the case of integral rewards) when the time horizon $T$ is known.
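    The stopping rule in this setting can be sketched with a small simulation. The reward model (each arm pays +1 or -1), the `policy` signature, and the function name are assumptions made for illustration; the paper's policies and regret definitions are not reproduced.

```python
import random


def run_survival_bandit(arm_probs, threshold, horizon, policy, rng):
    """Simulate one episode and return (cumulative_reward, steps, ruined).

    arm_probs[k] is the probability that arm k pays +1 (otherwise -1).
    policy(t, cumulative) returns the index of the arm to pull at step t.
    The episode stops early (ruin) once the cumulative reward drops
    below the threshold.
    """
    cumulative = 0
    for t in range(horizon):
        arm = policy(t, cumulative)
        reward = 1 if rng.random() < arm_probs[arm] else -1
        cumulative += reward
        if cumulative < threshold:
            return cumulative, t + 1, True
    return cumulative, horizon, False
```

    Running many episodes with different seeds gives an empirical estimate of the probability of ruin for a given policy, which is the quantity the abstract bounds from below.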
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v3 [cs.LG] UPDATED)
    We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
    Dual Decomposition of Convex Optimization Layers for Consistent Attention in Medical Images. (arXiv:2206.02761v2 [cs.CV] UPDATED)
    A key concern in integrating machine learning models in medicine is the ability to interpret their reasoning. Popular explainability methods have demonstrated satisfactory results in natural image recognition, yet in medical image analysis, many of these approaches provide partial and noisy explanations. Recently, attention mechanisms have shown compelling results both in their predictive performance and in their interpretable qualities. A fundamental trait of attention is that it leverages salient parts of the input which contribute to the model's prediction. To this end, our work focuses on the explanatory value of attention weight distributions. We propose a multi-layer attention mechanism that enforces consistent interpretations between attended convolutional layers using convex optimization. We apply duality to decompose the consistency constraints between the layers by reparameterizing their attention probability distributions. We further suggest learning the dual witness by optimizing with respect to our objective; thus, our implementation uses standard back-propagation, hence it is highly efficient. While preserving predictive performance, our proposed method leverages weakly annotated medical imaging data and provides complete and faithful explanations to the model's prediction.
    Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime. (arXiv:2206.02927v1 [stat.ML])
    We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples from the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone and thus does not depend on the target function which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore the width does not need to grow polynomially with the number of samples in order to obtain high probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
    Improving Knowledge Graph Embedding via Iterative Self-Semantic Knowledge Distillation. (arXiv:2206.02963v1 [cs.LG])
    Knowledge graph embedding (KGE) has been intensively investigated for link prediction by projecting entities and relations into continuous vector spaces. Current popular high-dimensional KGE methods obtain only slight performance gains while requiring enormous computation and memory costs. In contrast to high-dimensional KGE models, training low-dimensional models is more efficient and worthwhile for better deployments to practical intelligent systems. However, the model expressiveness of semantic information in knowledge graphs (KGs) is highly limited in the low-dimensional parameter space. In this paper, we propose an iterative self-semantic knowledge distillation strategy to improve the KGE model expressiveness in the low-dimensional space. A KGE model combined with our proposed strategy plays the teacher and student roles alternately during the whole training process. Specifically, at a certain iteration, the model is regarded as a teacher to provide semantic information for the student. At the next iteration, the model is regarded as a student to incorporate the semantic information transferred from the teacher. We also design a novel semantic extraction block to extract iteration-based semantic information for the training model's self-distillation. Iteratively incorporating and accumulating iteration-based semantic information enables the low-dimensional model to be more expressive for better link prediction in KGs. There is only one model during the whole training, which alleviates the increase in computational cost and memory requirements. Furthermore, the proposed strategy is model-agnostic and can be seamlessly combined with other KGE models. Consistent and significant performance gains in experimental evaluations on four standard datasets demonstrate the effectiveness of the proposed self-distillation strategy.
    Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL. (arXiv:2206.02380v2 [cs.LG] UPDATED)
    Model-based reinforcement learning promises to learn an optimal policy from fewer interactions with the environment compared to model-free reinforcement learning by learning an intermediate model of the environment in order to predict future interactions. When predicting a sequence of interactions, the rollout length, which limits the prediction horizon, is a critical hyperparameter as accuracy of the predictions diminishes in the regions that are further away from real experience. As a result, with a longer rollout length, an overall worse policy is learned in the long run. Thus, the hyperparameter provides a trade-off between quality and efficiency. In this work, we frame the problem of tuning the rollout length as a meta-level sequential decision-making problem that optimizes the final policy learned by model-based reinforcement learning given a fixed budget of environment interactions by adapting the hyperparameter dynamically based on feedback from the learning process, such as accuracy of the model and the remaining budget of interactions. We use model-free deep reinforcement learning to solve the meta-level decision problem and demonstrate that our approach outperforms common heuristic baselines on two well-known reinforcement learning environments.
    Bump Hunting in Latent Space. (arXiv:2103.06595v2 [hep-ph] UPDATED)
    Unsupervised anomaly detection could be crucial in future analyses searching for rare phenomena in large datasets, as for example collected at the LHC. To this end, we introduce a physics inspired variational autoencoder (VAE) architecture which performs competitively and robustly on the LHC Olympics Machine Learning Challenge datasets. We demonstrate how embedding some physical observables directly into the VAE latent space, while at the same time keeping the classifier manifestly agnostic to them, can help to identify and characterise features in measured spectra as caused by the presence of anomalies in a dataset.
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v3 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex), under an interpolation regime. At the heart of our analysis is a new generalization error bound for deterministic symmetric algorithms, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, Polyak-Lojasiewicz (PL), convex and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, under the proper choice of a decreasing step size. Further, if the loss is nonconvex but the objective is PL, we derive quadratically vanishing bounds on the generalization error and the corresponding excess risk, for a choice of a large constant step size. For (resp. strongly-) convex smooth losses, we prove that full-batch GD also generalizes for large constant step sizes, and achieves (resp. quadratically) small excess risk while training fast. In all cases, we close the generalization error gap, by showing matching generalization and optimization error rates. Our full-batch GD generalization error and excess risk bounds are strictly tighter than existing bounds for (stochastic) GD, when the loss is smooth (but possibly non-Lipschitz).
    On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning. (arXiv:2111.05992v2 [cs.LG] UPDATED)
    The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critically under-explored area of research. Current MARL algorithms often assume that the number of agents within a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before their teammates. This early termination issue presents a challenge: the terminated agent must learn from the group's success or failure which occurs beyond its own existence. We refer to propagating value from rewards earned by remaining teammates to terminated agents as the Posthumous Credit Assignment problem. Current MARL methods handle this problem by placing these agents in an absorbing state until the entire group of agents reaches a termination condition. Although absorbing states enable existing algorithms and APIs to handle terminated agents without modification, practical training efficiency and resource use problems exist. In this work, we first demonstrate that sample complexity increases with the quantity of absorbing states in a toy supervised learning task for a fully connected network, while attention is more robust to variable size input. Then, we present a novel architecture for an existing state-of-the-art MARL algorithm which uses attention instead of a fully connected layer with absorbing states. Finally, we demonstrate that this novel architecture significantly outperforms the standard architecture on tasks in which agents are created or destroyed within episodes as well as standard multi-agent coordination tasks.
    Adversarial Bandits Robust to $S$-Switch Regret. (arXiv:2205.14839v2 [cs.LG] UPDATED)
    We study the adversarial bandit problem under $S$ number of switching best arms for unknown $S$. For handling this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with basic OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. For improving the regret bound with respect to $T$, we propose to use adaptive learning rates for OMD to control variance of loss estimators, and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_T(h^\dagger)}],S\sqrt{KT}\})$, where $\rho_T(h^\dagger)$ is a variance term for loss estimators.
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v2 [cs.LG] UPDATED)
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some values of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.
    Reachability In Simple Neural Networks. (arXiv:2203.07941v2 [cs.CC] UPDATED)
    We investigate the complexity of the reachability problem for (deep) neural networks: does it compute valid output given some valid input? It was recently claimed that the problem is NP-complete for general neural networks and specifications over the input/output dimension given by conjunctions of linear inequalities. We recapitulate the proof and repair some flaws in the original upper and lower bound proofs. Motivated by the general result, we show that NP-hardness already holds for restricted classes of simple specifications and neural networks. Allowing for a single hidden layer and an output dimension of one as well as neural networks with just one negative, zero and one positive weight or bias is sufficient to ensure NP-hardness. Additionally, we give a thorough discussion and outlook of possible extensions for this direction of research on neural network verification.
    Few-Shot Learning on Graphs. (arXiv:2203.09308v2 [cs.LG] UPDATED)
    Graph representation learning has attracted tremendous attention due to its remarkable performance in many real-world applications. However, prevailing supervised graph representation learning models for specific tasks often suffer from the label sparsity issue, as data labeling is always time and resource consuming. In light of this, few-shot learning on graphs (FSLG), which combines the strengths of graph representation learning and few-shot learning, has been proposed to tackle the performance degradation caused by limited annotated data. There have been many studies working on FSLG recently. In this paper, we comprehensively survey these works in the form of a series of methods and applications. Specifically, we first introduce FSLG challenges and bases, then categorize and summarize existing work on FSLG in terms of three major graph mining tasks at different granularity levels, i.e., node, edge, and graph. Finally, we share our thoughts on some future research directions of FSLG. The authors of this survey have contributed significantly to the AI literature on FSLG over the last few years.
    KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction. (arXiv:2206.03364v1 [q-bio.BM])
    Designing accurate deep learning models for molecular property prediction plays an increasingly essential role in drug and material discovery. Recently, due to the scarcity of labeled molecules, self-supervised learning methods for learning generalizable and transferable representations of molecular graphs have attracted lots of attention. In this paper, we argue that there exist two major issues hindering current self-supervised learning methods from obtaining desired performance on molecular property prediction, that is, the ill-defined pre-training tasks and the limited model capacity. To this end, we introduce Knowledge-guided Pre-training of Graph Transformer (KPGT), a novel self-supervised learning framework for molecular graph representation learning, to alleviate the aforementioned issues and improve the performance on the downstream molecular property prediction tasks. More specifically, we first introduce a high-capacity model, named Line Graph Transformer (LiGhT), which emphasizes the importance of chemical bonds and is mainly designed to model the structural information of molecular graphs. Then, a knowledge-guided pre-training strategy is proposed to exploit the additional knowledge of molecules to guide the model to capture the abundant structural and semantic information from large-scale unlabeled molecular graphs. Extensive computational tests demonstrated that KPGT can offer superior performance over current state-of-the-art methods on several molecular property prediction tasks.
    Gender Bias in Word Embeddings: A Comprehensive Analysis of Frequency, Syntax, and Semantics. (arXiv:2206.03390v1 [cs.CY])
    The statistical regularities in language corpora encode well-known social biases into word embeddings. Here, we focus on gender to provide a comprehensive analysis of group-based biases in widely-used static English word embeddings trained on internet corpora (GloVe 2014, fastText 2017). Using the Single-Category Word Embedding Association Test, we demonstrate the widespread prevalence of gender biases that also show differences in: (a) frequencies of words associated with men versus women; (b) part-of-speech tags in gender-associated words; (c) semantic categories in gender-associated words; and (d) valence, arousal, and dominance in gender-associated words. First, in terms of word frequency: we find that, of the 1,000 most frequent words in the vocabulary, 77% are more associated with men than women, providing direct evidence of a masculine default in the everyday language of the English-speaking world. Second, turning to parts-of-speech: the top male-associated words are typically verbs (e.g., fight, overpower) while the top female-associated words are typically adjectives and adverbs (e.g., giving, emotionally). Gender biases in embeddings also permeate parts-of-speech. Third, for semantic categories: we report bottom-up cluster analyses of the top 1,000 words associated with each gender. The top male-associated concepts include roles and domains of big tech, engineering, religion, sports, and violence; in contrast, the top female-associated concepts are less focused on roles, including, instead, female-specific slurs and sexual content, as well as appearance and kitchen terms. Fourth, using human ratings of word valence, arousal, and dominance from a ~20,000 word lexicon, we find that male-associated words are higher on arousal and dominance, while female-associated words are higher on valence.
    Task-aware Privacy Preservation for Multi-dimensional Data. (arXiv:2110.02329v2 [cs.CR] UPDATED)
    Local differential privacy (LDP) can be adopted to anonymize richer user data attributes that will be input to sophisticated machine learning (ML) tasks. However, today's LDP approaches are largely task-agnostic and often lead to severe performance loss -- they simply inject noise to all data attributes according to a given privacy budget, regardless of what features are most relevant for the ultimate task. In this paper, we address how to significantly improve the ultimate task performance with multi-dimensional user data by considering a task-aware privacy preservation problem. The key idea is to use an encoder-decoder framework to learn (and anonymize) a task-relevant latent representation of user data. We obtain an analytical near-optimal solution for the linear setting with mean-squared error (MSE) task loss. We also provide an approximate solution through a gradient-based learning algorithm for general nonlinear cases. Extensive experiments demonstrate that our task-aware approach significantly improves ultimate task accuracy compared to standard benchmark LDP approaches with the same level of privacy guarantee.
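    The task-agnostic baseline that this abstract criticizes (noise injected into every attribute regardless of task relevance) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: it assumes each attribute is clipped to [0, 1], uses the Laplace mechanism, and splits the privacy budget evenly across attributes. The function names `laplace_noise` and `privatize` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(record, epsilon, rng):
    """Task-agnostic LDP baseline (assumed setup, not the paper's method).

    Each attribute is clipped to [0, 1], so its sensitivity is 1. The total
    budget epsilon is split evenly over the d attributes, which gives a
    per-attribute Laplace scale of d / epsilon.
    """
    d = len(record)
    scale = d / epsilon
    clipped = [min(1.0, max(0.0, v)) for v in record]
    return [v + laplace_noise(scale, rng) for v in clipped]


rng = random.Random(0)
record = [0.2, 0.5, 0.9]  # three user attributes, already in [0, 1]
print(privatize(record, epsilon=1.0, rng=rng))   # heavy noise on every attribute
print(privatize(record, epsilon=10.0, rng=rng))  # lighter noise, weaker privacy
```

    Because the noise scale grows with the number of attributes, every feature is degraded equally, including those that matter most for the downstream task.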
    Reweighing auxiliary losses in supervised learning. (arXiv:2202.03250v2 [cs.LG] UPDATED)
    Apart from standard supervised learning using hard labels, auxiliary losses are often used in supervised learning settings to improve the model's generalisation. For example, knowledge distillation adds a second, teacher-mimicking loss to the training of a model, where the teacher may be a pretrained model that outputs a richer distribution over labels. Similarly, in settings with limited labelled data, weak labelling information is used in the form of labelling functions. Auxiliary losses are introduced here to combat labelling functions that may be noisy rule-based approximations of true labels. We tackle the problem of learning to combine these losses in a principled manner. We introduce AMAL, which learns instance-specific weights using meta learning on a validation metric to achieve optimal mixing of losses. Experiments in a number of knowledge distillation and rule denoising domains show that AMAL provides noticeable gains over competitive baselines in those domains. We empirically analyze our method and share insights into the mechanisms through which it provides performance gains.
    Physics-Inspired Temporal Learning of Quadrotor Dynamics for Accurate Model Predictive Trajectory Tracking. (arXiv:2206.03305v1 [cs.RO])
    Accurately modeling a quadrotor's system dynamics is critical for guaranteeing agile, safe, and stable navigation. The model needs to capture the system behavior in multiple flight regimes and operating conditions, including those producing highly nonlinear effects such as aerodynamic forces and torques, rotor interactions, or possible system configuration modifications. Classical approaches rely on handcrafted models and struggle to generalize and scale to capture these effects. In this paper, we present a novel Physics-Inspired Temporal Convolutional Network (PI-TCN) approach to learning a quadrotor's system dynamics purely from robot experience. Our approach combines the expressive power of sparse temporal convolutions and dense feed-forward connections to make accurate system predictions. In addition, physics constraints are embedded in the training process to facilitate the network's generalization capabilities to data outside the training distribution. Finally, we design a model predictive control approach that incorporates the learned dynamics for accurate closed-loop trajectory tracking, fully exploiting the learned model predictions in a receding horizon fashion. Experimental results demonstrate that our approach accurately extracts the structure of the quadrotor's dynamics from data, capturing effects that would remain hidden to classical approaches. To the best of our knowledge, this is the first time physics-inspired deep learning is successfully applied to temporal convolutional networks and to the system identification task, while concurrently enabling predictive control.
    Few-Shot Learning by Dimensionality Reduction in Gradient Space. (arXiv:2206.03483v1 [cs.LG])
    We introduce SubGD, a novel few-shot learning method which is based on the recent finding that stochastic gradient descent updates tend to live in a low-dimensional parameter subspace. In experimental and theoretical analyses, we show that models confined to a suitable predefined subspace generalize well for few-shot learning. A suitable subspace fulfills three criteria across the given tasks: it (a) allows the training error to be reduced by gradient flow, (b) leads to models that generalize well, and (c) can be identified by stochastic gradient descent. SubGD identifies these subspaces from an eigendecomposition of the auto-correlation matrix of update directions across different tasks. Demonstrably, we can identify low-dimensional suitable subspaces for few-shot learning of dynamical systems, which have varying properties described by one or a few parameters of the analytical system description. Such systems are ubiquitous among real-world applications in science and engineering. We experimentally corroborate the advantages of SubGD on three distinct dynamical systems problem settings, significantly outperforming popular few-shot learning methods both in terms of sample efficiency and performance.
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v2 [stat.ML] UPDATED)
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.
    First is Better Than Last for Training Data Influence. (arXiv:2202.11844v2 [cs.LG] UPDATED)
    The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques to do so are based on the flow of training data influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, so techniques usually pick the last layer of weights. However, we observe that since the activation connected to the last layer of weights contains ``shared logic'', the data influence calculated via the last-layer weights is prone to a ``cancellation effect'', where the data influences of different examples have large magnitudes that contradict each other. The cancellation effect lowers the discriminative power of the influence score, and deleting influential examples according to this measure often does not change the model's behavior by much. To mitigate this, we propose a technique called TracIn-WE that modifies a method called TracIn to operate on the word embedding layer instead of the last layer, where the cancellation effect is less severe. One potential concern is that influence based on the word embedding layer may not encode sufficient high-level information. However, we find that gradients (unlike embeddings) do not suffer from this, possibly because they chain through higher layers. We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer by 4-10 on the case deletion evaluation on three language classification tasks. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging.
    Towards a General Purpose CNN for Long Range Dependencies in $\mathrm{N}$D. (arXiv:2206.03398v1 [cs.LG])
    The use of Convolutional Neural Networks (CNNs) is widespread in Deep Learning due to a range of desirable model properties which result in an efficient and effective machine learning framework. However, performant CNN architectures must be tailored to specific tasks in order to incorporate considerations such as the input length, resolution, and dimensionality. In this work, we overcome the need for problem-specific CNN architectures with our Continuous Convolutional Neural Network (CCNN): a single CNN architecture equipped with continuous convolutional kernels that can be used for tasks on data of arbitrary resolution, dimensionality and length without structural changes. Continuous convolutional kernels model long range dependencies at every layer, and remove the need for downsampling layers and task-dependent depths needed in current CNN architectures. We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential (1$\mathrm{D}$) and visual data (2$\mathrm{D}$). Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.
    A Robust Classification-autoencoder to Defend Outliers and Adversaries. (arXiv:2106.15927v2 [cs.LG] UPDATED)
    In this paper, a robust classification-autoencoder (CAE) is proposed, which has a strong ability to recognize outliers and defend against adversaries. The main idea is to change the autoencoder from an unsupervised learning model into a classifier, where the encoder is used to compress samples with different labels into disjoint compression spaces and the decoder is used to recover samples from their compression spaces. The encoder is used both as a compressed feature learner and as a classifier, and the decoder is used to decide whether the classification given by the encoder is correct by comparing the input sample with the output. Since adversarial samples are seemingly inevitable for the current DNN framework, a list classifier to defend against adversaries is introduced based on the CAE, which outputs several labels and the corresponding samples recovered by the CAE. Extensive experimental results show that the CAE achieves state-of-the-art outlier recognition by finding almost all outliers; the list classifier gives near lossless classification in the sense that the output list contains the correct label for almost all adversaries and the size of the output list is reasonably small.
    DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators. (arXiv:2201.11218v2 [cs.LG] UPDATED)
    Dataflow/mapping decides the compute and energy efficiency of DNN accelerators. Many mappers have been proposed to tackle the intra-layer map-space. However, mappers for the inter-layer map-space (aka layer-fusion map-space) have rarely been discussed. In this work, we propose a mapper, DNNFuser, specifically focusing on this layer-fusion map-space. While existing SOTA DNN mapping explorations rely on search-based mappers, this is the first work, to the best of our knowledge, to propose a one-shot inference-based mapper. We leverage a Transformer as our DNN architecture to learn layer-fusion optimization as a sequence modeling problem. Further, the trained DNNFuser can generalize its knowledge and infer new solutions for unseen conditions. Within one inference pass, DNNFuser can infer solutions with performance comparable to the ones found by a highly optimized search-based mapper while being 66x-127x faster.
    Generative modeling via tensor train sketching. (arXiv:2202.11788v2 [math.NA] UPDATED)
    In this paper we introduce a sketching algorithm for constructing a tensor train representation of a probability density from its samples. Our method deviates from the standard recursive SVD-based procedure for constructing a tensor train. Instead we formulate and solve a sequence of small linear systems for the individual tensor train cores. This approach can avoid the curse of dimensionality that threatens both the algorithmic and sample complexities of the recovery problem. Specifically, for Markov models, we prove that the tensor cores can be recovered with a sample complexity that is constant with respect to the dimension. Finally, we illustrate the performance of the method with several numerical experiments.
    Progressive Distillation for Fast Sampling of Diffusion Models. (arXiv:2202.00512v2 [cs.LG] UPDATED)
    Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
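    As a quick arithmetic check on the figures quoted in this abstract, the sketch below lists the sampler step counts visited when halving repeatedly from 8192 steps down to 4 and counts the distillation rounds this implies. The helper name `halving_schedule` is hypothetical and is not taken from the paper; it only models the halving of step counts, not the distillation itself.

```python
def halving_schedule(start_steps, end_steps):
    """Return the step counts visited when halving from start_steps to end_steps."""
    if end_steps < 1 or start_steps < end_steps:
        raise ValueError("need start_steps >= end_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > end_steps:
        if schedule[-1] % 2 != 0:
            raise ValueError("step count is odd and cannot be halved exactly")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != end_steps:
        raise ValueError("end_steps is not reachable by repeated halving")
    return schedule


schedule = halving_schedule(8192, 4)
print(schedule)           # 8192, 4096, ..., 8, 4
print(len(schedule) - 1)  # number of halving (distillation) rounds
```

    Since 8192 = 2^13 and 4 = 2^2, going from the slowest to the fastest sampler takes 11 halving rounds.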
    Computing Graph Descriptors on Edge Streams. (arXiv:2109.01494v4 [cs.LG] UPDATED)
    Feature extraction is an essential task in graph analytics. These feature vectors, called graph descriptors, are used in downstream vector-space-based graph analysis models. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy. However, known algorithms to compute meaningful descriptors do not scale to large graphs since: (1) they require storing the entire graph in memory, and (2) the end-user has no control over the algorithm's runtime. In this paper, we present streaming algorithms to approximately compute three different graph descriptors capturing the essential structure of graphs. Operating on edge streams allows us to avoid storing the entire graph in memory, and controlling the sample size enables us to keep the runtime of our algorithms within desired bounds. We demonstrate the efficacy of the proposed descriptors by analyzing the approximation error and classification accuracy. Our scalable algorithms compute descriptors of graphs with millions of edges within minutes. Moreover, these descriptors yield predictive accuracy comparable to the state-of-the-art methods but can be computed using only 25% as much memory.
    Unbiased estimators for random design regression. (arXiv:1907.03411v2 [stat.ML] UPDATED)
    In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.
    Analyzing the impact of feature selection on the accuracy of heart disease prediction. (arXiv:2206.03239v1 [cs.LG])
    Heart disease has become one of the most serious diseases, with a significant impact on human life. It has emerged as one of the leading causes of mortality across the globe during the last decade. To prevent further harm to patients, an accurate and timely diagnosis of heart disease is essential. Recently, non-invasive approaches such as artificial intelligence-based techniques have seen growing use in the medical field. In particular, machine learning offers several algorithms and techniques that are widely used and highly useful for accurately diagnosing heart disease in less time. However, the prediction of heart disease is not an easy task. The increasing size of medical datasets has made it complicated for practitioners to understand the complex feature relations and make disease predictions. Accordingly, the aim of this research is to identify the most important risk factors from a high-dimensional dataset, which helps in the accurate classification of heart disease with fewer complications. For a broader analysis, we have used two heart disease datasets with various medical features. The classification results of the benchmarked models show that relevant features have a high impact on classification accuracy. Even with a reduced number of features, the performance of the classification models improved significantly, with reduced training time compared with models trained on the full feature set.
    Generating Long Videos of Dynamic Scenes. (arXiv:2206.03429v1 [cs.CV])
    We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
    On Transportation of Mini-batches: A Hierarchical Approach. (arXiv:2102.05912v5 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with a very high number of supports. The m-OT solves several smaller optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches which leads to undesirable estimation. Moreover, the m-OT does not approximate a proper metric between probability measures since the identity property is not satisfied. To address these problems, we propose a novel mini-batch scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that finds the optimal coupling between mini-batches and it can be seen as an approximation to a well-defined distance on the space of probability measures. Furthermore, we show that the m-OT is a limit of the entropic regularized version of the BoMb-OT when the regularized parameter goes to infinity. Finally, we carry out experiments on various applications including deep generative models, deep domain adaptation, approximate Bayesian computation, color transfer, and gradient flow to show that the BoMb-OT can be widely applied and performs well in various applications.
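The averaging step described above, and the resulting failure of the identity property, can be illustrated with a minimal sketch. It assumes one-dimensional points, squared-distance cost, uniform weights, and equal batch sizes, so that an optimal plan is a permutation and brute force is exact for tiny batches. The function names `ot_cost_uniform` and `minibatch_ot` are hypothetical and not taken from the paper.

```python
import itertools
import random


def ot_cost_uniform(xs, ys):
    """Exact OT cost between two equal-size uniform empirical measures.

    With uniform weights and equal sizes, some optimal plan is a permutation,
    so a brute-force search over permutations is exact (tiny batches only).
    """
    m = len(xs)
    return min(
        sum((xs[i] - ys[p[i]]) ** 2 for i in range(m)) / m
        for p in itertools.permutations(range(m))
    )


def minibatch_ot(xs, ys, batch_size, num_batches, rng):
    """Average the OT cost over independently sampled pairs of mini-batches."""
    total = 0.0
    for _ in range(num_batches):
        batch_x = rng.sample(xs, batch_size)
        batch_y = rng.sample(ys, batch_size)
        total += ot_cost_uniform(batch_x, batch_y)
    return total / num_batches


rng = random.Random(0)
points = [rng.gauss(0.0, 1.0) for _ in range(50)]

# Comparing a measure with itself: the full OT cost would be zero, but the
# mini-batch average is positive because the sampled batches differ.
self_cost = minibatch_ot(points, points, batch_size=4, num_batches=200, rng=rng)
print(f"m-OT cost of a measure against itself: {self_cost:.4f}")
```

The sketch only covers the plain averaging scheme; it does not implement BoMb-OT, which additionally couples the mini-batches with each other.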
    Towards Fairness-Aware Federated Learning. (arXiv:2111.01872v2 [cs.LG] UPDATED)
    Recent advances in Federated Learning (FL) have brought large-scale collaborative machine learning opportunities for massively distributed clients with performance and data privacy guarantees. However, most current works focus on the interest of the central controller in FL and overlook the interests of the FL clients. This may result in unfair treatment of clients, which discourages them from actively participating in the learning process and damages the sustainability of the FL ecosystem. Therefore, the topic of ensuring fairness in FL is attracting a great deal of research interest. In recent years, diverse Fairness-Aware FL (FAFL) approaches have been proposed in an effort to achieve fairness in FL from different perspectives. However, there is no comprehensive survey which helps readers gain insight into this interdisciplinary field. This paper aims to provide such a survey. By examining the fundamental and simplifying assumptions, as well as the notions of fairness adopted by existing literature in this field, we propose a taxonomy of FAFL approaches covering major steps in FL, including client selection, optimization, contribution evaluation and incentive distribution. In addition, we discuss the main metrics for experimentally evaluating the performance of FAFL approaches, and suggest promising future research directions towards fairness-aware federated learning.
    Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition. (arXiv:2206.03393v1 [cs.SD])
    Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation and adversarial training based defenses for securing SRSs. Based on the characteristics of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversarial attacks (4 white-box and 3 black-box) on speaker recognition. With careful regard for best practices in defense evaluations, we analyze the strength of transformations to withstand adaptive attacks. We also evaluate and understand their effectiveness against adaptive attacks when combined with adversarial training. Our study provides many useful insights and findings, many of which are new or inconsistent with the conclusions in the image and speech recognition domains, e.g., variable and constant bit rate speech compressions have different performance, and some non-differentiable transformations remain effective against current promising evasion techniques which often work well in the image domain. We demonstrate that the proposed novel feature-level transformation combined with adversarial training is rather effective compared to adversarial training alone in a complete white-box setting, e.g., increasing the accuracy by 13.62% and attack cost by two orders of magnitude, while other transformations do not necessarily improve the overall defense capability. This work sheds further light on the research directions in this field. We also release our evaluation platform SPEAKERGUARD to foster further research.
    Machine learning fairness notions: Bridging the gap with real-world applications. (arXiv:2006.16745v5 [cs.LG] UPDATED)
    Fairness emerged as an important requirement to guarantee that Machine Learning (ML) predictive systems do not discriminate against specific individuals or entire sub-populations, in particular, minorities. Given the inherent subjectivity of viewing the concept of fairness, several notions of fairness have been introduced in the literature. This paper is a survey that illustrates the subtleties between fairness notions through a large number of examples and scenarios. In addition, unlike other surveys in the literature, it addresses the question of: which notion of fairness is most suited to a given real-world scenario and why? Our attempt to answer this question consists in (1) identifying the set of fairness-related characteristics of the real-world scenario at hand, (2) analyzing the behavior of each fairness notion, and then (3) fitting these two elements to recommend the most suitable fairness notion in every specific setup. The results are summarized in a decision diagram that can be used by practitioners and policymakers to navigate the relatively large catalog of ML.
    Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization. (arXiv:2111.15645v4 [math.OC] UPDATED)
    For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging. Even when the objective is smooth at every iterate, the corresponding local models are unstable and the number of cutting planes invoked by traditional remedies is difficult to bound, leading to convergence guarantees that are sublinear relative to the cumulative number of gradient evaluations. We instead propose a multipoint generalization of the gradient descent iteration for local optimization. While designed with general objectives in mind, we are motivated by a ``max-of-smooth'' model that captures the subdifferential dimension at optimality. We prove linear convergence when the objective is itself max-of-smooth, and experiments suggest a more general phenomenon.
    On the Role of Discount Factor in Offline Reinforcement Learning. (arXiv:2206.03383v1 [cs.LG])
    Offline reinforcement learning (RL) enables effective learning from previously collected data without exploration, which shows great promise in real-world applications when exploration is expensive or even infeasible. The discount factor, $\gamma$, plays a vital role in improving online RL sample efficiency and estimation accuracy, but the role of the discount factor in offline RL is not well explored. This paper examines two distinct effects of $\gamma$ in offline RL with theoretical analysis, namely the regularization effect and the pessimism effect. On the one hand, $\gamma$ is a regulator to trade-off optimality with sample efficiency upon existing offline techniques. On the other hand, lower guidance $\gamma$ can also be seen as a way of pessimism where we optimize the policy's performance in the worst possible models. We empirically verify the above theoretical observation with tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both under small data regimes upon existing offline methods and in large data regimes without other conservatisms.
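As a small illustration of how $\gamma$ shapes what a policy values, the sketch below compares the discounted return of a small immediate reward with that of a larger delayed reward. The reward sequences and helper names are made up for illustration, and the example shows only the textbook effect of discounting, not the paper's regularization or pessimism analysis.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def effective_horizon(gamma):
    """Common rule of thumb: rewards beyond ~1/(1-gamma) steps matter little."""
    return 1.0 / (1.0 - gamma)


# Hypothetical choices: reward 1 immediately, or reward 5 after 10 steps.
near = [1.0]
far = [0.0] * 10 + [5.0]


def preferred(gamma):
    """Return which option has the larger discounted return."""
    return "far" if discounted_return(far, gamma) > discounted_return(near, gamma) else "near"


for gamma in (0.8, 0.99):
    print(gamma, round(effective_horizon(gamma), 1), preferred(gamma))
```

A lower $\gamma$ shortens the effective horizon, so the delayed reward is discounted below the immediate one and the preferred option flips.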
    The Fragility of Optimized Bandit Algorithms. (arXiv:2109.13595v2 [cs.LG] UPDATED)
    Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also provide a sharp trade-off between the amount of UCB exploration and the tail exponent of the resulting regret distribution.
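To make the object of study concrete, here is a minimal sketch of the standard UCB1 index policy on Bernoulli arms, run repeatedly to examine the spread of the regret rather than only its mean. The arm means, horizon, and the `exploration` constant are illustrative assumptions. This is textbook UCB1, not necessarily the optimized designs or the robust modification analyzed in the paper.

```python
import math
import random


def ucb1(arm_means, horizon, rng, exploration=2.0):
    """Run UCB1 on Bernoulli arms; return (pseudo-regret, pull counts)."""
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # Index = empirical mean + exploration bonus.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(exploration * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return regret, counts


rng = random.Random(0)
regrets = [ucb1([0.5, 0.6], horizon=2000, rng=rng)[0] for _ in range(100)]
mean_regret = sum(regrets) / len(regrets)
print(f"mean regret {mean_regret:.1f}, max regret {max(regrets):.1f}")
```

Shrinking `exploration` makes the policy more aggressive, which is the kind of knob the abstract's trade-off between exploration and the regret tail refers to. Comparing the maximum with the mean across runs gives a rough view of the tail.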
    Concentration bounds for SSP Q-learning for average cost MDPs. (arXiv:2206.03328v1 [cs.LG])
    We derive a concentration bound for a Q-learning algorithm for average cost Markov decision processes based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.
    Recent Advances for Quantum Neural Networks in Generative Learning. (arXiv:2206.03066v1 [quant-ph])
    Quantum computers are next-generation devices that hold promise to perform calculations beyond the reach of classical computers. A leading method towards achieving this goal is through quantum machine learning, especially quantum generative learning. Due to the intrinsic probabilistic nature of quantum mechanics, it is reasonable to postulate that quantum generative learning models (QGLMs) may surpass their classical counterparts. As such, QGLMs are receiving growing attention from the quantum physics and computer science communities, where various QGLMs that can be efficiently implemented on near-term quantum machines with potential computational advantages are proposed. In this paper, we review the current progress of QGLMs from the perspective of machine learning. Particularly, we interpret these QGLMs, covering quantum circuit born machines, quantum generative adversarial networks, quantum Boltzmann machines, and quantum autoencoders, as the quantum extension of classical generative learning models. In this context, we explore their intrinsic relation and their fundamental differences. We further summarize the potential applications of QGLMs in both conventional machine learning tasks and quantum physics. Last, we discuss the challenges and further research directions for QGLMs.
    Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions. (arXiv:2206.03254v1 [cs.LG])
    This theoretical paper is devoted to developing a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets for very high dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained analysis of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization and angular distribution in high-dimensional data space. This angle-based detailed analysis leads to asymptotic characterizations of the gradient norm and directional curvature of the objective function at each gradient descent iteration, revealing that the empirical loss function enjoys nice geometrical properties in the overparameterized setting. Along the way, we significantly improve existing theoretical bounds on both the over-parameterization condition and the learning rate with very mild assumptions for learning very high dimensional data. Moreover, we uncover the role of the geometrical and spectral properties of the input data in determining the desired over-parameterization size and global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space $\mapsto$ spectra of overparameterized activation matrices $\mapsto$ favorable geometrical properties of the empirical loss landscape $\mapsto$ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms have much stronger statistical guarantees with a much milder over-parameterization condition than existing theory states for learning very high dimensional data, which is rarely explored so far.
    Searching Similarity Measure for Binarized Neural Networks. (arXiv:2206.03325v1 [cs.LG])
    As a promising model for deployment on resource-limited devices, Binarized Neural Networks (BNNs) have drawn extensive attention from both academia and industry. However, compared to full-precision deep neural networks (DNNs), BNNs suffer from non-trivial accuracy degradation, limiting their applicability in various domains. This is partially because existing network components, such as the similarity measure, are specially designed for DNNs, and might be sub-optimal for BNNs. In this work, we focus on the key component of BNNs, the similarity measure, which quantifies the distance between input feature maps and filters, and propose an automatic searching method, based on a genetic algorithm, for a BNN-tailored similarity measure. Evaluation results on Cifar10 and Cifar100 using ResNet, NIN and VGG show that most of the identified similarity measures can achieve considerable accuracy improvement (up to 3.39%) over the commonly-used cross-correlation approach.
    Rites de Passage: Elucidating Displacement to Emplacement of Refugees. (arXiv:2206.03248v1 [cs.CY])
    Social media deliberations make it possible to explore refugee-related issues. AI-based studies have investigated refugee issues mostly around a specific event and considered unimodal approaches. In contrast, we have employed a multimodal architecture for probing the refugee journeys from their home to host nations. We draw insights from Arnold van Gennep's anthropological work 'Les Rites de Passage', which systematically analyzed an individual's transition from one group or society to another. Based on Gennep's separation-transition-incorporation framework, we have identified four phases of refugee journeys: Arrival of Refugees, Temporal stay at Asylums, Rehabilitation, and Integration of Refugees into the host nation. We collected 0.23 million multimodal tweets from April 2020 to March 2021 for testing this proposed framework. We find that a combination of transformer-based language models and state-of-the-art image recognition models, such as fusion of BERT+LSTM and InceptionV4, can outperform unimodal models. Subsequently, to test the practical implication of our proposed model in real-time, we have considered 0.01 million multimodal tweets related to the 2022 Ukrainian refugee crisis. An F1-score of 71.88% for this 2022 crisis confirms the generalizability of our proposed framework.
    FedRel: An Adaptive Federated Relevance Framework for Spatial Temporal Graph Learning. (arXiv:2206.03420v1 [cs.LG])
    Spatial-temporal data contains rich information and has been widely studied in recent years due to the rapid development of relevant applications in many fields. For instance, medical institutions often use electrodes attached to different parts of a patient to analyse electroencephalography data, rich with spatial and temporal features, for health assessment and disease diagnosis. Existing research has mainly used deep learning techniques such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to extract hidden spatial-temporal features. Yet, it is challenging to incorporate both interdependent spatial information and dynamic temporal changes simultaneously. In practice, a model that leverages these spatial-temporal features to fulfil complex prediction tasks often requires a colossal amount of training data in order to obtain satisfactory performance. Considering the above-mentioned challenges, we propose an adaptive federated relevance framework, namely FedRel, for spatial-temporal graph learning in this paper. After transforming the raw spatial-temporal data into high quality features, the core Dynamic Inter-Intra Graph (DIIG) module in the framework is able to use these features to generate the spatial-temporal graphs capable of capturing the hidden topological and long-term temporal correlation information in these graphs. To improve the model generalization ability and performance while preserving the local data privacy, we also design a relevance-driven federated learning module in our framework to leverage diverse data distributions from different participants with attentive aggregations of their models.
    Deep Neural Patchworks: Coping with Large Segmentation Tasks. (arXiv:2206.03210v1 [cs.CV])
    Convolutional neural networks are the way to solve arbitrary image segmentation tasks. However, when images are large, memory demands often exceed the available resources, in particular on a common GPU. Especially in biomedical imaging, where 3D images are common, the problems are apparent. A typical approach to solve this limitation is to break the task into smaller subtasks by dividing images into smaller image patches. Another approach, if applicable, is to look at the 2D image sections separately, and to solve the problem in 2D. Often, the loss of global context makes such approaches less effective; important global information might not be present in the current image patch, or the selected 2D image section. Here, we propose Deep Neural Patchworks (DNP), a segmentation framework that is based on hierarchical and nested stacking of patch-based networks that solves the dilemma between global context and memory limitations.
    FDGNN: Fully Dynamic Graph Neural Network. (arXiv:2206.03469v1 [cs.LG])
    Dynamic Graph Neural Networks recently became more and more important as graphs from many scientific fields, ranging from mathematics, biology, social sciences, and physics to computer science, are dynamic by nature. While temporal changes (dynamics) play an essential role in many real-world applications, most of the models in the literature on Graph Neural Networks (GNN) process static graphs. The few GNN models on dynamic graphs only consider exceptional cases of dynamics, e.g., node attribute-dynamic graphs or structure-dynamic graphs limited to additions or changes to the graph's edges, etc. Therefore, we present a novel Fully Dynamic Graph Neural Network (FDGNN) that can handle fully-dynamic graphs in continuous time. The proposed method provides a node and an edge embedding that includes their activity to address added and deleted nodes or edges, and possible attributes. Furthermore, the embeddings specify Temporal Point Processes for each event to encode the distributions of the structure- and attribute-related incoming graph events. In addition, our model can be updated efficiently by considering single events for local retraining.
    Quantum Neural Network Classifiers: A Tutorial. (arXiv:2206.02806v1 [quant-ph])
    Machine learning has achieved dramatic success over the past decade, with applications ranging from face recognition to natural language processing. Meanwhile, rapid progress has been made in the field of quantum computation including developing both powerful quantum algorithms and advanced quantum devices. The interplay between machine learning and quantum physics holds the intriguing potential for bringing practical applications to the modern society. Here, we focus on quantum neural networks in the form of parameterized quantum circuits. We will mainly discuss different structures and encoding strategies of quantum neural networks for supervised learning tasks, and benchmark their performance utilizing Yao.jl, a quantum simulation package written in Julia Language. The codes are efficient, aiming to provide convenience for beginners in scientific works such as developing powerful variational quantum learning models and assisting the corresponding experimental demonstrations.
    A Contribution-based Device Selection Scheme in Federated Learning. (arXiv:2203.05369v2 [cs.LG] UPDATED)
    In a Federated Learning (FL) setup, a number of devices contribute to the training of a common model. We present a method for selecting the devices that provide updates in order to achieve improved generalization, fast convergence, and better device-level performance. We formulate a min-max optimization problem and decompose it into a primal-dual setup, where the duality gap is used to quantify the device-level performance. Our strategy combines \emph{exploration} of data freshness through a random device selection with \emph{exploitation} through simplified estimates of device contributions. This improves the performance of the trained model both in terms of generalization and personalization. A modified Truncated Monte-Carlo (TMC) method is applied during the exploitation phase to estimate the device's contribution and lower the communication overhead. The experimental results show that the proposed approach has a competitive performance, with lower communication overhead and competitive personalization performance against the baseline schemes.
    Early Abnormal Detection of Sewage Pipe Network: Bagging of Various Abnormal Detection Algorithms. (arXiv:2206.03321v1 [cs.LG])
    Abnormalities in the sewage pipe network affect the normal operation of the whole city, so it is important to detect them early. This paper proposes an early abnormality-detection method. Abnormalities are detected using conventional algorithms, such as the isolation forest algorithm, with two innovations: (1) The current and historical data measured by sensors placed in the sewage pipe network (such as ultrasonic Doppler flowmeters) are taken as the overall dataset, and a conventional anomaly detection method is then applied to this overall dataset to identify anomalous data. An anomaly is a sample that differs from the other samples in the whole dataset. Because an anomaly is defined by the whole dataset rather than by the algorithm, the construction of the whole dataset is the key to early abnormality detection. (2) A bagging strategy over a variety of conventional anomaly detection algorithms is proposed to achieve early detection of anomalies with high precision and recall. The results show that this method can achieve early anomaly detection with a highest precision of 98.21%, a recall of 63.58%, and an F1-score of 0.774.
    Short Blocklength Wiretap Channel Codes via Deep Learning: Design and Performance Evaluation. (arXiv:2206.03477v1 [cs.IT])
    We design short blocklength codes for the Gaussian wiretap channel under information-theoretic security guarantees. Our approach consists in decoupling the reliability and secrecy constraints in our code design. Specifically, we handle the reliability constraint via an autoencoder, and handle the secrecy constraint with hash functions. For blocklengths smaller than or equal to 16, we evaluate through simulations the probability of error at the legitimate receiver and the leakage at the eavesdropper for our code construction. This leakage is defined as the mutual information between the confidential message and the eavesdropper's channel observations, and is empirically measured via a neural network-based mutual information estimator. Our simulation results provide examples of codes with positive secrecy rates that outperform the best known achievable secrecy rates obtained non-constructively for the Gaussian wiretap channel. Additionally, we show that our code design is suitable for the compound and arbitrarily varying Gaussian wiretap channels, for which the channel statistics are not perfectly known but only known to belong to a pre-specified uncertainty set. These models not only capture uncertainty related to channel statistics estimation, but also scenarios where the eavesdropper jams the legitimate transmission or influences its own channel statistics by changing its location.
    An efficient semi-supervised quality control system trained using physics-based MRI-artefact generators and adversarial training. (arXiv:2206.03359v1 [eess.IV])
    Large medical imaging data sets are becoming increasingly available. A common challenge in these data sets is to ensure that each sample meets minimum quality requirements devoid of significant artefacts. Despite a wide range of existing automatic methods having been developed to identify imperfections and artefacts in medical imaging, they mostly rely on data-hungry methods. In particular, the lack of sufficient scans with artefacts available for training has created a barrier in designing and deploying machine learning in clinical research. To tackle this problem, we propose a novel framework having four main components: (1) a set of artefact generators inspired by magnetic resonance physics to corrupt brain MRI scans and augment a training dataset, (2) a set of abstract and engineered features to represent images compactly, (3) a feature selection process that depends on the class of artefact to improve classification performance, and (4) a set of Support Vector Machine (SVM) classifiers trained to identify artefacts. Our novel contributions are threefold: first, we use the novel physics-based artefact generators to generate synthetic brain MRI scans with controlled artefacts as a data augmentation technique. This will avoid the labour-intensive collection and labelling process of scans with rare artefacts. Second, we propose a large pool of abstract and engineered image features developed to identify 9 different artefacts for structural MRI. Finally, we use an artefact-based feature selection block that, for each class of artefacts, finds the set of features that provide the best classification performance. We performed validation experiments on a large data set of scans with artificially-generated artefacts, and in a multiple sclerosis clinical trial where real artefacts were identified by experts, showing that the proposed pipeline outperforms traditional methods.
    Unsupervised Domain Adaptation across FMCW Radar Configurations Using Margin Disparity Discrepancy. (arXiv:2203.04588v2 [eess.SP] UPDATED)
    Commercial radar sensing is gaining relevance, and machine learning algorithms constitute one of the key components that are enabling the spread of this radio technology into areas like surveillance or healthcare. However, radar datasets are still scarce and generalization cannot yet be achieved for all radar systems, environment conditions or design parameters. A certain degree of fine tuning is, therefore, usually required to deploy machine-learning-enabled radar applications. In this work, we consider the problem of unsupervised domain adaptation across radar configurations in the context of deep-learning human activity classification using frequency-modulated continuous-wave (FMCW) radar. For that, we focus on the theory-inspired technique of Margin Disparity Discrepancy, which has already proved successful in the area of computer vision. Our experiments extend this technique to radar data, achieving an accuracy comparable to few-shot supervised approaches for the same classification problem.
    Improving Mini-batch Optimal Transport via Partial Transportation. (arXiv:2108.09645v4 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite its practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative models, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.
    DeepOPF-AL: Augmented Learning for Solving AC-OPF Problems with Multiple Load-Solution Mappings. (arXiv:2206.03365v1 [cs.LG])
    The existence of multiple load-solution mappings of non-convex AC-OPF problems poses a fundamental challenge to deep neural network (DNN) schemes. As the training dataset may contain a mixture of data points corresponding to different load-solution mappings, the DNN can fail to learn a legitimate mapping and generate inferior solutions. We propose DeepOPF-AL as an augmented-learning approach to tackle this issue. The idea is to train a DNN to learn a unique mapping from an augmented input, i.e., (load, initial point), to the solution generated by an iterative OPF solver with the load and initial point as intake. We then apply the learned augmented mapping to solve AC-OPF problems much faster than conventional solvers. Simulation results over IEEE test cases show that DeepOPF-AL achieves noticeably better optimality and similar feasibility and speedup performance, as compared to a recent DNN scheme, with the same DNN size yet elevated training complexity.
    DETR++: Taming Your Multi-Scale Detection Transformer. (arXiv:2206.02977v1 [cs.CV])
    Convolutional Neural Networks (CNN) have dominated the field of detection ever since the success of AlexNet in ImageNet classification [12]. With the sweeping reform of Transformers [27] in natural language processing, Carion et al. [2] introduce the Transformer-based detection method, i.e., DETR. However, due to the quadratic complexity in the self-attention mechanism in the Transformer, DETR is never able to incorporate multi-scale features as performed in existing CNN-based detectors, leading to inferior results in small object detection. To mitigate this issue and further improve performance of DETR, in this work, we investigate different methods to incorporate multi-scale features and find that a Bi-directional Feature Pyramid (BiFPN) works best with DETR in further raising the detection precision. With this discovery, we propose DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction over existing baselines.
    Generalized Data Distribution Iteration. (arXiv:2206.03192v1 [cs.LG])
    To obtain higher sample efficiency and superior final performance simultaneously has been one of the major challenges for deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address them concurrently. In this paper, we try to tackle these two challenges simultaneously. To achieve this, we firstly decouple these challenges into two classic RL problems: data richness and exploration-exploitation trade-off. Then, we cast these two problems into the training data distribution optimization problem, namely to obtain desired training data within limited interactions, and address them concurrently via i) explicit modeling and control of the capacity and diversity of behavior policy and ii) more fine-grained and adaptive control of selective/sampling distribution of the behavior policy using a monotonic data distribution optimization. Finally, we integrate this process into Generalized Policy Iteration (GPI) and obtain a more general framework called Generalized Data Distribution Iteration (GDI). We use the GDI framework to introduce operator-based versions of well-known RL methods from DQN to Agent57. Theoretical guarantee of the superiority of GDI compared with GPI is concluded. We also demonstrate our state-of-the-art (SOTA) performance on Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.33% mean human normalized score (HNS), 1146.39% median HNS and surpassed 22 human world records using only 200M training frames. Our performance is comparable to Agent57's while we consume 500 times less data. We argue that there is still a long way to go before obtaining real superhuman agents in ALE.
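For readers unfamiliar with the metric, the human normalized score is conventionally computed per game as the agent's score rescaled so that a random policy maps to 0% and the human reference maps to 100%. The sketch below uses that conventional definition with invented placeholder scores (not values from the paper) to show why the mean HNS can sit far above the median.

```python
from statistics import mean, median


def human_normalized_score(agent, random_score, human):
    """HNS in percent: 0 for a random policy, 100 for the human reference."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Hypothetical (agent, random, human) raw scores for three games.
raw_scores = [
    (1200.0, 100.0, 1100.0),
    (50.0, 0.0, 100.0),
    (90000.0, 0.0, 1000.0),
]

hns = [human_normalized_score(a, r, h) for a, r, h in raw_scores]
print("per-game HNS:", hns)
print("mean HNS:", mean(hns), "median HNS:", median(hns))
```

A single game with a very large ratio dominates the mean while leaving the median unchanged, which is why both statistics are typically reported together.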
    Machine Learning Sensors. (arXiv:2206.03266v1 [cs.LG])
    Machine learning sensors represent a paradigm shift for the future of embedded machine learning applications. Current instantiations of embedded machine learning (ML) suffer from complex integration, lack of modularity, and privacy and security concerns from data movement. This article proposes a more data-centric paradigm for embedding sensor intelligence on edge devices to combat these challenges. Our vision for "sensor 2.0" entails segregating sensor input data and ML processing from the wider system at the hardware level and providing a thin interface that mimics traditional sensors in functionality. This separation leads to a modular and easy-to-use ML sensor device. We discuss challenges presented by the standard approach of building ML processing into the software stack of the controlling microprocessor on an embedded system and how the modularity of ML sensors alleviates these problems. ML sensors increase privacy and accuracy while making it easier for system builders to integrate ML into their products as a simple component. We provide examples of prospective ML sensors and an illustrative datasheet as a demonstration and hope that this will build a dialogue to progress us towards sensor 2.0.
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v1 [stat.ML])
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.
    Boosting Search Engines with Interactive Agents. (arXiv:2109.00527v3 [cs.CL] UPDATED)
    This paper presents first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.
    On the Convergence of Optimizing Persistent-Homology-Based Losses. (arXiv:2206.02946v1 [cs.LG])
    Topological loss based on persistent homology has shown promise in various applications. A topological loss forces the model to achieve a desired topological property. Despite its empirical success, little is known about the optimization behavior of the loss. In fact, the topological loss involves combinatorial configurations that may oscillate during optimization. In this paper, we introduce a general-purpose regularized topology-aware loss. We propose a novel regularization term and also modify the existing topological loss. These contributions lead to a new loss function that not only forces the model to have the desired topological behavior, but also achieves satisfactory convergence behavior. Our main theoretical result guarantees that the loss can be optimized efficiently, under mild assumptions.
    Deep Learning Models of the Discrete Component of the Galactic Interstellar Gamma-Ray Emission. (arXiv:2206.02819v1 [astro-ph.HE])
    A significant point-like component from the small scale (or discrete) structure in the H2 interstellar gas might be present in the Fermi-LAT data, but modeling this emission relies on observations of rare gas tracers only available in limited regions of the sky. Identifying this contribution is important to discriminate gamma-ray point sources from interstellar gas, and to better characterize extended gamma-ray sources. We design and train convolutional neural networks to predict this emission where observations of these rare tracers do not exist and discuss the impact of this component on the analysis of the Fermi-LAT data. In particular, we evaluate prospects to exploit this methodology in the characterization of the Fermi-LAT Galactic center excess through accurate modeling of point-like structures in the data to help distinguish between a point-like or smooth nature for the excess. We show that deep learning may be effectively employed to model the gamma-ray emission traced by these rare H2 proxies within statistical significance in data-rich regions, supporting prospects to employ these methods in yet unobserved regions.
    How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via $f$-Advantage Regression. (arXiv:2206.03023v1 [cs.LG])
    Offline goal-conditioned reinforcement learning (GCRL) promises general-purpose skill learning in the form of reaching diverse goals from purely offline datasets. We propose $\textbf{Go}$al-conditioned $f$-$\textbf{A}$dvantage $\textbf{R}$egression (GoFAR), a novel regression-based offline GCRL algorithm derived from a state-occupancy matching perspective; the key intuition is that the goal-reaching task can be formulated as a state-occupancy matching problem between a dynamics-abiding imitator agent and an expert agent that directly teleports to the goal. In contrast to prior approaches, GoFAR does not require any hindsight relabeling and enjoys uninterleaved optimization for its value and policy networks. These distinct features confer GoFAR with much better offline performance and stability as well as statistical performance guarantee that is unattainable for prior methods. Furthermore, we demonstrate that GoFAR's training objectives can be re-purposed to learn an agent-independent goal-conditioned planner from purely offline source-domain data, which enables zero-shot transfer to new target domains. Through extensive experiments, we validate GoFAR's effectiveness in various problem settings and tasks, significantly outperforming prior state-of-art. Notably, on a real robotic dexterous manipulation task, while no other method makes meaningful progress, GoFAR acquires complex manipulation behavior that successfully accomplishes diverse goals.
    Simple Contrastive Graph Clustering. (arXiv:2205.07865v2 [cs.LG] UPDATED)
    Contrastive learning has recently attracted plenty of attention in deep graph clustering for its promising performance. However, complicated data augmentations and time-consuming graph convolutional operation undermine the efficiency of these methods. To solve this problem, we propose a Simple Contrastive Graph Clustering (SCGC) algorithm to improve the existing methods from the perspectives of network architecture, data augmentation, and objective function. As to the architecture, our network includes two main parts, i.e., pre-processing and network backbone. A simple low-pass denoising operation conducts neighbor information aggregation as an independent pre-processing, and only two multilayer perceptrons (MLPs) are included as the backbone. For data augmentation, instead of introducing complex operations over graphs, we construct two augmented views of the same vertex by designing parameter un-shared siamese encoders and corrupting the node embeddings directly. Finally, as to the objective function, to further improve the clustering performance, a novel cross-view structural consistency objective function is designed to enhance the discriminative capability of the learned network. Extensive experimental results on seven benchmark datasets validate our proposed algorithm's effectiveness and superiority. Significantly, our algorithm outperforms the recent contrastive deep clustering competitors with at least seven times speedup on average.
    A new Hyper-heuristic based on Adaptive Simulated Annealing and Reinforcement Learning for the Capacitated Electric Vehicle Routing Problem. (arXiv:2206.03185v1 [cs.AI])
    Electric vehicles (EVs) have been adopted in urban areas to reduce environmental pollution and global warming as a result of the increasing number of freight vehicles. However, there are still deficiencies in routing the trajectories of last-mile logistics that continue to impact social and economic sustainability. For that reason, in this paper, a hyper-heuristic (HH) approach called Hyper-heuristic Adaptive Simulated Annealing with Reinforcement Learning (HHASA$_{RL}$) is proposed. It is composed of a multi-armed bandit method and the self-adaptive Simulated Annealing (SA) metaheuristic algorithm for solving the problem called Capacitated Electric Vehicle Routing Problem (CEVRP). Due to the limited number of charging stations and the travel range of EVs, the EVs must require battery recharging moments in advance and reduce travel times and costs. The HH implemented improves multiple minimum best-known solutions and obtains the best mean values for some high-dimensional instances for the proposed benchmark for the IEEE WCCI2020 competition.
    SelfReformer: Self-Refined Network with Transformer for Salient Object Detection. (arXiv:2205.11283v2 [cs.CV] UPDATED)
    The global and local contexts significantly contribute to the integrity of predictions in Salient Object Detection (SOD). Unfortunately, existing methods still struggle to generate complete predictions with fine details. There are two major problems in conventional approaches: first, for global context, high-level CNN-based encoder features cannot effectively catch long-range dependencies, resulting in incomplete predictions. Second, downsampling the ground truth to fit the size of predictions will introduce inaccuracy as the ground truth details are lost during interpolation or pooling. Thus, in this work, we developed a Transformer-based network and framed a supervised task for a branch to learn the global context information explicitly. Besides, we adopt Pixel Shuffle from Super-Resolution (SR) to reshape the predictions back to the size of ground truth instead of the reverse. Thus details in the ground truth are untouched. In addition, we developed a two-stage Context Refinement Module (CRM) to fuse global context and automatically locate and refine the local details in the predictions. The proposed network can guide and correct itself based on the global and local context generated, thus is named, Self-Refined Transformer (SelfReformer). Extensive experiments and evaluation results on five benchmark datasets demonstrate the outstanding performance of the network, and we achieved the state-of-the-art.
    8-bit Numerical Formats for Deep Neural Networks. (arXiv:2206.02915v1 [cs.LG])
    Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point representation, and present an in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference. We explore the effect of different bit-widths for exponents and significands and different exponent biases. The experimental results demonstrate that a suitable choice of these low-precision formats enables faster training and reduced power consumption without any degradation in accuracy for a range of deep learning models for image classification and language processing.
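    To make the role of exponent width, significand width, and exponent bias concrete, the sketch below computes the representable range of an 8-bit float. It assumes an IEEE-754-style layout (one sign bit, the all-ones exponent field reserved for infinities and NaNs, subnormals supported); the function name `fp8_range` is hypothetical and the formats studied in the paper may differ in these conventions.

```python
def fp8_range(exp_bits, man_bits, bias):
    """Return (largest normal, smallest subnormal) magnitude of a mini-float.

    Assumption: IEEE-754-style encoding with 1 sign bit, the all-ones
    exponent field reserved for inf/NaN, and subnormal numbers enabled.
    """
    assert 1 + exp_bits + man_bits == 8, "sign + exponent + significand must be 8 bits"
    # Largest usable exponent field is all-ones minus one.
    max_exp_field = 2 ** exp_bits - 2
    # Largest significand is 1.111...1 in binary, i.e. 2 - 2^-man_bits.
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** (max_exp_field - bias)
    # Smallest subnormal has exponent (1 - bias) and only the last fraction bit set.
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return max_normal, min_subnormal


if __name__ == "__main__":
    for e, m, b in [(5, 2, 15), (4, 3, 7), (4, 3, 11)]:
        hi, lo = fp8_range(e, m, b)
        print(f"1-{e}-{m} bias={b}: max={hi}, min subnormal={lo}")
```

    Under these assumptions, moving a bit from the exponent to the significand shrinks the dynamic range but adds precision, while changing the bias shifts the whole range up or down without changing its width.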
    Tight basis cycle representatives for persistent homology of large data sets. (arXiv:2206.02925v1 [cs.LG])
    Persistent homology (PH) is a popular tool for topological data analysis that has found applications across diverse areas of research. It provides a rigorous method to compute robust topological features in discrete experimental observations that often contain various sources of uncertainties. Although powerful in theory, PH suffers from high computation cost that precludes its application to large data sets. Additionally, most analyses using PH are limited to computing the existence of nontrivial features. Precise localization of these features is not generally attempted because, by definition, localized representations are not unique and because of even higher computation cost. For scientific applications, such a precise location is a sine qua non for determining functional significance. Here, we provide a strategy and algorithms to compute tight representative boundaries around nontrivial robust features in large data sets. To showcase the efficiency of our algorithms and the precision of computed boundaries, we analyze three data sets from different scientific fields. In the human genome, we found an unexpected effect on loops through chromosome 13 and the sex chromosomes, upon impairment of chromatin loop formation. In a distribution of galaxies in the universe, we found statistically significant voids. In protein homologs with significantly different topology, we found voids attributable to ligand-interaction, mutation, and differences between species.
    Risk Measures and Upper Probabilities: Coherence and Stratification. (arXiv:2206.03183v1 [cs.LG])
    Machine learning typically presupposes classical probability theory, which implies that aggregation is built upon expectation. There are now multiple reasons to look at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of such alternatives, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we demonstrate a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackle practical machine learning problems.
    From "Where" to "What": Towards Human-Understandable Explanations through Concept Relevance Propagation. (arXiv:2206.03208v1 [cs.LG])
    The emerging field of eXplainable Artificial Intelligence (XAI) aims to bring transparency to today's powerful but opaque deep learning models. While local XAI methods explain individual predictions in the form of attribution maps, thereby identifying where important features occur (but not providing information about what they represent), global explanation techniques visualize what concepts a model has generally learned to encode. Both types of methods thus only provide partial insights and leave the burden of interpreting the model's reasoning to the user. Only a few contemporary techniques aim at combining the principles behind both local and global XAI for obtaining more informative explanations. Those methods, however, are often limited to specific model architectures or impose additional requirements on training regimes or data and label availability, which renders the post-hoc application to arbitrarily pre-trained models practically impossible. In this work we introduce the Concept Relevance Propagation (CRP) approach, which combines the local and global perspectives of XAI and thus allows answering both the "where" and "what" questions for individual predictions, without additional constraints imposed. We further introduce the principle of Relevance Maximization for finding representative examples of encoded concepts based on their usefulness to the model. We thereby lift the dependency on the common practice of Activation Maximization and its limitations. We demonstrate the capabilities of our methods in various settings, showcasing that Concept Relevance Propagation and Relevance Maximization lead to more human interpretable explanations and provide deep insights into the model's representations and reasoning through concept atlases, concept composition analyses, and quantitative investigations of concept subspaces and their role in fine-grained decision making.
    Distributionally Invariant Learning: Rationalization and Practical Algorithms. (arXiv:2206.02990v1 [cs.LG])
    The invariance property across environments is at the heart of invariant learning methods for the Out-of-Distribution (OOD) Generalization problem. Although intuitively reasonable, strong assumptions on the availability and quality of environments have to be made for the learnability of the strict invariance property. Recently, to relax the requirements for environments empirically, some works propose to learn pseudo-environments for invariant learning. However, it could be misleading when pursuing strict invariance under latent heterogeneity, since the underlying invariance could have been violated during the pseudo-environment learning procedure. To this end, we come up with the distributional invariance property as a relaxed alternative to the strict invariance, which considers the invariance only among sub-populations down to a prescribed scale and allows a certain degree of variation. We reformulate the invariant learning problem under latent heterogeneity into a relaxed form that pursues the distributional invariance, based on which we propose our novel Distributionally Invariant Learning (DIL) framework as well as two implementations named DIL-MMD and DIL-KL. Theoretically, we provide the guarantees for the distributional invariance as well as bounds of the generalization error gap. Extensive experimental results validate the effectiveness of our proposed algorithms.
    Survey on Causal-based Machine Learning Fairness Notions. (arXiv:2010.09553v7 [cs.LG] UPDATED)
    Addressing the problem of fairness is crucial to safely use machine learning algorithms to support decisions with a critical impact on people's lives such as job hiring, child maltreatment, disease diagnosis, loan granting, etc. Several notions of fairness have been defined and examined in the past decade, such as statistical parity and equalized odds. The most recent fairness notions, however, are causal-based and reflect the now widely accepted idea that using causality is necessary to appropriately address the problem of fairness. This paper examines an exhaustive list of causal-based fairness notions and studies their applicability in real-world scenarios. As the majority of causal-based fairness notions are defined in terms of non-observable quantities (e.g., interventions and counterfactuals), their deployment in practice requires computing or estimating those quantities using observational data. This paper offers a comprehensive report of the different approaches to infer causal quantities from observational data including identifiability (Pearl's SCM framework) and estimation (potential outcome framework). The main contributions of this survey paper are (1) a guideline to help select a suitable fairness notion given a specific real-world scenario, and (2) a ranking of the fairness notions according to Pearl's causation ladder indicating how difficult it is to deploy each notion in practice.
    Confounder Analysis in Measuring Representation in Product Funnels. (arXiv:2206.02962v1 [stat.ML])
    This paper discusses an application of Shapley values in the causal inference field, specifically on how to select the top confounder variables for coarsened exact matching method in a scalable way. We use a dataset from an observational experiment involving LinkedIn members as a use case to test its applicability, and show that Shapley values are highly informational and can be leveraged for its robust importance-ranking capability.
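    As a reminder of what a Shapley-value ranking computes, here is a minimal sketch that evaluates exact Shapley values by averaging marginal contributions over all orderings. It is not the paper's scalable procedure: the function name, the toy value function, and the confounder names (`age`, `region`, `tenure`) are hypothetical, and exact enumeration is only feasible for a handful of variables.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a real-valued payoff.
    Cost grows factorially with the number of players.
    """
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in phi.items()}


if __name__ == "__main__":
    # Hypothetical payoff: additive effects plus one interaction term.
    weights = {"age": 0.5, "region": 0.3, "tenure": 0.1}

    def toy_value(coalition):
        base = sum(weights[c] for c in coalition)
        bonus = 0.2 if {"age", "region"} <= coalition else 0.0
        return base + bonus

    scores = shapley_values(list(weights), toy_value)
    # Rank candidate confounders by their Shapley value, largest first.
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(name, round(score, 3))
```

    In this toy setup the interaction bonus is split equally between the two variables that create it, which is the property that makes Shapley values attractive for importance ranking.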
    Driving in Real Life with Inverse Reinforcement Learning. (arXiv:2206.03004v1 [cs.RO])
    In this paper, we introduce the first learning-based planner to drive a car in dense, urban traffic using Inverse Reinforcement Learning (IRL). Our planner, DriveIRL, generates a diverse set of trajectory proposals, filters these trajectories with a lightweight and interpretable safety filter, and then uses a learned model to score each remaining trajectory. The best trajectory is then tracked by the low-level controller of our self-driving vehicle. We train our trajectory scoring model on a 500+ hour real-world dataset of expert driving demonstrations in Las Vegas within the maximum entropy IRL framework. DriveIRL's benefits include: a simple design due to only learning the trajectory scoring function, relatively interpretable features, and strong real-world performance. We validated DriveIRL on the Las Vegas Strip and demonstrated fully autonomous driving in heavy traffic, including scenarios involving cut-ins, abrupt braking by the lead vehicle, and hotel pickup/dropoff zones. Our dataset will be made public to help further research in this area.
    Does Crypto Kill? Relationship between Electricity Consumption Carbon Footprints and Bitcoin Transactions. (arXiv:2206.03227v1 [cs.CY])
    Cryptocurrencies are gaining more popularity due to their security, making counterfeits impossible. However, these digital currencies have been criticized for creating a large carbon footprint due to their algorithmic complexity and decentralized system design for proof of work and mining. We hypothesize that the carbon footprint of cryptocurrency transactions has a higher dependency on carbon-rich fuel sources than green or renewable fuel sources. We provide a machine learning framework to model such transactions and correlate them with the electricity generation patterns to estimate and analyze their carbon cost.
    Machine learning models for determination of weldbead shape parameters for gas metal arc welded T-joints -- A comparative study. (arXiv:2206.02794v1 [cs.LG])
    The shape of a weld bead is critical in assessing the quality of the welded joint. In particular, this has a major impact on the accuracy of the results obtained from a numerical analysis. This study focuses on statistical design techniques and artificial neural networks to predict the weld bead shape parameters of shielded Gas Metal Arc Welded (GMAW) fillet joints. Extensive testing was carried out on low carbon mild steel plates of thicknesses ranging from 3mm to 10mm. Welding voltage, welding current, and moving heat source speed were considered as the welding parameters. Three types of multiple linear regression models (MLR) were created to establish an empirical equation for defining GMAW bead shape parameters considering interactive and higher order terms. Additionally, artificial neural network (ANN) models were created based on a similar scheme, and the relevance of specific features was investigated using SHapley Additive exPlanations (SHAP). The results reveal that the MLR-based approach performs better than the ANN-based models in terms of predictability and error assessment. This study shows the usefulness of the predictive tools to aid numerical analysis of welding.
    Intelligent Circuit Design and Implementation with Machine Learning. (arXiv:2206.03032v1 [cs.LG])
    The stagnation of EDA technologies stems from insufficient knowledge reuse. In practice, very similar simulation or optimization results may need to be repeatedly constructed from scratch. This motivates my research on introducing more 'intelligence' to EDA with machine learning (ML), which explores complex correlations in design flows based on prior data. Besides design time, I also propose ML solutions to boost IC performance by assisting the circuit management at runtime. In this dissertation, I present multiple fast yet accurate ML models covering a wide range of chip design stages from the register-transfer level (RTL) to sign-off, solving primary chip-design problems about power, timing, interconnect, IR drop, routability, and design flow tuning. Targeting the RTL stage, I present APOLLO, a fully automated power modeling framework. It constructs an accurate per-cycle power model by extracting the most power-correlated signals. The model can be further implemented on chip for runtime power management with unprecedented low hardware costs. Targeting the gate-level netlist, I present Net2 for early estimations on post-placement wirelength. It further enables more accurate timing analysis without actual physical design information. Targeting circuit layout, I present RouteNet for early routability prediction. As the first deep learning-based routability estimator, some feature-extraction and model-design principles proposed in it are widely adopted by later works. I also present PowerNet for fast IR drop estimation. It captures spatial and temporal information about power distribution with a customized CNN architecture. Last, besides targeting a single design step, I present FIST to efficiently tune design flow parameters during both logic synthesis and physical design.
    Beyond spectral gap: The role of the topology in decentralized learning. (arXiv:2206.03093v1 [cs.LG])
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
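    The two quantities this abstract contrasts, the spectral gap of the communication graph and the actual averaging behavior, can be illustrated on a ring. The sketch below assumes a ring in which every worker averages uniformly with itself and its two neighbors (weights 1/3); this gossip matrix and the function names are illustrative assumptions, not the paper's setup.

```python
import math


def ring_gossip_step(values):
    """One gossip round on a ring: each worker averages itself and both neighbors."""
    n = len(values)
    return [
        (values[i - 1] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def ring_spectral_gap(n):
    """Spectral gap 1 - |lambda_2| of the uniform ring gossip matrix (n >= 3).

    The matrix is circulant, so its eigenvalues are
    1/3 + (2/3) * cos(2*pi*k/n) for k = 0, ..., n-1.
    """
    eigs = [1 / 3 + 2 / 3 * math.cos(2 * math.pi * k / n) for k in range(1, n)]
    return 1.0 - max(abs(e) for e in eigs)


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, "workers: spectral gap =", round(ring_spectral_gap(n), 4))

    # Workers start with different local estimates and converge to their mean.
    x = [0.0, 0.0, 0.0, 12.0]
    for _ in range(20):
        x = ring_gossip_step(x)
    print("after 20 rounds:", [round(v, 6) for v in x])
```

    The gap shrinks as the ring grows, which is why gap-based theory predicts ever smaller learning rates on large graphs; the abstract argues that this prediction does not match what is observed in practice.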
    Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. (arXiv:2206.03230v1 [stat.ML])
    The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
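    For readers unfamiliar with SW, the sketch below gives a Monte Carlo estimate between two equal-size point clouds: project both clouds onto random directions, solve each one-dimensional problem by sorting, and average. The optional `sample_slice` argument is a hypothetical hook standing in for a non-uniform slice distribution; it does not implement the paper's learned distribution or its PAC-Bayesian bounds.

```python
import math
import random


def uniform_slice(rng, dim):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian."""
    theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(t * t for t in theta))
    return [t / norm for t in theta]


def sliced_wasserstein(xs, ys, n_slices=200, p=2, sample_slice=uniform_slice, seed=0):
    """Monte Carlo estimate of SW_p between two point clouds of equal size.

    `sample_slice(rng, dim)` must return a unit vector; replacing it changes
    the slice distribution with respect to which the distance is defined.
    """
    assert len(xs) == len(ys), "this sketch assumes equal-size samples"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_slices):
        theta = sample_slice(rng, dim)
        # 1D projections; sorting gives the optimal coupling on the line.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in xs)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_slices) ** (1.0 / p)


if __name__ == "__main__":
    rng = random.Random(1)
    cloud = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
    shifted = [[a + 2.0, b] for a, b in cloud]
    print("SW(cloud, cloud)   =", sliced_wasserstein(cloud, cloud))
    print("SW(cloud, shifted) =", sliced_wasserstein(cloud, shifted))
```

    Passing a sampler that concentrates on informative directions is the simplest way to experiment with a non-uniform slice distribution in this sketch.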
    Federated Hetero-Task Learning. (arXiv:2206.03436v1 [cs.LG])
    To investigate the heterogeneity of federated learning in real-world scenarios, we generalize the classical federated learning to federated hetero-task learning, which emphasizes the inconsistency across the participants in federated learning in terms of both data distribution and learning tasks. We also present B-FHTL, a federated hetero-task learning benchmark consisting of a simulation dataset, FL protocols, and a unified evaluation mechanism. The B-FHTL dataset contains three well-designed federated learning tasks with increasing heterogeneity. Each task simulates the clients with different data distributions and learning tasks. To ensure fair comparison among different FL algorithms, B-FHTL builds in a full suite of FL protocols by providing high-level APIs to avoid privacy leakage, and presets the most common evaluation metrics spanning different learning tasks, such as regression, classification, and text generation. Furthermore, we compare the FL algorithms in the fields of federated multi-task learning, federated personalization and federated meta learning within B-FHTL, and highlight the influence of heterogeneity and difficulties of federated hetero-task learning. Our benchmark, including the federated dataset, protocols, the evaluation mechanism and the preliminary experiment, is open-sourced at https://github.com/alibaba/FederatedScope/tree/contest/v1.0.
    Per-Instance Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v2 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute per-instance privacy guarantees for individual examples when running DP-SGD. We use our algorithm to investigate per-instance privacy losses across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bounds. We further discover that the loss and the privacy loss on an example are well-correlated. This implies groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy loss. For example, on CIFAR-10, the average $\epsilon$ of the class with the highest loss (Cat) is 32% higher than that of the class with the lowest loss (Ship). We also run membership inference attacks to show this reflects disparate empirical privacy risks.
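    For context, a single DP-SGD update clips each example's gradient and adds Gaussian noise to the clipped sum. The sketch below shows that standard recipe only; it does not implement the per-instance accounting proposed in the paper, and the function names and parameter values are illustrative assumptions.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add Gaussian noise, average."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        for j in range(dim):
            summed[j] += clipped[j]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * max_norm
    batch = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [0.0, 0.0]
    grads = [[3.0, 4.0], [0.3, 0.4], [-1.0, 0.5]]  # toy per-example gradients
    params = dp_sgd_step(params, grads, lr=0.1, max_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
    print(params)
```

    Because clipping only changes gradients whose norm exceeds the threshold, examples with small gradients influence the update less than the worst case assumes, which is consistent with the abstract's finding that most examples enjoy stronger guarantees than the worst-case bound.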
    PyTSK: A Python Toolbox for TSK Fuzzy Systems. (arXiv:2206.03310v1 [cs.LG])
    This paper presents PyTSK, a Python toolbox for developing Takagi-Sugeno-Kang (TSK) fuzzy systems. Based on scikit-learn and PyTorch, PyTSK allows users to optimize TSK fuzzy systems using fuzzy clustering or mini-batch gradient descent (MBGD) based algorithms. Several state-of-the-art MBGD-based optimization algorithms are implemented in the toolbox, which can improve the generalization performance of TSK fuzzy systems, especially for big data applications. PyTSK can also be easily extended and customized for more complicated algorithms, such as modifying the structure of TSK fuzzy systems, developing more sophisticated training algorithms, and combining TSK fuzzy systems with neural networks. The code of PyTSK can be found at https://github.com/YuqiCui/pytsk.
    Label-Free Explainability for Unsupervised Models. (arXiv:2203.01928v2 [cs.LG] UPDATED)
    Unsupervised black-box models are challenging to interpret. Indeed, most existing explainability methods require labels to select which component(s) of the black-box's output to interpret. In the absence of labels, black-box outputs often are representation vectors whose components do not correspond to any meaningful quantity. Hence, choosing which component(s) to interpret in a label-free unsupervised/self-supervised setting is an important, yet unsolved problem. To bridge this gap in the literature, we introduce two crucial extensions of post-hoc explanation techniques: (1) label-free feature importance and (2) label-free example importance that respectively highlight influential features and training examples for a black-box to construct representations at inference time. We demonstrate that our extensions can be successfully implemented as simple wrappers around many existing feature and example importance methods. We illustrate the utility of our label-free explainability paradigm through a qualitative and quantitative comparison of representation spaces learned by various autoencoders trained on distinct unsupervised tasks.
    Spatial-Temporal Adaptive Graph Convolution with Attention Network for Traffic Forecasting. (arXiv:2206.03128v1 [cs.LG])
    Traffic forecasting is a canonical example of a spatial-temporal learning task in Intelligent Traffic Systems. Existing approaches capture spatial dependency with a pre-determined matrix in graph convolution neural operators. However, the explicit graph structure loses some hidden representations of relationships among nodes. Furthermore, traditional graph convolution neural operators cannot aggregate long-range nodes on the graph. To overcome these limits, we propose a novel network, Spatial-Temporal Adaptive graph convolution with Attention Network (STAAN), for traffic forecasting. First, we adopt an adaptive dependency matrix instead of a pre-defined matrix during GCN processing to infer the inter-dependencies among nodes. Second, we integrate PW-attention, which is based on the graph attention network and designed for global dependency, with GCN as the spatial block. Moreover, a stacked dilated 1D convolution, which is efficient in long-term prediction, is adopted in our temporal block for capturing the different time series. We evaluate STAAN on two real-world datasets, and experiments validate that our model outperforms state-of-the-art baselines.
    Stratified Rule-Aware Network for Abstract Visual Reasoning. (arXiv:2002.06838v3 [cs.CV] UPDATED)
    Abstract reasoning refers to the ability to analyze information, discover rules at an intangible level, and solve problems in innovative ways. The Raven's Progressive Matrices (RPM) test is typically used to examine the capability of abstract reasoning. The subject is asked to identify the correct choice from the answer set to fill the missing panel at the bottom right of the RPM (e.g., a 3$\times$3 matrix), following the underlying rules inside the matrix. Recent studies, taking advantage of Convolutional Neural Networks (CNNs), have achieved encouraging progress on the RPM test. However, they partly ignore necessary inductive biases of an RPM solver, such as order sensitivity within each row/column and incremental rule induction. To address this problem, in this paper we propose a Stratified Rule-Aware Network (SRAN) to generate the rule embeddings for two input sequences. Our SRAN learns rule embeddings at multiple levels of granularity, and incrementally integrates the stratified embedding flows through a gated fusion module. With the help of the embeddings, a rule similarity metric is applied so that SRAN can not only be trained using a tuplet loss but also infer the best answer efficiently. We further point out severe defects in the popular RAVEN dataset for the RPM test, which prevent a fair evaluation of abstract reasoning ability. To fix the defects, we propose an answer set generation algorithm called Attribute Bisection Tree (ABT), forming an improved dataset named Impartial-RAVEN (I-RAVEN for short). Extensive experiments are conducted on both the PGM and I-RAVEN datasets, showing that our SRAN outperforms the state-of-the-art models by a considerable margin.
    Improving Model Understanding and Trust with Counterfactual Explanations of Model Confidence. (arXiv:2206.02790v1 [cs.LG])
    In this paper, we show that counterfactual explanations of confidence scores help users better understand and better trust an AI model's prediction in human-subject studies. Showing confidence scores in human-agent interaction systems can help build trust between humans and AI systems. However, most existing research only used the confidence score as a form of communication, and we still lack ways to explain why the algorithm is confident. This paper also presents two methods for understanding model confidence using counterfactual explanation: (1) based on counterfactual examples; and (2) based on visualisation of the counterfactual space.
    A Bird's-Eye Tutorial of Graph Attention Architectures. (arXiv:2206.02849v1 [cs.LG])
    Graph Neural Networks (GNNs) have made tremendous strides in performance for graph-structured problems, especially in the domains of natural language processing, computer vision and recommender systems. Inspired by the success of the transformer architecture, there has been an ever-growing body of work on attention variants of GNNs attempting to advance the state of the art in many of these problems. Incorporating "attention" into graph mining has been viewed as a way to overcome the noisiness, heterogeneity and complexity associated with graph-structured data as well as to encode soft-inductive bias. It is hence crucial and advantageous to study these variants from a bird's-eye view to assess their strengths and weaknesses. We provide a systematic and focused tutorial centered around attention-based GNNs in the hope of benefiting researchers dealing with graph-structured problems. Our tutorial looks at GNN variants from the point of view of the attention function and iteratively builds the reader's understanding of different graph attention variants.
    On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning. (arXiv:2206.03271v1 [cs.LG])
    Intelligent agents should have the ability to leverage knowledge from previously learned tasks in order to learn new ones quickly and efficiently. Meta-learning approaches have emerged as a popular solution to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have thus far been restricted to simple environments with narrow task distributions. Moreover, the paradigm of pretraining followed by fine-tuning to adapt to new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. This calls into question the benefits of meta-learning approaches in reinforcement learning as well, which typically come at the cost of high complexity. We hence investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks. Our findings show that when meta-learning approaches are evaluated on different tasks (rather than different variations of the same task), multi-task pretraining with fine-tuning on new tasks performs as well as, or better than, meta-pretraining with meta test-time adaptation. This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL. From these findings, we advocate for evaluating future meta-RL methods on more challenging tasks and including multi-task pretraining with fine-tuning as a simple, yet strong baseline.
    SHRED: 3D Shape Region Decomposition with Learned Local Operations. (arXiv:2206.03480v1 [cs.CV])
    We present SHRED, a method for 3D SHape REgion Decomposition. SHRED takes a 3D point cloud as input and uses learned local operations to produce a segmentation that approximates fine-grained part instances. We endow SHRED with three decomposition operations: splitting regions, fixing the boundaries between regions, and merging regions together. Modules are trained independently and locally, allowing SHRED to generate high-quality segmentations for categories not seen during training. We train and evaluate SHRED with fine-grained segmentations from PartNet; using its merge-threshold hyperparameter, we show that SHRED produces segmentations that better respect ground-truth annotations compared with baseline methods, at any desired decomposition granularity. Finally, we demonstrate that SHRED is useful for downstream applications, out-performing all baselines on zero-shot fine-grained part instance segmentation and few-shot fine-grained semantic segmentation when combined with methods that learn to label shape regions.
    Intra-agent speech permits zero-shot task acquisition. (arXiv:2206.03139v1 [cs.LG])
    Human language learners are exposed to a trickle of informative, context-sensitive language, but a flood of raw sensory data. Through both social language use and internal processes of rehearsal and practice, language learners are able to build high-level, semantic representations that explain their perceptions. Here, we take inspiration from such processes of "inner speech" in humans (Vygotsky, 1934) to better understand the role of intra-agent speech in embodied behavior. First, we formally pose intra-agent speech as a semi-supervised problem and develop two algorithms that enable visually grounded captioning with little labeled language data. We then experimentally compute scaling curves over different amounts of labeled data and compare the data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). Taken together, our experiments suggest that modelling intra-agent speech is effective in enabling embodied agents to learn new tasks efficiently and without direct interaction experience.
    Robust Sparse Mean Estimation via Sum of Squares. (arXiv:2206.03441v1 [cs.DS])
    We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
    Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. (arXiv:2202.01661v2 [cs.CY] UPDATED)
    In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.
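    A Rooney-Rule-style lower-bound constraint can be sketched as below. This is an illustrative toy for the simple case of disjoint groups: the candidate tuple layout, the function name, and the example utilities are all assumptions, and the sketch implements neither the paper's bias model nor its intersectional analysis.

```python
def select_with_lower_bounds(candidates, k, lower_bounds):
    """Pick k candidates by observed utility subject to per-group minimums.

    candidates: list of (name, observed_utility, group) tuples, where each
    candidate belongs to exactly one group (an assumption of this sketch).
    lower_bounds: dict mapping group -> minimum number to select.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    chosen = []
    # First satisfy every lower bound with the best candidates of that group.
    for group, bound in lower_bounds.items():
        members = [c for c in ranked if c[2] == group]
        if len(members) < bound:
            raise ValueError(f"not enough candidates in group {group!r}")
        chosen.extend(members[:bound])
    if len(chosen) > k:
        raise ValueError("lower bounds exceed the number of available slots")
    # Then fill the remaining slots with the best candidates overall.
    remaining = [c for c in ranked if c not in chosen]
    chosen.extend(remaining[: k - len(chosen)])
    return sorted(chosen, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical observed utilities; group "B" is assumed to be under-rated.
    pool = [("a1", 10.0, "A"), ("a2", 9.0, "A"), ("a3", 8.0, "A"),
            ("b1", 5.0, "B"), ("b2", 4.0, "B")]
    print(select_with_lower_bounds(pool, k=2, lower_bounds={"B": 1}))
```

    Without the constraint, the top two candidates by observed utility would both come from group "A". With the lower bound, one slot is reserved for the best-ranked candidate of group "B".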
    Distributive Justice as the Foundational Premise of Fair ML: Unification, Extension, and Interpretation of Group Fairness Metrics. (arXiv:2206.02897v1 [cs.CY])
    Group fairness metrics are an established way of assessing the fairness of prediction-based decision-making systems. However, these metrics are still insufficiently linked to philosophical theories, and their moral meaning is often unclear. We propose a general framework for analyzing the fairness of decision systems based on theories of distributive justice, encompassing different established "patterns of justice" that correspond to different normative positions. We show that the most popular group fairness metrics can be interpreted as special cases of our approach. Thus, we provide a unifying and interpretative framework for group fairness metrics that reveals the normative choices associated with each of them and that allows understanding their moral substance. At the same time, we provide an extension of the space of possible fairness metrics beyond the ones currently discussed in the fair ML literature. Our framework also allows overcoming several limitations of group fairness metrics that have been criticized in the literature, most notably (1) that they are parity-based, i.e., that they demand some form of equality between groups, which may sometimes be harmful to marginalized groups, (2) that they only compare decisions across groups, but not the resulting consequences for these groups, and (3) that the full breadth of the distributive justice literature is not sufficiently represented.
    Invertible Sharpening Network for MRI Reconstruction Enhancement. (arXiv:2206.02838v1 [eess.IV])
    High-quality MRI reconstruction plays a critical role in clinical applications. Deep learning-based methods have achieved promising results on MRI reconstruction. However, most state-of-the-art methods were designed to optimize the evaluation metrics commonly used for natural images, such as PSNR and SSIM, whereas visual quality is not primarily pursued. Compared to the fully-sampled images, the reconstructed images are often blurry, and high-frequency features might not be sharp enough for confident clinical diagnosis. To this end, we propose an invertible sharpening network (InvSharpNet) to improve the visual quality of MRI reconstructions. During training, unlike traditional methods that learn to map the input data to the ground truth, InvSharpNet adopts a backward training strategy that learns a blurring transform from the ground truth (fully-sampled image) to the input data (blurry reconstruction). During inference, the learned blurring transform can be inverted to a sharpening transform by leveraging the network's invertibility. Experiments on various MRI datasets demonstrate that InvSharpNet can improve reconstruction sharpness with few artifacts. The results were also evaluated by radiologists, indicating better visual quality and diagnostic confidence for our proposed method.
    Interpolation-based Correlation Reduction Network for Semi-Supervised Graph Learning. (arXiv:2206.02796v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved promising performance in semi-supervised node classification in recent years. However, the problem of insufficient supervision, together with representation collapse, largely limits the performance of GNNs in this field. To alleviate the collapse of node representations in the semi-supervised scenario, we propose a novel graph contrastive learning method, termed Interpolation-based Correlation Reduction Network (ICRN). In our method, we improve the discriminative capability of the latent features by enlarging the margin of decision boundaries and improving the cross-view consistency of the latent representation. Specifically, we first adopt an interpolation-based strategy to conduct data augmentation in the latent space and then force the prediction model to change linearly between samples. Second, we enable the learned network to tell apart samples across two interpolation-perturbed views by forcing the correlation matrix across views to approximate an identity matrix. By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning. Extensive experimental results on six datasets demonstrate the effectiveness and the generality of ICRN compared to the existing state-of-the-art methods.
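    The idea of pushing a cross-view correlation matrix toward the identity can be illustrated with a small sketch. The abstract does not specify the exact form of the matrix or the loss. The sample-wise cosine similarity and the mean squared deviation from the identity used here are assumptions chosen for illustration, as are all function names.

```python
import math


def normalize(v):
    """Return v scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cross_view_correlation(z1, z2):
    """N x N matrix of cosine similarities between samples of two views."""
    a = [normalize(v) for v in z1]
    b = [normalize(v) for v in z2]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def correlation_reduction_loss(z1, z2):
    """Mean squared deviation of the cross-view matrix from the identity.

    The loss is zero when each sample matches itself across views
    (diagonal equal to 1) and is uncorrelated with every other sample.
    """
    s = cross_view_correlation(z1, z2)
    n = len(s)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = 1.0 if i == j else 0.0
            total += (s[i][j] - target) ** 2
    return total / (n * n)


if __name__ == "__main__":
    view1 = [[1.0, 0.0], [0.0, 1.0]]
    view2 = [[2.0, 0.0], [0.0, 3.0]]
    print(correlation_reduction_loss(view1, view2))      # distinct samples
    collapsed = [[1.0, 0.0], [1.0, 0.0]]
    print(correlation_reduction_loss(collapsed, collapsed))  # collapsed embeddings
```

    Collapsed embeddings, where every sample maps to the same vector, yield large off-diagonal entries and therefore a higher loss, which is the behavior such a term is meant to penalize.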
    FedNST: Federated Noisy Student Training for Automatic Speech Recognition. (arXiv:2206.02797v1 [eess.AS])
    Federated Learning (FL) enables training state-of-the-art Automatic Speech Recognition (ASR) models on user devices (clients) in distributed systems, hence preventing transmission of raw user data to a central server. A key challenge facing practical adoption of FL for ASR is obtaining ground-truth labels on the clients. Existing approaches rely on clients to manually transcribe their speech, which is impractical for obtaining large training corpora. A promising alternative is using semi-/self-supervised learning approaches to leverage unlabelled user data. To this end, we propose a new Federated ASR method called FedNST for noisy student training of distributed ASR models with private unlabelled user data. We explore various facets of FedNST, such as training models with different proportions of unlabelled and labelled data, and evaluate the proposed approach on 1173 simulated clients. Evaluating FedNST on LibriSpeech, where 960 hours of speech data is split equally into server (labelled) and client (unlabelled) data, showed a 22.5% relative word error rate reduction (WERR) over a supervised baseline trained only on server data.
    Parametric Chordal Sparsity for SDP-based Neural Network Verification. (arXiv:2206.03482v1 [cs.LG])
    Many future technologies rely on neural networks, but verifying the correctness of their behavior remains a major challenge. It is known that neural networks can be fragile in the presence of even small input perturbations, yielding unpredictable outputs. The verification of neural networks is therefore vital to their adoption, and a number of approaches have been proposed in recent years. In this paper we focus on semidefinite programming (SDP) based techniques for neural network verification, which are particularly attractive because they can encode expressive behaviors while ensuring a polynomial time decision. Our starting point is the DeepSDP framework proposed by Fazlyab et al., which uses quadratic constraints to abstract the verification problem into a large-scale SDP. When the size of the neural network grows, however, solving this SDP quickly becomes intractable. Our key observation is that by leveraging chordal sparsity and specific parametrizations of DeepSDP, we can decompose the primary computational bottleneck of DeepSDP -- a large linear matrix inequality (LMI) -- into an equivalent collection of smaller LMIs. Our parametrization admits a tunable parameter, allowing us to trade off efficiency and accuracy in the verification procedure. We call our formulation Chordal-DeepSDP, and provide experimental evaluation to show that it can: (1) effectively increase accuracy with the tunable parameter and (2) outperform DeepSDP on deeper networks.
    A Justice-Based Framework for the Analysis of Algorithmic Fairness-Utility Trade-Offs. (arXiv:2206.02891v1 [cs.CY])
    In prediction-based decision-making systems, different perspectives can be at odds: The short-term business goals of the decision makers are often in conflict with the decision subjects' wish to be treated fairly. Balancing these two perspectives is a question of values. We provide a framework to make these value-laden choices clearly visible. For this, we assume that we are given a trained model and want to find decision rules that balance the perspective of the decision maker and of the decision subjects. We provide an approach to formalize both perspectives, i.e., to assess the utility of the decision maker and the fairness towards the decision subjects. In both cases, the idea is to elicit values from decision makers and decision subjects that are then turned into something measurable. For the fairness evaluation, we build on the literature on welfare-based fairness and ask what a fair distribution of utility (or welfare) looks like. In this step, we build on well-known theories of distributive justice. This allows us to derive a fairness score that we then compare to the decision maker's utility for many different decision rules. This way, we provide an approach for balancing the utility of the decision maker and the fairness towards the decision subjects for a prediction-based decision-making system.
    Towards Job-Transition-Tag Graph for a Better Job Title Representation Learning. (arXiv:2206.02782v1 [cs.LG])
    Work on learning job title representations is mainly based on the Job-Transition Graph, built from the working history of talents. However, since these records are usually messy, this graph is very sparse, which affects the quality of the learned representation and hinders further analysis. To address this specific issue, we propose to enrich the graph with additional nodes that improve the quality of job title representation. Specifically, we construct the Job-Transition-Tag Graph, a heterogeneous graph containing two types of nodes: job titles and tags (i.e., words related to job responsibilities or functionalities). Along this line, we reformulate job title representation learning as the task of learning node embeddings on the Job-Transition-Tag Graph. Experiments on two datasets show the interest of our approach.
    FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data. (arXiv:2206.02792v1 [cs.LG])
    Algorithmic fairness plays an important role in machine learning and imposing fairness constraints during learning is a common approach. However, many datasets are imbalanced in certain label classes (e.g. "healthy") and sensitive subgroups (e.g. "older patients"). Empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but EO is far from being satisfied on new users. In this paper, we propose a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA). Specifically, FIFA encourages both classification and fairness generalization and can be flexibly combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, FIFA can be directly applied to achieve equalized opportunity (EqOpt); and under certain conditions, it can also be applied to other fairness notions. We demonstrate the power of FIFA by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.
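    To make the equalized odds (EO) criterion concrete, the sketch below measures how far a classifier is from satisfying it on a labelled sample. Summarizing the violation as the larger of the true-positive-rate gap and the false-positive-rate gap between two groups is one common convention and an assumption here. The paper may use a different measure, and all names are hypothetical.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_gap(y_true, y_pred, group):
    """Largest TPR or FPR difference between group 0 and group 1."""
    per_group = {}
    for g in (0, 1):
        idx = [i for i, gi in enumerate(group) if gi == g]
        per_group[g] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    tpr_gap = abs(per_group[0][0] - per_group[1][0])
    fpr_gap = abs(per_group[0][1] - per_group[1][1])
    return max(tpr_gap, fpr_gap)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
    group = [0, 0, 0, 0, 1, 1, 1, 1]
    print(equalized_odds_gap(y_true, y_pred, group))
```

    Computing this gap on the training data and again on held-out data is one way to observe the fairness generalization problem the abstract describes: a small gap on the former does not guarantee a small gap on the latter.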
    Robust Time Series Dissimilarity Measure for Outlier Detection and Periodicity Detection. (arXiv:2206.02956v1 [cs.LG])
    Dynamic time warping (DTW) is an effective dissimilarity measure in many time series applications. Despite its popularity, it is sensitive to noise and outliers, which leads to the singularity problem and bias in the measurement. Moreover, the time complexity of DTW is quadratic in the length of the time series, making it inapplicable in real-time applications. In this paper, we propose a novel time series dissimilarity measure named RobustDTW to reduce the effects of noise and outliers. Specifically, RobustDTW estimates the trend and optimizes the time warp in an alternating manner by utilizing our designed temporal graph trend filtering. To improve efficiency, we propose a multi-level framework that estimates the trend and the warp function at a lower resolution, and then repeatedly refines them at a higher resolution. We further extend RobustDTW to periodicity detection and outlier time series detection. Experiments on real-world datasets demonstrate the superior performance of RobustDTW compared to DTW variants in both outlier time series detection and periodicity detection.
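    The quadratic cost mentioned above comes from the dynamic program behind classical DTW, sketched below for two one-dimensional series. This is the textbook algorithm with an absolute-difference local cost, shown only as background. It is not RobustDTW, and the function name is an assumption.

```python
def dtw_distance(a, b):
    """Classical DTW between two 1-D sequences using |x - y| as local cost.

    cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    Filling the (len(a) + 1) x (len(b) + 1) table is what makes the running
    time quadratic in the sequence length.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the best of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                 cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([0, 0, 1], [0, 1]))     # same shape, different speed
    print(dtw_distance([0, 10, 0], [0, 0, 0])) # a single outlier point
```

    In the second call, the single outlier value must be matched to some point of the other series, so it enters the total cost in full. This illustrates the sensitivity to outliers that motivates a robust variant.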
    DynaMaR: Dynamic Prompt with Mask Token Representation. (arXiv:2206.02982v1 [cs.CL])
    Recent research has shown that large language models pretrained using unsupervised approaches can achieve significant performance improvement on many downstream tasks. Typically when adapting these language models to downstream tasks, like a classification or regression task, we employ a fine-tuning paradigm in which the sentence representation from the language model is input to a task-specific head; the model is then fine-tuned end-to-end. However, with the emergence of models like GPT-3, prompt-based fine-tuning has been proven to be a successful approach for few-shot tasks. Inspired by this work, we study discrete prompt technologies in practice. There are two issues that arise with the standard prompt approach. First, it can overfit on the prompt template. Second, it requires manual effort to formulate the downstream task as a language model problem. In this paper, we propose an improvement to prompt-based fine-tuning that addresses these two issues. We refer to our approach as DynaMaR -- Dynamic Prompt with Mask Token Representation. Results show that DynaMaR can achieve an average improvement of 10% in few-shot settings and improvement of 3.7% in data-rich settings over the standard fine-tuning approach on four e-commerce applications.
    Collaborative Intelligence Orchestration: Inconsistency-Based Fusion of Semi-Supervised Learning and Active Learning. (arXiv:2206.03288v1 [cs.LG])
    Annotating decent amounts of data to satisfy sophisticated learning models can be cost-prohibitive for many real-world applications. Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means to alleviate the data-hungry problem. Some recent studies explored the potential of combining AL and SSL to better probe the unlabeled data. However, almost all these contemporary SSL-AL works use a simple combination strategy, ignoring the inherent relation between SSL and AL. Further, other methods suffer from high computational costs when dealing with large-scale, high-dimensional datasets. Motivated by the industry practice of labeling data, we propose an innovative Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) algorithm to further investigate SSL-AL's potential superiority and achieve mutual enhancement of AL and SSL, i.e., SSL propagates label information to unlabeled samples and provides smoothed embeddings for AL, while AL excludes samples with inconsistent predictions and considerable uncertainty for SSL. We estimate unlabeled samples' inconsistency by augmentation strategies of different granularities, including fine-grained continuous perturbation exploration and coarse-grained data transformations. Extensive experiments, in both text and image domains, validate the effectiveness of the proposed algorithm, comparing it against state-of-the-art baselines. Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.
    Boundary informed inverse PDE problems on discrete Riemann surfaces. (arXiv:2206.02911v1 [math.NA])
    We employ neural networks to tackle inverse partial differential equations on discretized Riemann surfaces with boundary. To this end, we introduce the concept of a graph with boundary which models these surfaces in a natural way. Our method uses a message passing technique to keep track of an unknown differential operator while using neural ODE solvers through the method of lines to capture the evolution in time. As training data, we use noisy and incomplete observations of sheaves on graphs at various timestamps. The novelty of this approach is in working with manifolds with nontrivial topology and utilizing the data on the graph boundary through a teacher forcing technique. Despite the increasing interest in learning dynamical systems from finite observations, many current methods are limited in two general ways: first, they work with topologically trivial spaces, and second, they fail to handle the boundary data on the ground space in a systematic way. The present work is an attempt at addressing these limitations. We run experiments with synthetic data of linear and nonlinear diffusion systems on orientable surfaces with positive genus and boundary, and moreover, provide evidence of improvements upon the existing paradigms.
    Fooling Explanations in Text Classifiers. (arXiv:2206.03178v1 [cs.LG])
    State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate the attribution robustness estimation performance of TEF on five sequence classification datasets, utilizing three DNN architectures and three transformer architectures for each dataset. TEF can significantly decrease the correlation between unchanged and perturbed input attributions, which shows that all models and explanation methods are susceptible to TEF perturbations. Moreover, we evaluate how the perturbations transfer to other model architectures and attribution methods, and show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown. Finally, we introduce a semi-universal attack that is able to compute fast, computationally light perturbations with no knowledge of the attacked classifier or explanation method. Overall, our work shows that explanations in text classifiers are very fragile and users need to carefully address their robustness before relying on them in critical applications.
    Impossibility of Collective Intelligence. (arXiv:2206.02786v1 [cs.LG])
    Democratization of AI involves training and deploying machine learning models across heterogeneous and potentially massive environments. Diversity of data opens up a number of possibilities to advance AI systems, but also introduces pressing concerns such as privacy, security, and equity that require special attention. This work shows that it is theoretically impossible to design a rational learning algorithm that has the ability to successfully learn across heterogeneous environments, which we decoratively call collective intelligence (CI). By representing learning algorithms as choice correspondences over a hypothesis space, we are able to axiomatize them with essential properties. Unfortunately, the only feasible algorithm compatible with all of the axioms is the standard empirical risk minimization (ERM) which learns arbitrarily from a single environment. Our impossibility result reveals informational incomparability between environments as one of the foremost obstacles for researchers who design novel algorithms that learn from multiple environments, which sheds light on prerequisites for success in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multi-modal learning.
    Graph Rationalization with Environment-based Augmentations. (arXiv:2206.02886v1 [cs.LG])
    Rationale is defined as a subset of input features that best explains or supports the prediction by machine learning models. Rationale identification has improved the generalizability and interpretability of neural networks on vision and language data. In graph applications such as molecule and polymer property prediction, identifying representative subgraph structures named as graph rationales plays an essential role in the performance of graph neural networks. Existing graph pooling and/or distribution intervention methods suffer from lack of examples to learn to identify optimal graph rationales. In this work, we introduce a new augmentation operation called environment replacement that automatically creates virtual data examples to improve rationale identification. We propose an efficient framework that performs rationale-environment separation and representation learning on the real and augmented examples in latent spaces to avoid the high complexity of explicit graph decoding and encoding. Comparing against recent techniques, experiments on seven molecular and four polymer real datasets demonstrate the effectiveness and efficiency of the proposed augmentation-based graph rationalization framework.
    Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds. (arXiv:2206.02834v1 [cs.LG])
    We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\alpha+ 1/\sqrt{M}\right) \sqrt{dT}$ regret for each good agent; here, $d$ is the model-dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.
    Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data. (arXiv:2206.02909v1 [eess.SP])
    Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
    A Simple and Optimal Policy Design for Online Learning with Safety against Heavy-tailed Risk. (arXiv:2206.02969v1 [stat.ML])
    We design simple and optimal policies that ensure safety against heavy-tailed risk in the classical multi-armed bandit problem. We start by showing that some widely used policies such as the standard Upper Confidence Bound policy and the Thompson Sampling policy incur heavy-tailed risk; that is, the worst-case probability of incurring a linear regret slowly decays at a polynomial rate of $1/T$, where $T$ is the time horizon. We further show that this heavy-tailed risk exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, for the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an exponential rate $\exp(-\Omega(\sqrt{T}))$. We further prove that this exponential decaying rate of the tail probability is optimal across all policies that have worst-case optimality for the expected regret. Finally, we improve the policy design and analysis to the general $K$-armed bandit setting. We provide a detailed characterization of the tail probability bound for any regret threshold under our policy design. Namely, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal insights on the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk are compatible.
    Universal Speech Enhancement with Score-based Diffusion. (arXiv:2206.03065v1 [cs.SD])
    Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
    Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification. (arXiv:2206.03345v1 [math.OC])
    We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
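    For reference, the plain gradient descent iteration on $f(X)=\phi(XX^{T})$ can be written out with the chain rule. The step size $\alpha$ and the iterate index $k$ are notation assumed here rather than taken from the abstract. The abstract does not specify the form of the proposed preconditioner, so it is not reproduced.

```latex
\begin{aligned}
\nabla f(X) &= \bigl(\nabla\phi(XX^{T}) + \nabla\phi(XX^{T})^{T}\bigr)X, \\
X_{k+1} &= X_{k} - \alpha\,\nabla f(X_{k}).
\end{aligned}
```

    When $\nabla\phi$ is symmetric, the gradient reduces to $2\nabla\phi(XX^{T})X$.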
    Better Best of Both Worlds Bounds for Bandits with Switching Costs. (arXiv:2206.03098v1 [cs.LG])
    We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021. We introduce a surprisingly simple and effective algorithm that simultaneously achieves a minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting and a bound of $\mathcal{O}(\min\{\log (T)/\Delta^2,T^{2/3}\})$ in the stochastically-constrained regime, both with (unit) switching costs, where $\Delta$ is the gap between the arms. In the stochastically-constrained case, our bound improves over previous results due to Rouyer et al., who achieved regret of $\mathcal{O}(T^{1/3}/\Delta)$. We accompany our results with a lower bound showing that, in general, $\tilde{\Omega}(\min\{1/\Delta^2,T^{2/3}\})$ regret is unavoidable in the stochastically-constrained case for algorithms with $\mathcal{O}(T^{2/3})$ worst-case regret.
    Conditional Seq2Seq model for the time-dependent two-level system. (arXiv:2206.02889v1 [quant-ph])
    We apply the deep learning neural network architecture to the two-level system in quantum optics to solve the time-dependent Schrodinger equation. By carefully designing the network structure and tuning parameters, above 90 percent accuracy in super long-term predictions can be achieved in the case of random electric fields, which indicates a promising new method to solve the time-dependent equation for two-level systems. By slightly modifying this network, we think that this method can solve the two- or three-dimensional time-dependent Schrodinger equation more efficiently than traditional approaches.
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v1 [stat.ML])
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimator from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that the greedy policy achieves, after $T$ rounds, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. We finally show with a proof of concept simulation that our policy achieves sub-linear fair pseudo-regret also in practice.
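    The relative-rank step described above can be illustrated with a minimal sketch. It assumes the reward estimates are already available (for example from the ridge regression estimator the abstract mentions, which is not implemented here). The function names, the `history` layout, and the toy numbers are hypothetical, and this is not the authors' implementation.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F(x) = fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def select_by_relative_rank(candidates, history):
    """Pick the candidate with the highest estimated relative rank.

    candidates: list of (group, estimated_reward) pairs for this round.
    history: dict mapping group -> list of past estimated rewards
             (hypothetical layout, assumed non-empty for each group).
    """
    cdfs = {group: empirical_cdf(values) for group, values in history.items()}
    # Relative rank: where the candidate's reward falls within its own group.
    ranks = [cdfs[group](reward) for group, reward in candidates]
    best = max(range(len(candidates)), key=lambda i: ranks[i])
    return best, ranks


# Toy data: group B has systematically higher raw rewards than group A.
history = {"A": [0.2, 0.4, 0.6, 0.8], "B": [1.0, 2.0, 3.0, 4.0]}
candidates = [("A", 0.7), ("B", 2.5)]
best, ranks = select_by_relative_rank(candidates, history)
```

    In this toy round the candidate from group A is selected even though its raw reward estimate is lower, because it ranks higher within its own group.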
    Neuro-Nav: A Library for Neurally-Plausible Reinforcement Learning. (arXiv:2206.03312v1 [cs.NE])
    In this work we propose Neuro-Nav, an open-source library for neurally plausible reinforcement learning (RL). RL is among the most common modeling frameworks for studying decision making, learning, and navigation in biological organisms. In utilizing RL, cognitive scientists often handcraft environments and agents to meet the needs of their particular studies. On the other hand, artificial intelligence researchers often struggle to find benchmarks for neurally and biologically plausible representation and behavior (e.g., in decision making or navigation). In order to streamline this process across both fields with transparency and reproducibility, Neuro-Nav offers a set of standardized environments and RL algorithms drawn from canonical behavioral and neural studies in rodents and humans. We demonstrate that the toolkit replicates relevant findings from a number of studies across both cognitive science and RL literatures. We furthermore describe ways in which the library can be extended with novel algorithms (including deep RL) and environments to address future research needs of the field.
    Group privacy for personalized federated learning. (arXiv:2206.03396v1 [cs.LG])
    Federated learning is a type of collaborative machine learning, where participating clients process their data locally, sharing only updates to the collaborative model. This enables building privacy-aware distributed machine learning models, among others. The goal is the optimization of a statistical model's parameters by minimizing a cost function of a collection of datasets which are stored locally by a set of clients. This process exposes the clients to two issues: leakage of private information and lack of personalization of the model. On the other hand, with the recent advancements in techniques to analyze data, there is a surge of concern about privacy violations of the participating clients. To mitigate this, differential privacy and its variants serve as a standard for providing formal privacy guarantees. Often the clients represent very heterogeneous communities and hold data which are very diverse. Therefore, aligned with the recent focus of the FL community on building a framework of personalized models for the users representing their diversity, it is also of utmost importance to protect against potential threats to the sensitive and personal information of the clients. $d$-privacy, which is a generalization of geo-indistinguishability, the lately popularized paradigm of location privacy, uses a metric-based obfuscation technique that preserves the spatial distribution of the original data. To address the issue of protecting the privacy of the clients and allowing for personalized model training to enhance the fairness and utility of the system, we propose a method to provide group privacy guarantees exploiting some key properties of $d$-privacy which enables personalized models under the framework of FL. We provide theoretical justification for its applicability and experimental validation on real-world datasets to illustrate the working of the proposed method.
    Imitating Past Successes can be Very Suboptimal. (arXiv:2206.03378v1 [cs.LG])
    Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we prove that existing outcome-conditioned imitation learning methods do not necessarily improve the policy; rather, in some settings they can decrease the expected reward. Nonetheless, we show that a simple modification results in a method that does guarantee policy improvement, under some assumptions. Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
    Efficient decentralized multi-agent learning in asymmetric queuing systems. (arXiv:2206.03324v1 [cs.LG])
    We study decentralized multi-agent learning in bipartite queuing systems, a standard model for service systems. In particular, $N$ agents request service from $K$ servers in a fully decentralized way, i.e., by running the same algorithm without communication. Previous decentralized algorithms are restricted to symmetric systems, have performance that degrades exponentially in the number of servers, require communication through shared randomness and unique agent identities, and are computationally demanding. In contrast, we provide a simple learning algorithm that, when run decentrally by each agent, leads the queuing system to have efficient performance in general asymmetric bipartite queuing systems while also having additional robustness properties. Along the way, we provide the first UCB-based algorithm for the centralized case of the problem, which resolves an open question by Krishnasamy et al. (2016, 2021).
    Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection. (arXiv:2206.03265v1 [cs.CR])
    Data augmentation has been rare in the cyber security domain due to technical difficulties in altering data in a manner that is semantically consistent with the original data. This shortfall is particularly onerous given the unique difficulty of acquiring benign and malicious training data that runs into copyright restrictions, and that institutions like banks and governments receive targeted malware that will never exist in large quantities. We present MARVOLO, a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors. MARVOLO employs semantics-preserving code transformations that mimic the alterations that malware authors and defensive benign developers routinely make in practice, allowing us to generate meaningful augmented data. Crucially, semantics-preserving transformations also enable MARVOLO to safely propagate labels from original to newly-generated data samples without mandating expensive reverse engineering of binaries. Further, MARVOLO embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget. Experiments using wide-ranging commercial malware datasets and a recent ML-driven malware detector show that MARVOLO boosts accuracies by up to 5%, while operating on only a small fraction (15%) of the potential input binaries.
    Generalization Error Bounds for Deep Neural Networks Trained by SGD. (arXiv:2206.03299v1 [cs.LG])
    Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived by combining a dynamical control of an appropriate parameter norm and the Rademacher complexity estimate based on parameter norms. The bounds explicitly depend on the loss along the training trajectory, and work for a wide range of network architectures including multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). Compared with other algorithm-dependent generalization estimates such as uniform stability-based bounds, our bounds do not require $L$-smoothness of the nonconvex loss function, and apply directly to SGD instead of stochastic gradient Langevin dynamics (SGLD). Numerical results show that our bounds are non-vacuous and robust to changes of optimizer and network hyperparameters.
    AS2T: Arbitrary Source-To-Target Adversarial Attack on Speaker Recognition Systems. (arXiv:2206.03351v1 [cs.SD])
    Recent work has illuminated the vulnerability of speaker recognition systems (SRSs) to adversarial attacks, raising significant security concerns in deploying SRSs. However, prior studies considered only a few settings (e.g., some combinations of source and target speakers), leaving many interesting and important settings in real-world attack scenarios unexplored. In this work, we present AS2T, the first attack in this domain which covers all the settings, thus allowing the adversary to craft adversarial voices using arbitrary source and target speakers for any of three main recognition tasks. Since none of the existing loss functions can be applied to all the settings, we explore many candidate loss functions for each setting including the existing and newly designed ones. We thoroughly evaluate their efficacy and find that some existing loss functions are suboptimal. Then, to improve the robustness of AS2T towards practical over-the-air attacks, we study the possible distortions that occur in over-the-air transmission, utilize different transformation functions with different parameters to model those distortions, and incorporate them into the generation of adversarial voices. Our simulated over-the-air evaluation validates the effectiveness of our solution in producing robust adversarial voices which remain effective under various hardware devices and various acoustic environments with different reverberation, ambient noises, and noise levels. Finally, we leverage AS2T to perform thus far the largest-scale evaluation to understand transferability among 14 diverse SRSs. The transferability analysis provides many interesting and useful insights which challenge several findings and conclusions drawn in previous works in the image domain. Our study also sheds light on future directions of adversarial attacks in the speaker recognition domain.
    Decentralized Low-Latency Collaborative Inference via Ensembles on the Edge. (arXiv:2206.03165v1 [cs.LG])
    The success of deep neural networks (DNNs) is heavily dependent on computational resources. While DNNs are often employed on cloud servers, there is a growing need to operate DNNs on edge devices. Edge devices are typically limited in their computational resources, yet multiple edge devices are often deployed in the same environment and can reliably communicate with each other. In this work we propose to facilitate the application of DNNs on the edge by allowing multiple users to collaborate during inference to improve their accuracy. Our mechanism, coined edge ensembles, is based on having diverse predictors at each device, which form an ensemble of models during inference. To mitigate the communication overhead, the users share quantized features, and we propose a method for aggregating multiple decisions into a single inference rule. We analyze the latency induced by edge ensembles, showing that its performance improvement comes at the cost of a minor additional delay under common assumptions on the communication network. Our experiments demonstrate that collaborative inference via edge ensembles equipped with compact DNNs substantially improves the accuracy over having each user infer locally, and can outperform using a single centralized DNN larger than all the networks in the ensemble together.
    Subject Membership Inference Attacks in Federated Learning. (arXiv:2206.03317v1 [cs.LG])
    Privacy in Federated Learning (FL) is studied at two different granularities: item-level, which protects individual data points, and user-level, which protects each user (participant) in the federation. Nearly all of the private FL literature is dedicated to studying privacy attacks and defenses at these two granularities. Recently, subject-level privacy has emerged as an alternative privacy granularity to protect the privacy of individuals (data subjects) whose data is spread across multiple (organizational) users in cross-silo FL settings. An adversary might be interested in recovering private information about these individuals (a.k.a. data subjects) by attacking the trained model. A systematic study of these patterns requires complete control over the federation, which is impossible with real-world datasets. We design a simulator for generating various synthetic federation configurations, enabling us to study how properties of the data, model design and training, and the federation itself impact subject privacy risk. We propose three attacks for subject membership inference and examine the interplay between all factors within a federation that affect the attacks' efficacy. We also investigate the effectiveness of Differential Privacy in mitigating this threat. Our takeaways generalize to real-world datasets like FEMNIST, giving credence to our findings.
    GRETEL: A unified framework for Graph Counterfactual Explanation Evaluation. (arXiv:2206.02957v1 [cs.LG])
    Machine Learning (ML) systems are a building block of the modern tools which impact our daily life in several application domains. Due to their black-box nature, those systems are rarely adopted in application domains (e.g. health, finance) where understanding the decision process is of paramount importance. Explanation methods were developed to explain how the ML model has taken a specific decision for a given case/instance. Graph Counterfactual Explanations (GCE) is one of the explanation techniques adopted in the Graph Learning domain. The existing works on Graph Counterfactual Explanations diverge mostly in the problem definition, application domain, test data, and evaluation metrics, and most existing works do not compare exhaustively against other counterfactual explanation techniques present in the literature. We present GRETEL, a unified framework to develop and test GCE methods in several settings. GRETEL is a highly extensible evaluation framework which promotes Open Science and the reproducibility of evaluations by providing a set of well-defined mechanisms to easily integrate and manage both real and synthetic datasets, ML models, state-of-the-art explanation techniques, and evaluation measures. To present GRETEL, we show the experiments conducted to integrate and test several synthetic and real datasets with several existing explanation techniques and base ML models.
    Recent Advances in Bayesian Optimization. (arXiv:2206.03301v1 [cs.LG])
    Bayesian optimization has emerged at the forefront of expensive black-box optimization due to its data efficiency. Recent years have witnessed a proliferation of studies on the development of new Bayesian optimization algorithms and their applications. Hence, this paper attempts to provide a comprehensive and updated survey of recent advances in Bayesian optimization and identify interesting open problems. We categorize the existing work on Bayesian optimization into nine main groups according to the motivations and focus of the proposed algorithms. For each category, we present the main advances with respect to the construction of surrogate models and adaptation of the acquisition functions. Finally, we discuss the open questions and suggest promising future research directions, in particular with regard to heterogeneity, privacy preservation, and fairness in distributed and federated optimization systems.
    Self-Knowledge Distillation based Self-Supervised Learning for Covid-19 Detection from Chest X-Ray Images. (arXiv:2206.03009v1 [eess.IV])
    The global outbreak of coronavirus disease 2019 (COVID-19) has overloaded healthcare systems worldwide. Computer-aided diagnosis for fast COVID-19 detection and patient triage is becoming critical. This paper proposes a novel self-knowledge distillation based self-supervised learning method for COVID-19 detection from chest X-ray images. Our method can use self-knowledge of images based on similarities of their visual features for self-supervised learning. Experimental results show that our method achieved an HM score of 0.988, an AUC of 0.999, and an accuracy of 0.957 on the largest open COVID-19 chest X-ray dataset.
    Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics. (arXiv:2206.02972v1 [stat.ML])
    Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity, or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time-series data as a sparse combination of simpler, more interpretable components. The decomposed nature of the dynamics generalizes over previous switched approaches and enables modeling of overlapping and non-stationary drifts in the dynamics. We further present a dictionary learning-driven approach to model fitting, where we leverage recent results in tracking sparse vectors over time. We demonstrate that our model can learn efficient representations and smooth transitions between dynamical modes in both continuous-time and discrete-time examples. We show results on low-dimensional linear and nonlinear attractors to demonstrate that our decomposed dynamical systems model can well approximate nonlinear dynamics. Additionally, we apply our model to C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.
    Recall Distortion in Neural Network Pruning and the Undecayed Pruning Algorithm. (arXiv:2206.02976v1 [cs.LG])
    Pruning techniques have been successfully used in neural networks to trade accuracy for sparsity. However, the impact of network pruning is not uniform: prior work has shown that the recall for underrepresented classes in a dataset may be more negatively affected. In this work, we study such relative distortions in recall by hypothesizing an intensification effect that is inherent to the model. Namely, pruning makes recall relatively worse for a class with recall below accuracy and, conversely, makes recall relatively better for a class with recall above accuracy. In addition, we propose a new pruning algorithm aimed at attenuating this effect. Through statistical analysis, we have observed that intensification is less severe with our algorithm but nevertheless more pronounced with relatively more difficult tasks, less complex models, and higher pruning ratios. More surprisingly, we conversely observe a de-intensification effect with lower pruning ratios.
    An Empirical Study of IoT Security Aspects at Sentence-Level in Developer Textual Discussions. (arXiv:2206.03079v1 [cs.CR])
    IoT is a rapidly emerging paradigm that now encompasses almost every aspect of our modern life. As such, ensuring the security of IoT devices is crucial. IoT devices can differ from traditional computing, so the design and implementation of proper security measures can be challenging in IoT devices. We observed that IoT developers discuss their security-related challenges in developer forums like Stack Overflow (SO). However, we find that IoT security discussions can also be buried inside non-security discussions in SO. In this paper, we aim to understand the challenges IoT developers face while applying security practices and techniques to IoT devices. We have two goals: (1) develop a model that can automatically find security-related IoT discussions in SO, and (2) study the model output to learn about IoT developers' security-related challenges. First, we download 53K posts from SO that contain discussions about IoT. Second, we manually label 5,919 sentences from the 53K posts as 1 or 0. Third, we use this benchmark to investigate a suite of deep learning transformer models. The best performing model is called SecBot. Fourth, we apply SecBot to all the posts and find around 30K security-related sentences. Fifth, we apply topic modeling to the security-related sentences. Then we label and categorize the topics. Sixth, we analyze the evolution of the topics in SO. We found that (1) SecBot, which is based on retraining the deep learning model RoBERTa, offers the best F1-Score of 0.935, (2) there are six error categories in the samples misclassified by SecBot, which was mostly wrong when the keywords/contexts were ambiguous (e.g., gateway can be a security gateway or a simple gateway), (3) there are 9 security topics grouped into three categories: Software, Hardware, and Network, and (4) the highest number of topics belongs to software security, followed by network security.
    Histogram Estimation under User-level Privacy with Heterogeneous Data. (arXiv:2206.03008v1 [cs.LG])
    We study the problem of histogram estimation under user-level differential privacy, where the goal is to preserve the privacy of all entries of any single user. While there is abundant literature on this classical problem under the item-level privacy setup where each user contributes only one data point, little has been known for the user-level counterpart. We consider the heterogeneous scenario where both the quantity and distribution of data can be different for each user. We propose an algorithm based on a clipping strategy that almost achieves a two-approximation with respect to the best clipping threshold in hindsight. This result holds without any distribution assumptions on the data. We also prove that the clipping bias can be significantly reduced when the counts are from non-i.i.d. Poisson distributions and show empirically that our debiasing method provides improvements even without such constraints. Experiments on both real and synthetic datasets verify our theoretical findings and demonstrate the effectiveness of our algorithms.
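    The clipping idea in the abstract above can be illustrated with a generic sketch. This is not the paper's algorithm (which chooses the clipping threshold adaptively and adds debiasing); it is the textbook combination of per-user L1 clipping with the Laplace mechanism, assuming add/remove-one-user neighboring datasets. The names `clip_user`, `private_histogram`, `tau`, and `epsilon` are hypothetical.

```python
import random

def clip_user(counts, tau):
    """Scale one user's per-bin counts so their L1 norm is at most tau."""
    total = sum(counts.values())
    if total <= tau:
        return dict(counts)
    factor = tau / total
    return {b: c * factor for b, c in counts.items()}

def private_histogram(users, bins, tau, epsilon, rng=None):
    """Sum clipped user histograms and add Laplace(tau / epsilon) noise per bin.

    Assumption: every bin a user contributes to appears in `bins`.
    After clipping, adding or removing one user changes the summed
    histogram by at most tau in L1 norm, so this noise scale gives
    epsilon-DP at the user level under that neighboring relation.
    """
    rng = rng or random.Random()
    hist = {b: 0.0 for b in bins}
    for counts in users:
        for b, c in clip_user(counts, tau).items():
            hist[b] += c
    scale = tau / epsilon
    noisy = {}
    for b, v in hist.items():
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[b] = v + noise
    return noisy
```

    A small `tau` lowers the noise but biases heavy contributors downward, while a large `tau` does the opposite; that is the trade-off a good threshold has to balance.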
    RORL: Robust Offline Reinforcement Learning via Conservative Smoothing. (arXiv:2206.02829v1 [cs.LG])
    Offline reinforcement learning (RL) provides a promising direction to exploit the massive amount of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative for value estimation and action selection. However, such conservatism impairs the robustness of learned policies, leading to a significant change even for a small perturbation on observations. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset and additional conservative value estimation on these out-of-distribution (OOD) states. Theoretically, we show RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL can achieve the state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbation.
    CitySpec: An Intelligent Assistant System for Requirement Specification in Smart Cities. (arXiv:2206.03132v1 [cs.AI])
    An increasing number of monitoring systems have been developed in smart cities to ensure that real-time operations of a city satisfy safety and performance requirements. However, many existing city requirements are written in English with missing, inaccurate, or ambiguous information. There is a high demand for assisting city policy makers in converting human-specified requirements to machine-understandable formal specifications for monitoring systems. To tackle this limitation, we build CitySpec, the first intelligent assistant system for requirement specification in smart cities. To create CitySpec, we first collect over 1,500 real-world city requirements across different domains from over 100 cities and extract city-specific knowledge to generate a dataset of city vocabulary with 3,061 words. We also build a translation model and enhance it through requirement synthesis and develop a novel online learning framework with validation under uncertainty. The evaluation results on real-world city requirements show that CitySpec increases the sentence-level accuracy of requirement specification from 59.02% to 86.64%, and has strong adaptability to a new city and a new domain (e.g., F1 score for requirements in Seattle increases from 77.6% to 93.75% with online learning).
    Instance-Dependent Label-Noise Learning with Manifold-Regularized Transition Matrix Estimation. (arXiv:2206.02791v1 [cs.LG])
    In label-noise learning, estimating the transition matrix has attracted more and more attention as the matrix plays an important role in building statistically consistent classifiers. However, it is very challenging to estimate the transition matrix T(x), where x denotes the instance, because it is unidentifiable under instance-dependent noise (IDN). To address this problem, we note that there is psychological and physiological evidence showing that we humans are more likely to annotate instances of similar appearances to the same classes, and thus poor-quality or ambiguous instances of similar appearances are more easily mislabeled to the correlated or same noisy classes. Therefore, we propose an assumption on the geometry of T(x) that "the closer two instances are, the more similar their corresponding transition matrices should be". More specifically, we formulate the above assumption into the manifold embedding, to effectively reduce the degree of freedom of T(x) and make it stably estimable in practice. The proposed manifold-regularized technique works by directly reducing the estimation error without hurting the approximation error about the estimation problem of T(x). Experimental evaluations on four synthetic and two real-world datasets demonstrate that our method is superior to state-of-the-art approaches for label-noise learning under the challenging IDN.
    Predicting Electricity Infrastructure Induced Wildfire Risk in California. (arXiv:2206.02930v1 [eess.SY])
    This paper examines the use of risk models to predict the timing and location of wildfires caused by electricity infrastructure. Our data include historical ignition and wire-down points triggered by grid infrastructure collected between 2015 and 2019 in Pacific Gas & Electric territory along with various weather, vegetation, and very high resolution data on grid infrastructure including location, age, and materials. With these data we explore a range of machine learning methods and strategies to manage training data imbalance. The best area under the receiver operating characteristic we obtain is 0.776 for distribution feeder ignitions and 0.824 for transmission line wire-down events, both using the histogram-based gradient boosting tree algorithm (HGB) with under-sampling. We then use these models to identify which information provides the most predictive value. After line length, we find that weather and vegetation features dominate the list of top important features for ignition or wire-down risk. Distribution ignition models show more dependence on slow-varying vegetation variables such as burn index, energy release content, and tree height, whereas transmission wire-down models rely more on primary weather variables such as wind speed and precipitation. These results point to the importance of improved vegetation modeling for feeder ignition risk models, and improved weather forecasting for transmission wire-down models. We observe that infrastructure features make small but meaningful improvements to risk model predictive power.
    Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization. (arXiv:2206.02953v1 [math.OC])
    We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to sampling with replacement. For the smooth and strongly convex-strongly concave setting, we consider gradient descent ascent and the proximal point method, and present a unified analysis of two popular without-replacement sampling strategies, namely Random Reshuffling (RR), which shuffles the data every epoch, and Single Shuffling or Shuffle Once (SO), which shuffles only at the beginning. We obtain tight convergence rates for RR and SO and demonstrate that these strategies lead to faster convergence than uniform sampling. Moving beyond convexity, we obtain similar results for smooth nonconvex-nonconcave objectives satisfying a two-sided Polyak-Łojasiewicz inequality. Finally, we demonstrate that our techniques are general enough to analyze the effect of data-ordering attacks, where an adversary manipulates the order in which data points are supplied to the optimizer. Our analysis also recovers tight rates for the incremental gradient method, where the data points are not shuffled at all.
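    The four data orderings named in the abstract above differ only in how the index sequence is produced. The sketch below makes that concrete. The function name `index_stream` and the scheme labels are hypothetical, and the optimizer that would consume the indices (for example, gradient descent ascent) is deliberately left out.

```python
import random

def index_stream(n, epochs, scheme, rng):
    """Yield n * epochs data indices under one of four ordering schemes.

    "RR"      Random Reshuffling: a fresh permutation every epoch
    "SO"      Shuffle Once: one permutation, reused in every epoch
    "IG"      incremental gradient: fixed order 0..n-1, never shuffled
    "uniform" i.i.d. sampling with replacement
    """
    if scheme not in ("RR", "SO", "IG", "uniform"):
        raise ValueError(f"unknown scheme: {scheme}")
    order = list(range(n))
    if scheme == "SO":
        rng.shuffle(order)  # shuffle only at the beginning
    for _ in range(epochs):
        if scheme == "RR":
            rng.shuffle(order)  # reshuffle at the start of each epoch
        if scheme == "uniform":
            for _ in range(n):
                yield rng.randrange(n)
        else:
            yield from order
```

    Under "RR", "SO", and "IG" every data point is visited exactly once per epoch, which is the property that without-replacement analyses exploit; "uniform" gives no such guarantee.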
    Zeroth-Order SciML: Non-intrusive Integration of Scientific Software with Deep Learning. (arXiv:2206.02785v1 [cs.LG])
    Using deep learning (DL) to accelerate and/or improve scientific workflows can yield discoveries that are otherwise impossible. Unfortunately, DL models have yielded limited success in complex scientific domains due to large data requirements. In this work, we propose to overcome this issue by integrating the abundance of scientific knowledge sources (SKS) with the DL training process. Existing knowledge integration approaches are limited to using differentiable knowledge sources, to remain compatible with the first-order DL training paradigm. In contrast, our proposed approach treats the knowledge source as a black box, which in turn allows integrating virtually any knowledge source. To enable end-to-end training of SKS-coupled DL, we propose to use zeroth-order optimization (ZOO) based gradient-free training schemes, which are non-intrusive, i.e., do not require making any changes to the SKS. We evaluate the performance of our ZOO training scheme on two real-world material science applications. We show that the proposed scheme is able to effectively integrate scientific knowledge with DL training and is able to outperform a purely data-driven model for data-limited scientific applications. We also discuss some limitations of the proposed method and mention potentially worthwhile future directions.
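    To make "gradient-free training against a black box" concrete, here is a minimal sketch of one common zeroth-order estimator, coordinate-wise central differences, driving plain gradient descent. It is an illustration under assumptions, not the paper's training scheme: `zo_gradient`, `zo_descent`, and the toy `black_box` function are hypothetical, and practical ZOO methods often use random-direction estimators to reduce the number of function queries.

```python
def zo_gradient(f, x, mu=1e-4):
    """Estimate the gradient of f at x using only function evaluations."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += mu
        down[i] -= mu
        # Central difference along coordinate i; f is never differentiated.
        grad.append((f(up) - f(down)) / (2.0 * mu))
    return grad

def zo_descent(f, x0, lr=0.1, steps=100, mu=1e-4):
    """Gradient descent that queries f as a black box through zo_gradient."""
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, mu)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def black_box(x):
    """Stand-in for a non-differentiable knowledge source (toy quadratic)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

solution = zo_descent(black_box, [0.0, 0.0])
```

    Each gradient estimate costs two evaluations of `f` per coordinate, which is why this simple variant scales poorly with the number of parameters.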
    Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks. (arXiv:2206.02916v1 [cs.LG])
    We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories. These memories can then be recalled to quickly re-train a neural network and recover the performance (instead of storing and re-training on the full original dataset). Building upon the dataset distillation framework, we make a key observation that a shared common representation allows for more efficient and effective distillation. Concretely, we learn a set of bases (aka "memories") which are shared between classes and combined through learned flexible addressing functions to generate a diverse set of training examples. This leads to several benefits: 1) the size of compressed data does not necessarily grow linearly with the number of classes; 2) an overall higher compression rate with more effective distillation is achieved; and 3) more generalized queries are allowed beyond recalling the original classes. We demonstrate state-of-the-art results on the dataset distillation task across five benchmarks, including up to 16.5% and 9.7% in retained accuracy improvement when distilling CIFAR10 and CIFAR100 respectively. We then leverage our framework to perform continual learning, achieving state-of-the-art results on four benchmarks, with 23.2% accuracy improvement on MANY.
    Flexible Group Fairness Metrics for Survival Analysis. (arXiv:2206.03256v1 [cs.CY])
    Algorithmic fairness is an increasingly important field concerned with detecting and mitigating biases in machine learning models. There has been a wealth of literature for algorithmic fairness in regression and classification; however, there has been little exploration of the field for survival analysis. Survival analysis is the prediction task in which one attempts to predict the probability of an event occurring over time. Survival predictions are particularly important in sensitive settings such as when utilising machine learning for diagnosis and prognosis of patients. In this paper we explore how to utilise existing survival metrics to measure bias with group fairness metrics. We explore this in an empirical experiment with 29 survival datasets and 8 measures. We find that measures of discrimination are able to capture bias well whereas there is less clarity with measures of calibration and scoring rules. We suggest further areas for research including prediction-based fairness metrics for distribution predictions.
    Discrete State-Action Abstraction via the Successor Representation. (arXiv:2206.03467v1 [cs.AI])
    When reinforcement learning is applied with sparse rewards, agents must spend a prohibitively long time exploring the unknown environment without any learning signal. Abstraction is one approach that provides the agent with an intrinsic reward for transitioning in a latent space. Prior work focuses on dense continuous latent spaces, or requires the user to manually provide the representation. Our approach is the first for automatically learning a discrete abstraction of the underlying environment. Moreover, our method works on arbitrary input spaces, using an end-to-end trainable regularized successor representation model. For transitions between abstract states, we train a set of temporally extended actions in the form of options, i.e., an action abstraction. Our proposed algorithm, Discrete State-Action Abstraction (DSAA), iteratively swaps between training these options and using them to efficiently explore more of the environment to improve the state abstraction. As a result, our model is not only useful for transfer learning but also in the online learning setting. We empirically show that our agent is able to explore the environment and solve provided tasks more efficiently than baseline reinforcement learning algorithms. Our code is publicly available at https://github.com/amnonattali/dsaa.
    Goal-Space Planning with Subgoal Models. (arXiv:2206.02902v1 [cs.LG])
    This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.
    A Human-Centric Take on Model Monitoring. (arXiv:2206.02868v1 [cs.LG])
    Predictive models are increasingly used to make various consequential decisions in high-stakes domains such as healthcare, finance, and policy. It becomes critical to ensure that these models make accurate predictions, are robust to shifts in the data, do not rely on spurious features, and do not unduly discriminate against minority groups. To this end, several approaches spanning various areas such as explainability, fairness, and robustness have been proposed in recent literature. Such approaches need to be human-centered as they cater to the understanding of the models to their users. However, there is a research gap in understanding the human-centric needs and challenges of monitoring machine learning (ML) models once they are deployed. To fill this gap, we conducted an interview study with 13 practitioners who have experience at the intersection of deploying ML models and engaging with customers spanning domains such as financial services, healthcare, hiring, online retail, computational advertising, and conversational assistants. We identified various human-centric challenges and requirements for model monitoring in real-world applications. Specifically, we found the need and the challenge for the model monitoring systems to clarify the impact of the monitoring observations on outcomes. Further, such insights must be actionable, robust, customizable for domain-specific use cases, and cognitively considerate to avoid information overload.
    Training Subset Selection for Weak Supervision. (arXiv:2206.02914v1 [stat.ML])
    Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
    Shuffled Check-in: Privacy Amplification towards Practical Distributed Learning. (arXiv:2206.03151v1 [cs.LG])
    Recent studies of distributed computation with formal privacy guarantees, such as differentially private (DP) federated learning, leverage random sampling of clients in each round (privacy amplification by subsampling) to achieve satisfactory levels of privacy. Achieving this however requires strong assumptions which may not hold in practice, including precise and uniform subsampling of clients, and a highly trusted aggregator to process clients' data. In this paper, we explore a more practical protocol, shuffled check-in, to resolve the aforementioned issues. The protocol relies on clients making independent and random decisions to participate in the computation, removing the requirement of server-initiated subsampling, and enabling robust modelling of client dropouts. Moreover, a weaker trust model known as the shuffle model is employed instead of using a trusted aggregator. To this end, we introduce new tools to characterize the Rényi differential privacy (RDP) of shuffled check-in. We show that our new techniques improve the privacy guarantee by at least a factor of three over those using approximate DP's strong composition in various parameter regimes. Furthermore, we provide a numerical approach to track the privacy of generic shuffled check-in mechanisms, including distributed stochastic gradient descent (SGD) with the Gaussian mechanism. To the best of our knowledge, this is also the first evaluation of the Gaussian mechanism within the local/shuffle model under the distributed setting in the literature, which can be of independent interest.
    Improving Mini-batch Optimal Transport via Partial Transportation. (arXiv:2108.09645v4 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite its practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative model, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.
    Unbiased estimators for random design regression. (arXiv:1907.03411v2 [stat.ML] UPDATED)
    In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.
    Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions. (arXiv:2206.03254v1 [cs.LG])
    This theoretical paper is devoted to developing a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets for very high dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained analysis of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization and angular distribution in high-dimensional data space. This angle-based detailed analysis leads to asymptotic characterizations of gradient norm and directional curvature of objective function at each gradient descent iteration, revealing that the empirical loss function enjoys nice geometrical properties in the overparameterized setting. Along the way, we significantly improve existing theoretical bounds on both over-parameterization condition and learning rate with very mild assumptions for learning very high dimensional data. Moreover, we uncover the role of the geometrical and spectral properties of the input data in determining desired over-parameterization size and global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space $\mapsto$ spectra of overparameterized activation matrices $\mapsto$ favorable geometrical properties of empirical loss landscape $\mapsto$ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms have much stronger statistical guarantees with much milder over-parameterization condition than existing theory states for learning very high dimensional data, which has rarely been explored so far.
    Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification. (arXiv:2206.03345v1 [math.OC])
    We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
    Beyond spectral gap: The role of the topology in decentralized learning. (arXiv:2206.03093v1 [cs.LG])
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
    Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics. (arXiv:2206.02972v1 [stat.ML])
    Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity, or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time-series data as a sparse combination of simpler, more interpretable components. The decomposed nature of the dynamics generalizes over previous switched approaches and enables modeling of overlapping and non-stationary drifts in the dynamics. We further present a dictionary learning-driven approach to model fitting, where we leverage recent results in tracking sparse vectors over time. We demonstrate that our model can learn efficient representations and smooth transitions between dynamical modes in both continuous-time and discrete-time examples. We show results on low-dimensional linear and nonlinear attractors to demonstrate that our decomposed dynamical systems model can well approximate nonlinear dynamics. Additionally, we apply our model to C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.
    Learning Backward Compatible Embeddings. (arXiv:2206.03040v1 [stat.ML])
    Embeddings, low-dimensional vector representations of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that either historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice. Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves intended task performance similar to that of the embedding model that is solely optimized for the intended task.
    FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data. (arXiv:2206.02792v1 [cs.LG])
    Algorithmic fairness plays an important role in machine learning and imposing fairness constraints during learning is a common approach. However, many datasets are imbalanced in certain label classes (e.g. "healthy") and sensitive subgroups (e.g. "older patients"). Empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but EO is far from being satisfied on new users. In this paper, we propose a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA). Specifically, FIFA encourages both classification and fairness generalization and can be flexibly combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, FIFA can be directly applied to achieve equalized opportunity (EqOpt); and under certain conditions, it can also be applied to other fairness notions. We demonstrate the power of FIFA by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.
    Inferring Unfairness and Error from Population Statistics in Binary and Multiclass Classification. (arXiv:2206.03234v1 [cs.LG])
    We propose methods for making inferences on the fairness and accuracy of a given classifier, using only aggregate population statistics. This is necessary when it is impossible to obtain individual classification data, for instance when there is no access to the classifier or to a representative individual-level validation set. We study fairness with respect to the equalized odds criterion, which we generalize to multiclass classification. We propose a measure of unfairness with respect to this criterion, which quantifies the fraction of the population that is treated unfairly. We then show how inferences on the unfairness and error of a given classifier can be obtained using only aggregate label statistics such as the rate of prediction of each label in each sub-population, as well as the true rate of each label. We derive inference procedures for binary classifiers and for multiclass classifiers, for the case where confusion matrices in each sub-population are known, and for the significantly more challenging case where they are unknown. We report experiments on data sets representing diverse applications, which demonstrate the effectiveness and the wide range of possible uses of the proposed methodology.
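    For readers unfamiliar with equalized odds, the sketch below computes the standard gap between two sub-populations when their binary confusion matrices are known. It is only a reference point: it is not the paper's unfairness measure (which quantifies the fraction of the population treated unfairly) and it does not cover the harder case of unknown confusion matrices. The function names and the dictionary layout are hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate) for one group.

    Assumes the group contains at least one positive and one negative
    example; otherwise the corresponding rate is undefined.
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

def equalized_odds_gap(group_a, group_b):
    """Largest absolute difference in TPR or FPR between two groups.

    Equalized odds holds exactly when this gap is zero.
    """
    tpr_a, fpr_a = rates(**group_a)
    tpr_b, fpr_b = rates(**group_b)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))
```

    With made-up counts, a group with TPR 0.8 and FPR 0.1 compared against a group with TPR 0.6 and FPR 0.2 has a gap of about 0.2, driven by the difference in true positive rates.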
    Plant 'n' Seek: Can You Find the Winning Ticket? (arXiv:2111.11153v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis has sparked the rapid development of pruning algorithms that aim to reduce the computational costs associated with deep learning during training and model deployment. Currently, such algorithms are primarily evaluated on imaging data, for which we lack ground truth information and thus the understanding of how sparse lottery tickets could be. To fill this gap, we develop a framework that allows us to plant and hide winning tickets with desirable properties in randomly initialized neural networks. To analyze the ability of state-of-the-art pruning to identify tickets of extreme sparsity, we design and hide such tickets solving four challenging tasks. In extensive experiments, we observe similar trends as in imaging studies, indicating that our framework can provide transferable insights into realistic problems. Additionally, we can now see beyond such relative trends and highlight limitations of current pruning methods. Based on our results, we conclude that the current limitations in ticket sparsity are likely of algorithmic rather than fundamental nature. We anticipate that comparisons to planted tickets will facilitate future developments of efficient pruning algorithms.
    A Simple and Optimal Policy Design for Online Learning with Safety against Heavy-tailed Risk. (arXiv:2206.02969v1 [stat.ML])
    We design simple and optimal policies that ensure safety against heavy-tailed risk in the classical multi-armed bandit problem. We start by showing that some widely used policies such as the standard Upper Confidence Bound policy and the Thompson Sampling policy incur heavy-tailed risk; that is, the worst-case probability of incurring a linear regret slowly decays at a polynomial rate of $1/T$, where $T$ is the time horizon. We further show that this heavy-tailed risk exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, for the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an exponential rate $\exp(-\Omega(\sqrt{T}))$. We further prove that this exponentially decaying rate of the tail probability is optimal across all policies that have worst-case optimality for the expected regret. Finally, we improve the policy design and analysis to the general $K$-armed bandit setting. We provide a detailed characterization of the tail probability bound for any regret threshold under our policy design. Namely, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal insights on the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk are compatible.
    Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks. (arXiv:2206.02887v1 [cs.LG])
    We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' $\textit{mismatch in the function space}$, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
    Robust Sparse Mean Estimation via Sum of Squares. (arXiv:2206.03441v1 [cs.DS])
    We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares-based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
    Confounder Analysis in Measuring Representation in Product Funnels. (arXiv:2206.02962v1 [stat.ML])
    This paper discusses an application of Shapley values in the causal inference field, specifically how to select the top confounder variables for the coarsened exact matching method in a scalable way. We use a dataset from an observational experiment involving LinkedIn members as a use case to test its applicability, and show that Shapley values are highly informative and can be leveraged for their robust importance-ranking capability.
    Training Subset Selection for Weak Supervision. (arXiv:2206.02914v1 [stat.ML])
    Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
    Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. (arXiv:2206.03230v1 [stat.ML])
    The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
    Concentration analysis of multivariate elliptic diffusion processes. (arXiv:2206.03329v1 [math.PR])
    We prove concentration inequalities and associated PAC bounds for continuous- and discrete-time additive functionals for possibly unbounded functions of multivariate, nonreversible diffusion processes. Our analysis relies on an approach via the Poisson equation allowing us to consider a very broad class of subexponentially ergodic processes. These results add to existing concentration inequalities for additive functionals of diffusion processes which have so far been available only for either bounded functions or for unbounded functions of processes from a significantly smaller class. We demonstrate the power of these exponential inequalities with two examples from very different areas. Considering a possibly high-dimensional parametric nonlinear drift model under sparsity constraints, we apply the continuous-time concentration results to validate the restricted eigenvalue condition for Lasso estimation, which is fundamental for the derivation of oracle inequalities. The results for discrete additive functionals are used to investigate the unadjusted Langevin MCMC algorithm for sampling of moderately heavy-tailed densities $\pi$. In particular, we provide PAC bounds for the sample Monte Carlo estimator of integrals $\pi(f)$ for polynomially growing functions $f$ that quantify sufficient sample and step sizes for approximation within a prescribed margin with high probability.
    Adaptive Regularization for Adversarial Training. (arXiv:2206.03353v1 [stat.ML])
    Adversarial training, which is to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is to use a data-adaptive regularization for robustifying a prediction model. We apply more regularization to data which are more vulnerable to adversarial attacks and vice versa. Even though the idea of data-adaptive regularization is not new, our data-adaptive regularization has a firm theoretical base of reducing an upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean samples) and robustness (accuracy on adversarial attacks) simultaneously to achieve the state-of-the-art performance.
    On the Convergence of Optimizing Persistent-Homology-Based Losses. (arXiv:2206.02946v1 [cs.LG])
    Topological loss based on persistent homology has shown promise in various applications. A topological loss forces the model to achieve a desired topological property. Despite its empirical success, less is known about the optimization behavior of the loss. In fact, the topological loss involves combinatorial configurations that may oscillate during optimization. In this paper, we introduce a general purpose regularized topology-aware loss. We propose a novel regularization term and also modify the existing topological loss. These contributions lead to a new loss function that not only forces the model to have the desired topological behavior, but also achieves satisfactory convergence behavior. Our main theoretical result guarantees that the loss can be optimized efficiently, under mild assumptions.
    Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. (arXiv:2202.01661v2 [cs.CY] UPDATED)
    In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v2 [cs.LG] UPDATED)
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some values of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.
    Adversarial Bandits Robust to $S$-Switch Regret. (arXiv:2205.14839v2 [cs.LG] UPDATED)
    We study the adversarial bandit problem under $S$ number of switching best arms for unknown $S$. For handling this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with basic OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. For improving the regret bound with respect to $T$, we propose to use adaptive learning rates for OMD to control variance of loss estimators, and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_T(h^\dagger)}],S\sqrt{KT}\})$, where $\rho_T(h^\dagger)$ is a variance term for loss estimators.
    Machine learning fairness notions: Bridging the gap with real-world applications. (arXiv:2006.16745v5 [cs.LG] UPDATED)
    Fairness emerged as an important requirement to guarantee that Machine Learning (ML) predictive systems do not discriminate against specific individuals or entire sub-populations, in particular, minorities. Given the inherent subjectivity of viewing the concept of fairness, several notions of fairness have been introduced in the literature. This paper is a survey that illustrates the subtleties between fairness notions through a large number of examples and scenarios. In addition, unlike other surveys in the literature, it addresses the question of: which notion of fairness is most suited to a given real-world scenario and why? Our attempt to answer this question consists in (1) identifying the set of fairness-related characteristics of the real-world scenario at hand, (2) analyzing the behavior of each fairness notion, and then (3) fitting these two elements to recommend the most suitable fairness notion in every specific setup. The results are summarized in a decision diagram that can be used by practitioners and policymakers to navigate the relatively large catalog of ML.
    Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization. (arXiv:2206.02953v1 [math.OC])
    We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to sampling with replacement. For the smooth and strongly convex-strongly concave setting, we consider gradient descent ascent and the proximal point method, and present a unified analysis of two popular without-replacement sampling strategies, namely Random Reshuffling (RR), which shuffles the data every epoch, and Single Shuffling or Shuffle Once (SO), which shuffles only at the beginning. We obtain tight convergence rates for RR and SO and demonstrate that these strategies lead to faster convergence than uniform sampling. Moving beyond convexity, we obtain similar results for smooth nonconvex-nonconcave objectives satisfying a two-sided Polyak-Łojasiewicz inequality. Finally, we demonstrate that our techniques are general enough to analyze the effect of data-ordering attacks, where an adversary manipulates the order in which data points are supplied to the optimizer. Our analysis also recovers tight rates for the incremental gradient method, where the data points are not shuffled at all.
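    The data orderings compared in this abstract differ only in how the index sequence for each epoch is produced. A minimal sketch of the four orderings follows; the function names and the generator-based interface are assumptions made for illustration, and no optimizer is included.

```python
import random
from typing import Iterator, List


def with_replacement(n: int, epochs: int, rng: random.Random) -> Iterator[List[int]]:
    """Uniform sampling: every index is drawn independently."""
    for _ in range(epochs):
        yield [rng.randrange(n) for _ in range(n)]


def random_reshuffling(n: int, epochs: int, rng: random.Random) -> Iterator[List[int]]:
    """RR: a fresh permutation of the data in every epoch."""
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        yield order


def shuffle_once(n: int, epochs: int, rng: random.Random) -> Iterator[List[int]]:
    """SO: one permutation drawn at the start and reused in every epoch."""
    order = list(range(n))
    rng.shuffle(order)
    for _ in range(epochs):
        yield list(order)


def incremental(n: int, epochs: int) -> Iterator[List[int]]:
    """Incremental gradient ordering: the data are never shuffled."""
    for _ in range(epochs):
        yield list(range(n))


if __name__ == "__main__":
    for epoch_order in random_reshuffling(5, 2, random.Random(0)):
        print(epoch_order)
```

    Each yielded list is the order in which data indices would be fed to a stochastic update within one epoch; the without-replacement variants visit every index exactly once per epoch.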
    Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime. (arXiv:2206.02927v1 [stat.ML])
    We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples from the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone and thus does not depend on the target function which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore the width does not need to grow polynomially with the number of samples in order to obtain high probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v2 [cs.LG] UPDATED)
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021).
    Progressive Distillation for Fast Sampling of Diffusion Models. (arXiv:2202.00512v2 [cs.LG] UPDATED)
    Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
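    The halving procedure described above implies a simple step-count schedule. The sketch below only does that arithmetic (no model or training is involved); the function name `halving_schedule` is an assumption made for illustration.

```python
from typing import List


def halving_schedule(start_steps: int, final_steps: int) -> List[int]:
    """Sampler step counts after each distillation round, halving every time.

    Raises ValueError when final_steps cannot be reached from start_steps
    by repeated exact halving.
    """
    if final_steps < 1 or start_steps < final_steps:
        raise ValueError("need 1 <= final_steps <= start_steps")
    schedule = [start_steps]
    while schedule[-1] > final_steps:
        if schedule[-1] % 2:
            raise ValueError("step count is odd and cannot be halved")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != final_steps:
        raise ValueError("final_steps is not reachable by halving")
    return schedule


if __name__ == "__main__":
    steps = halving_schedule(8192, 4)
    print(steps)
    print("distillation rounds:", len(steps) - 1)
```

    For the step counts quoted in the abstract, going from 8192 to 4 steps takes 11 halving rounds, since 8192 / 4 = 2^11.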
    Impossibility of Collective Intelligence. (arXiv:2206.02786v1 [cs.LG])
    Democratization of AI involves training and deploying machine learning models across heterogeneous and potentially massive environments. Diversity of data opens up a number of possibilities to advance AI systems, but also introduces pressing concerns such as privacy, security, and equity that require special attention. This work shows that it is theoretically impossible to design a rational learning algorithm that has the ability to successfully learn across heterogeneous environments, which we decoratively call collective intelligence (CI). By representing learning algorithms as choice correspondences over a hypothesis space, we are able to axiomatize them with essential properties. Unfortunately, the only feasible algorithm compatible with all of the axioms is the standard empirical risk minimization (ERM) which learns arbitrarily from a single environment. Our impossibility result reveals informational incomparability between environments as one of the foremost obstacles for researchers who design novel algorithms that learn from multiple environments, which sheds light on prerequisites for success in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multi-modal learning.
    Building Robust Ensembles via Margin Boosting. (arXiv:2206.03362v1 [cs.LG])
    In the context of adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks, and as a result, has sub-optimal robustness. Consequently, an emerging line of work has focused on learning an ensemble of neural networks to defend against adversarial attacks. In this work, we take a principled approach towards building robust ensembles. We view this problem from the perspective of margin-boosting and develop an algorithm for learning an ensemble with maximum margin. Through extensive empirical evaluation on benchmark datasets, we show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion. An important byproduct of our work is a margin-maximizing cross-entropy (MCE) loss, which is a better alternative to the standard cross-entropy (CE) loss. Empirically, we show that replacing the CE loss in state-of-the-art adversarial training techniques with our MCE loss leads to significant performance improvement.
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v3 [cs.LG] UPDATED)
    We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
    Computational Doob's $h$-transforms for Online Filtering of Discretely Observed Diffusions. (arXiv:2206.03369v1 [stat.ML])
    This paper is concerned with online filtering of discretely observed nonlinear diffusion processes. Our approach is based on the fully adapted auxiliary particle filter, which involves Doob's $h$-transforms that are typically intractable. We propose a computational framework to approximate these $h$-transforms by solving the underlying backward Kolmogorov equations using nonlinear Feynman-Kac formulas and neural networks. The methodology allows one to train a locally optimal particle filter prior to the data-assimilation procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than the bootstrap particle filter in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large.
    Unsupervised tree boosting for learning probability distributions. (arXiv:2101.11083v5 [stat.ME] UPDATED)
    We propose an unsupervised tree boosting algorithm for inferring the underlying sampling distribution of an i.i.d. sample based on fitting additive tree ensembles in a fashion analogous to supervised tree boosting. Integral to the algorithm is a new notion of "addition" on probability distributions that leads to a coherent notion of "residualization", i.e., subtracting a probability distribution from an observation to remove the distributional structure from the sampling distribution of the latter. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions due to several "group-like" properties of univariate CDFs. While the traditional multivariate CDF does not preserve these properties, a new definition of multivariate CDF can restore these properties, thereby allowing the notions of "addition" and "residualization" to be formulated for multivariate settings as well. This then gives rise to the unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm allows analytic evaluation of the fitted density and outputs a generative model that can be readily sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that separately fits the marginals and the copula. The algorithm then performs competitively with state-of-the-art deep-learning approaches in multivariate density estimation on multiple benchmark datasets.
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v3 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex), under an interpolation regime. At the heart of our analysis is a new generalization error bound for deterministic symmetric algorithms, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, Polyak-Lojasiewicz (PL), convex and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, under the proper choice of a decreasing step size. Further, if the loss is nonconvex but the objective is PL, we derive quadratically vanishing bounds on the generalization error and the corresponding excess risk, for a choice of a large constant step size. For (resp. strongly-) convex smooth losses, we prove that full-batch GD also generalizes for large constant step sizes, and achieves (resp. quadratically) small excess risk while training fast. In all cases, we close the generalization error gap, by showing matching generalization and optimization error rates. Our full-batch GD generalization error and excess risk bounds are strictly tighter than existing bounds for (stochastic) GD, when the loss is smooth (but possibly non-Lipschitz).
    A Robust Classification-autoencoder to Defend Outliers and Adversaries. (arXiv:2106.15927v2 [cs.LG] UPDATED)
    In this paper, a robust classification-autoencoder (CAE) is proposed, which has a strong ability to recognize outliers and defend against adversaries. The main idea is to change the autoencoder from an unsupervised learning model into a classifier, where the encoder is used to compress samples with different labels into disjoint compression spaces and the decoder is used to recover samples from their compression spaces. The encoder is used both as a compressed feature learner and as a classifier, and the decoder is used to decide whether the classification given by the encoder is correct by comparing the input sample with the output. Since adversarial samples are seemingly inevitable for the current DNN framework, a list classifier to defend against adversaries is introduced based on the CAE, which outputs several labels and the corresponding samples recovered by the CAE. Extensive experimental results show that the CAE achieves state-of-the-art outlier recognition by finding almost all outliers; the list classifier gives near lossless classification in the sense that the output list contains the correct label for almost all adversaries and the size of the output list is reasonably small.
    On Transportation of Mini-batches: A Hierarchical Approach. (arXiv:2102.05912v5 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with a very high number of supports. The m-OT solves several smaller optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches which leads to undesirable estimation. Moreover, the m-OT does not approximate a proper metric between probability measures since the identity property is not satisfied. To address these problems, we propose a novel mini-batch scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that finds the optimal coupling between mini-batches and it can be seen as an approximation to a well-defined distance on the space of probability measures. Furthermore, we show that the m-OT is a limit of the entropic regularized version of the BoMb-OT when the regularized parameter goes to infinity. Finally, we carry out experiments on various applications including deep generative models, deep domain adaptation, approximate Bayesian computation, color transfer, and gradient flow to show that the BoMb-OT can be widely applied and performs well in various applications.  ( 2 min )
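    The averaging step and the identity-property failure described above can be sketched in a few lines. This is a minimal illustration, assuming one-dimensional samples (where exact OT between equal-size uniform samples reduces to matching sorted points) and independently drawn mini-batch pairs; the function names are hypothetical and this is not the BoMb-OT method itself.

```python
import random

def ot_cost_1d(xs, ys):
    """Exact 1-D OT cost with |x - y| ground cost between two equal-size,
    uniformly weighted samples: match the sorted points in order."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def minibatch_ot(xs, ys, batch_size, num_pairs, rng):
    """m-OT style estimate: average the OT cost over random mini-batch pairs."""
    total = 0.0
    for _ in range(num_pairs):
        batch_x = rng.sample(xs, batch_size)
        batch_y = rng.sample(ys, batch_size)
        total += ot_cost_1d(batch_x, batch_y)
    return total / num_pairs

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]

full_cost = ot_cost_1d(data, data)               # identical samples: cost is 0
mot_cost = minibatch_ot(data, data, 8, 50, rng)  # mini-batches differ: cost > 0
print(full_cost, mot_cost)
```

    Comparing a sample with itself gives zero full OT cost but a strictly positive mini-batch average, which is the identity-property issue the abstract refers to.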
    Concentration bounds for SSP Q-learning for average cost MDPs. (arXiv:2206.03328v1 [cs.LG])
    We derive a concentration bound for a Q-learning algorithm for average cost Markov decision processes based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.  ( 2 min )
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v1 [stat.ML])
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimator from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that the greedy policy achieves, after $T$ rounds, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. We finally show with a proof of concept simulation that our policy achieves sub-linear fair pseudo-regret also in practice.  ( 2 min )
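    The relative-rank idea can be illustrated with an empirical CDF. The sketch below takes the estimated rewards as given (in the abstract they would come from a ridge regression estimator, which is omitted here); the group names, reward values, and function names are made up for illustration.

```python
from bisect import bisect_right

def relative_rank(reward, group_rewards):
    """Empirical CDF of the group's observed rewards, evaluated at `reward`."""
    ordered = sorted(group_rewards)
    return bisect_right(ordered, reward) / len(ordered)

def pick_by_relative_rank(candidates, group_history):
    """candidates: (name, group, estimated_reward) triples.
    Returns the (name, rank) pair with the highest within-group rank."""
    ranked = [(name, relative_rank(reward, group_history[group]))
              for name, group, reward in candidates]
    return max(ranked, key=lambda pair: pair[1])

# Hypothetical reward histories: group B's rewards are systematically lower.
group_history = {
    "A": [0.2, 0.4, 0.6, 0.8],
    "B": [0.05, 0.10, 0.15, 0.20],
}
candidates = [("a1", "A", 0.5), ("b1", "B", 0.18)]

choice = pick_by_relative_rank(candidates, group_history)
print(choice)
```

    Here candidate b1 has the lower raw reward but the higher rank within its own group, so a rank-based policy selects b1 where a reward-greedy policy would select a1.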
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v2 [stat.ML] UPDATED)
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.  ( 2 min )
    Relaxed Gaussian process interpolation: a goal-oriented approach to Bayesian optimization. (arXiv:2206.03034v1 [stat.CO])
    This work presents a new procedure for obtaining predictive distributions in the context of Gaussian process (GP) modeling, with a relaxation of the interpolation constraints outside some ranges of interest: the mean of the predictive distributions no longer necessarily interpolates the observed values when they are outside ranges of interest, but is simply constrained to remain outside. This method, called relaxed Gaussian process (reGP) interpolation, provides better predictive distributions in ranges of interest, especially in cases where a stationarity assumption for the GP model is not appropriate. It can be viewed as a goal-oriented method and becomes particularly interesting in Bayesian optimization, for example, for the minimization of an objective function, where good predictive distributions for low function values are important. When the expected improvement criterion and reGP are used for sequentially choosing evaluation points, the convergence of the resulting optimization algorithm is theoretically guaranteed (provided that the function to be optimized lies in the reproducing kernel Hilbert space attached to the known covariance of the underlying Gaussian process). Experiments indicate that using reGP instead of stationary GP models in Bayesian optimization is beneficial.  ( 2 min )
    Per-Instance Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v2 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute per-instance privacy guarantees for individual examples when running DP-SGD. We use our algorithm to investigate per-instance privacy losses across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bounds. We further discover that the loss and the privacy loss on an example are well-correlated. This implies groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy loss. For example, on CIFAR-10, the average $\epsilon$ of the class with the highest loss (Cat) is 32% higher than that of the class with the lowest loss (Ship). We also run membership inference attacks to show this reflects disparate empirical privacy risks.  ( 2 min )
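    For context, the baseline update that gives every datapoint the same worst-case guarantee can be sketched as follows. This is the standard DP-SGD step (per-example clipping plus Gaussian noise), not the per-instance accounting algorithm, which the abstract does not detail; parameter names such as `max_norm` and `noise_multiplier` are assumptions for this sketch.

```python
import math
import random

def clip(grad, max_norm):
    """Rescale a gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch = len(per_example_grads)
    return [p - lr * n / batch for p, n in zip(params, noisy)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
# With the noise switched off, only the clipping effect is visible.
no_noise = dp_sgd_step([0.0, 0.0], grads, 1.0, 1.0, 0.0, rng)
noisy = dp_sgd_step([0.0, 0.0], grads, 1.0, 1.0, 1.1, rng)
print(no_noise, noisy)
```

    The first example is clipped to norm 1 while the second passes through unchanged, yet both are covered by the same guarantee, which is calibrated to the clipping norm.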
    Generalization Error Bounds for Deep Neural Networks Trained by SGD. (arXiv:2206.03299v1 [cs.LG])
    Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived by combining a dynamical control of an appropriate parameter norm and the Rademacher complexity estimate based on parameter norms. The bounds explicitly depend on the loss along the training trajectory, and work for a wide range of network architectures including multilayer perceptron (MLP) and convolutional neural networks (CNN). Compared with other algorithm-dependent generalization estimates such as uniform stability-based bounds, our bounds do not require $L$-smoothness of the nonconvex loss function, and apply directly to SGD instead of Stochastic Langevin gradient descent (SGLD). Numerical results show that our bounds are non-vacuous and robust to changes in the optimizer and network hyperparameters.  ( 2 min )
    Learning in Observable POMDPs, without Computationally Intractable Oracles. (arXiv:2206.03446v1 [cs.LG])
    Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v1 [stat.ML])
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    Reweighting samples under covariate shift using a Wasserstein distance criterion. (arXiv:2010.09267v2 [math.ST] UPDATED)
    Considering two random variables with different laws to which we only have access through finite size iid samples, we address how to reweight the first sample so that its empirical distribution converges towards the true law of the second sample as the size of both samples goes to infinity. We study an optimal reweighting that minimizes the Wasserstein distance between the empirical measures of the two samples, and leads to an expression of the weights in terms of Nearest Neighbors. The consistency and some asymptotic convergence rates in terms of expected Wasserstein distance are derived, and do not need the assumption of absolute continuity of one random variable with respect to the other. These results have some application in Uncertainty Quantification for decoupled estimation and in the bound of the generalization error for the Nearest Neighbor Regression under covariate shift.  ( 2 min )
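    One natural reading of "weights in terms of Nearest Neighbors" is sketched below, as an assumption rather than the paper's exact construction: when the weights on the first sample are free, the Wasserstein distance to the second empirical measure is minimized by sending each second-sample point's mass to its nearest first-sample point. The sample values and function name are illustrative.

```python
def nearest_neighbor_weights(source, target):
    """Weight of source[i] = fraction of target points whose nearest
    source point is source[i] (1-D, absolute-value distance)."""
    counts = [0] * len(source)
    for y in target:
        nearest = min(range(len(source)), key=lambda i: abs(source[i] - y))
        counts[nearest] += 1
    return [c / len(target) for c in counts]

source = [0.0, 1.0, 2.0, 3.0]   # sample from the first law
target = [1.9, 2.1, 2.2, 3.4]   # sample from the second (shifted) law

weights = nearest_neighbor_weights(source, target)
print(weights)
```

    Source points far from the target sample receive zero weight, and no density ratio between the two laws is needed, consistent with the abstract's remark that absolute continuity is not assumed.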
    Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds. (arXiv:2206.02834v1 [cs.LG])
    We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\alpha+ 1/\sqrt{M}\right) \sqrt{dT}$ regret for each good agent; here, $d$ is the model-dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.  ( 2 min )
    RORL: Robust Offline Reinforcement Learning via Conservative Smoothing. (arXiv:2206.02829v1 [cs.LG])
    Offline reinforcement learning (RL) provides a promising direction to exploit the massive amount of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative for value estimation and action selection. However, such conservatism impairs the robustness of learned policies, leading to a significant change even for a small perturbation on observations. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset and additional conservative value estimation on these OOD states. Theoretically, we show RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL can achieve the state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbation.  ( 2 min )

  • Open

    “Conscious AI”
    submitted by /u/DANGERD0OM [link] [comments]
    Quick question for all those who are trying to build stuff with AI/ML
    Quick question for all those who are trying to build stuff with AI/ML: why do you care/not care about reproducible/usable code/models? I know it's a basic question, but I'm trying to dive deeper and understand the underlying reasons why it matters or doesn't matter to you. (Basically a 5 whys analysis of this question.) submitted by /u/MLtinkerer [link] [comments]  ( 1 min )
    Sparse Neural Networks Optimize Efficiency with Neuroscience
    submitted by /u/aidev2040 [link] [comments]
    MELODIES POSITIVE: An Artificial Waterfall.
    submitted by /u/cookingandcraft [link] [comments]
    DISCO DIFFUSION 3D AI ART ANIMATION | NIDAVELIR’S MAGNIFICENCE
    submitted by /u/Available_Tadpole829 [link] [comments]
    White Walkers - The Silent Death? - AI Art Experiment in 4K 60 FPS w/ GPT-3
    submitted by /u/MLInsights [link] [comments]
    r/StockNewsandEarnings - Join here for latest stock and crypto news, predictions and most anticipated company earnings report.
    https://www.reddit.com/r/StockNewsandEarnings/ submitted by /u/Brightnels [link] [comments]
    Sustainable AI hackathon
    Are you ready to put your coding skills to the test for the global cause? Swiss AI Association and Deep Learning Labs teamed up to make a Sustainable AI Hackathon! In the next few days, from 10th to 12th June, we welcome everyone interested in AI, regardless of their specialization or experience level. If you are looking for opportunities to learn from advanced AI experts - here it is! Pick one of three challenges and put your effort into making a difference and just have fun with AI: 👉 Ethical AI & Decision Making 👉 Sustainable Investing 👉 Cybersecurity in Fintech All participants in the Hackathon will get a cloud voucher from AWS worth 25 USD. And, of course, the best teams will get special rewards. Register here - https://lablab.ai/event/swissai I'll also be happy to answer any of your questions https://preview.redd.it/r9yjxb4b46491.png?width=1920&format=png&auto=webp&s=83351ac6d161581955eee534c7f5f1749866bb33 submitted by /u/zakrzzz [link] [comments]  ( 1 min )
    Top 10 AI Development and Implementation Challenges
    This article explains 10 challenges with AI development and implementation and ways to deal with them: https://www.toolbox.com/tech/artificial-intelligence/guest-article/top-10-ai-development-and-implementation-challenges/ submitted by /u/lklimusheuskaja [link] [comments]  ( 1 min )
    What do people think? This AI can tell this. Enter keywords you want to know!
    ​ https://reddit.com/link/v6qwbd/video/2e1u0wj326491/player submitted by /u/supercornson [link] [comments]  ( 1 min )
    When you have a bizarre idea for a crossover and access to an AI art generator:
    submitted by /u/summitofpizza [link] [comments]
    Inversing The Poles: Now What?? - AI Art Experiment in 4K w/ GPT-3
    submitted by /u/MLInsights [link] [comments]
    9 Best Artificial Intelligence books for beginners to expert to read in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 1 min )
    Weekly China AI News: Chinese Top AI Institute Creates A Virtual Worm; Former JD.com AI Chief Joins Tsinghua University; Meet AI Video Generator CogVideo
    submitted by /u/trcytony [link] [comments]
    DISCO DIFFUSION 3D AI ART ANIMATION | DEATH IN HELHEIM (NIFLHEIM)
    submitted by /u/Available_Tadpole829 [link] [comments]
    Cyberpunk: The Fight Against The Oppressive Regime - AI Art Experiment
    submitted by /u/MLInsights [link] [comments]
  • Open

    [D] Quick question for all those who are trying to build stuff with AI/ML
    Quick question for all those who are trying to build stuff with AI/ML: why do you care/not care about reproducible/usable code/models? I know it's a basic question, but I'm trying to dive deeper and understand the underlying reasons why it matters or doesn't matter to you. (Basically a 5 whys analysis of this question.) submitted by /u/MLtinkerer [link] [comments]  ( 1 min )
    [D] Neural Network Layers as Operations on Data Collections Types
    I had an observation recently that I wanted to share and get feedback on. Many of the canonical deep learning layer types can be viewed as an operation on one of the basic data collection types used by Python (and other languages): Dense Layers -> Tuples; Recurrent Layers -> Lists; Attention Layers -> Sets; Graph Neural Network Layers -> Dictionaries. Am I missing any? submitted by /u/emuccino [link] [comments]  ( 1 min )
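    The "Attention Layers -> Sets" row can be probed directly: self-attention without positional encodings is permutation-equivariant, so reordering the inputs only reorders the outputs. A minimal sketch, assuming single-head dot-product attention with identity query/key/value projections (a simplification chosen for this example):

```python
import math

def self_attention(tokens):
    """Single-head dot-product self-attention with identity Q/K/V projections."""
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) for key in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        outputs.append([
            sum(e / total * value[d] for e, value in zip(exps, tokens))
            for d in range(len(query))
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Equivariance check: shuffling the inputs shuffles the outputs the same way.
equivariant = all(
    abs(out_shuffled[j][d] - out[perm[j]][d]) < 1e-9
    for j in range(len(perm)) for d in range(2)
)
print(equivariant)
```

    A recurrent layer would fail the same check, since its output depends on the order in which tokens are consumed, which fits the "Lists" row.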
    [D] Masking out loss values
    Hey, I would like to start a discussion about the following topic. I have a GAN with a generator and a discriminator. Suppose I mask out some loss values randomly, say by setting 10% of the loss values to zero. How does this affect the training? How does the optimizer handle such masking? Does such random masking of the losses create spikes in the loss surface, or am I completely wrong? submitted by /u/SeucheAchat9115 [link] [comments]  ( 2 min )
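    A small experiment may help frame the question. The sketch below assumes a mean-reduced loss and a one-parameter linear model with squared error (not a GAN); zeroing a per-sample loss also zeroes that sample's gradient, so the masked gradient is an average over the surviving samples only.

```python
import random

def per_sample_grads(w, xs, ys):
    """Gradient of (w*x - y)^2 with respect to w, one value per sample."""
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]

def masked_mean_grad(grads, mask):
    """Gradient of the mean loss after multiplying each loss by its mask bit."""
    return sum(g * m for g, m in zip(grads, mask)) / len(grads)

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x for x in xs]
grads = per_sample_grads(0.0, xs, ys)

full = masked_mean_grad(grads, [1] * len(grads))

# Average the masked gradient over many random masks that drop 10% of losses.
trials = 500
avg_masked = sum(
    masked_mean_grad(grads, [0 if rng.random() < 0.1 else 1 for _ in grads])
    for _ in range(trials)
) / trials

ratio = avg_masked / full  # expected to be close to 0.9
print(ratio)
```

    Under these assumptions, random masking behaves in expectation like scaling the gradient (or the learning rate) by 0.9, plus some extra gradient noise from the changing subset, rather than altering the underlying loss surface.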
    [D] Can we explain the deep prior regularisation by the differentiation step rather than architecture?
    As the post title says, is it possible to attribute the ability of deep prior networks to perform tasks such as image inpainting to the implicit differentiation in backpropagation rather than to the architecture of the network? submitted by /u/vash_stampede08 [link] [comments]  ( 1 min )
    [D] How to balance production and research in a project, especially doing it alone.
    I need some advice as I want to deliver better results. I'm doing a project under the supervision of a professor, but most of the time I work alone as he does not have much spare time. He wants me to produce some results in image processing, like object detection, and publish some research at conferences, particularly in ML and CV. But after working for a while, I have not produced any meaningful results and haven't published any papers. Basically I'm struggling with both objectives, so I hope I can get some advice here. Should I lean more towards production or research? Or should I quit after all? submitted by /u/IcySnowy [link] [comments]  ( 1 min )
    [R] What is the best summary of neural tangent kernel research thus far?
    Do folks here have good references for a summary of what progress has been made in neural tangent kernel (NTK) research? There's an excellent and approachable blog post about the state of the field in 2018-2019 (https://rajatvd.github.io/NTK/), but I assume that there's been a lot of follow-on work since. Thanks! submitted by /u/Yukiomo [link] [comments]  ( 1 min )
    [Discussion] Tracking, running and managing experiments in sandbox environment
    Hi everyone, I'm looking for a system for collecting and sharing KPIs, and for managing and running experiments on local and remote nodes. My requirements are: a sandbox environment (server and nodes run in a private network with no internet access); nice graphs and easy comparisons; an easy way to share datasets; support for local and remote nodes. Preferred, but not mandatory: open source; support for distributed training. I've heard a lot of good recommendations for ClearML and Weights & Biases. I tested both to see if they work in a sandbox environment, but when the servers tried to validate the free trial license, validation failed due to connectivity issues. Does anyone know of another experiment management system that can work without internet access and has capabilities similar to ClearML and Weights & Biases? submitted by /u/Intelligent_Gene_283 [link] [comments]  ( 1 min )
    [P] A shared arxiv-PDF-viewer
    What if you could read a paper and at the same time have a scientific discussion about certain paragraphs or figures? Just mark the sentence or picture and create a new thread about it, ask a question, explain something in greater detail, or link to a blog post that explains a concept better. I think that would be awesome and a win-win for the authors and the readers. I am kind of a scientist myself and I would love to see something like it. Does something like that exist? If not, I would like to make that a (shared) project. Looking forward to your suggestions! submitted by /u/mingaflo [link] [comments]  ( 2 min )
    [P] Several of my past and current projects / “Amateur” programmer fought cancer with 50 Nvidia Geforce 1080Ti
    ​ https://preview.redd.it/fzh4ghfmn7491.jpg?width=600&format=pjpg&auto=webp&s=ac0985c3373a27a59c5a819ecc5b833b334651a4 Since this whole series of projects and reports kind of started from this sub Reddit, I feel like it's appropriate to have a thread here to organize them. First, the English translation of the news article: https://howardchen.substack.com/p/this-amateur-programmer-fought-cancer?s=w The original Chinese version: https://www.toutiao.com/article/7094940100450107935/ The video: https://www.bilibili.com/video/BV1x3411V7tL?spm_id_from=333.337.search-card.all.click (In Chinese, more than 2M views at the moment) Youtube version: https://www.youtube.com/watch?v=-t-a6l8a2N0&t=3s The Hacker News discussion a couple weeks ago: https://news.ycombinator.com/item?id=31449147…  ( 1 min )
    [R] On the Advance of Making Language Models Better Reasoners - 2022 Microsoft
    Paper: https://arxiv.org/abs/2206.02336 Abstract: Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning. However, they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, successfully boosting the GSM8K benchmark from 17.9% to 58.1% in terms of problem solving rate. In this paper, we propose a new approach, DiVeRSe (Diverse Verifier on Reasoning Step), to further advance their reasoning capability. DiVeRSe first explores different prompts to enhance the diversity in reasoning paths. Second, DiVeRSe introduces a verifier to distinguish good answers from bad answers for a better weighted voting. Finally, DiVeRSe verifies the correctness of each single step rather than all the steps in a whole. We conduct extensive experiments using the latest language model code-davinci-002 and demonstrate that DiVeRSe can achieve new state-of-the-art performance on six out of eight reasoning benchmarks (e.g., GSM8K 74.4% to 83.2%), outperforming the PaLM model with 540B parameters. https://preview.redd.it/905d5ndrf6491.jpg?width=722&format=pjpg&auto=webp&s=069c7fd4c8039e7d3b542656a295221114552a4e https://preview.redd.it/7toqjvnrf6491.jpg?width=1136&format=pjpg&auto=webp&s=b0e9c78c9fee38c0828b2c5466679ad14ce2a631 https://preview.redd.it/kcxa0izrf6491.jpg?width=561&format=pjpg&auto=webp&s=aa377cb2a6168bd4ebb213c465dff0b3145397d5 submitted by /u/Singularian2501 [link] [comments]  ( 2 min )
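    The "weighted voting" step in the abstract can be contrasted with plain majority voting in a few lines. This is a toy sketch: the answers and verifier scores are made up, and the real DiVeRSe verifier is a trained model (with step-level checking) that is not reproduced here.

```python
from collections import defaultdict

def majority_vote(paths):
    """paths: (final_answer, verifier_score) pairs; each path counts once."""
    counts = defaultdict(int)
    for answer, _ in paths:
        counts[answer] += 1
    return max(counts, key=counts.get)

def verifier_weighted_vote(paths):
    """Each reasoning path votes with its verifier score instead of weight 1."""
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical sampled reasoning paths for one arithmetic question.
paths = [("18", 0.2), ("18", 0.3), ("18", 0.1), ("20", 0.9), ("20", 0.8)]

plain = majority_vote(paths)
weighted = verifier_weighted_vote(paths)
print(plain, weighted)
```

    With these numbers the frequent but low-scored answer wins the plain vote, while the verifier-weighted vote prefers the answer backed by the two highly scored paths.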
    [N] Sustainable AI Hackathon
    Are you ready to put your coding skills to the test for the global cause? Swiss AI Association and Deep Learning Labs teamed up to make a Sustainable AI Hackathon! In the next few days, from 10th to 12th June, we welcome everyone interested in AI, regardless of their specialization or experience level. If you are looking for opportunities to learn from advanced AI experts - here it is! Pick one of three challenges and put your effort into making a difference and just have fun with AI: 👉 Ethical AI & Decision Making 👉 Sustainable Investing 👉 Cybersecurity in Fintech All participants in the Hackathon will get a cloud voucher from AWS worth 25 USD. And, of course, the best teams will get special rewards. Register here - https://lablab.ai/event/swissai I'll also be happy to answer any of your questions https://preview.redd.it/2axn6q3756491.png?width=1920&format=png&auto=webp&s=1928df41c84e4550f72215c997dba5faaac3ae09 submitted by /u/zakrzzz [link] [comments]  ( 1 min )
    [D] [R] Dialogue generation with contrastive objectives
    There are recent studies which demonstrate that fine-tuning a language model with a contrastive loss on the token level (maximising similarity between the current token representation and the most probable next tokens in a sequence) can lead to more coherent responses. As an example we can take the paper "A Contrastive Framework for Neural Text Generation" - https://arxiv.org/abs/2202.06417. I am wondering whether it is a fruitful direction for research and development to try to implement similar contrastive objectives, but on the sentence level, for tasks such as open-domain dialogue generation. For example, we can have a language model try to complete a conversation and, in addition to the language modelling loss, there might be another loss maximising the contrast between the expected sequence of gold tokens and some distractors. One of the most cited papers on the topic of dialogue generation, TransferTransfo, does something similar in principle by incorporating a classifier optimised to find the original completion in a pool of 20 randomly sampled distractor sentences (https://arxiv.org/abs/1901.08149). So do you think recreating a similar architecture that incorporates contrastive learning can lead to performance improvements? If not, what are your concerns? What are some other suggestions on what can be done as further research in the field? submitted by /u/radi-cho [link] [comments]  ( 1 min )
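    To make the sentence-level idea concrete, here is one possible form of such a loss: an InfoNCE-style objective over sentence embeddings, where the gold response competes with distractors. This is only a sketch of what the post proposes, not the method of either cited paper; the embeddings, the temperature value, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, gold, distractors, temperature=0.1):
    """Negative log-probability of the gold response under a softmax over
    temperature-scaled similarities to the context embedding."""
    logits = [cosine(context, gold) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    top = max(logits)
    log_partition = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_partition - logits[0]

context = [1.0, 0.0]                    # hypothetical dialogue-context embedding
aligned = [0.9, 0.1]                    # response embedding close to the context
unrelated = [[0.0, 1.0], [-1.0, 0.0]]   # distractor embeddings

good = contrastive_loss(context, aligned, unrelated)
bad = contrastive_loss(context, unrelated[0], [aligned, unrelated[1]])
print(good, bad)
```

    The loss is small when the gold response sits closer to the context than every distractor and large otherwise; in training it would be added to the language modelling loss with some weighting, which is left open here.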
    [D] What are the websites to collect text/image data for a new dataset on a particular topic, such as Reddit, Quora etc.?
    I am working on a research project, and I need text/image data on a particular topic. I need to do text analysis and maybe build a text model on top of that. Where can I get data on specific topics? Some websites I can recall are Reddit and Quora. Does anyone know other sources? Please share. submitted by /u/SnooSketches2908 [link] [comments]  ( 1 min )
    [D] How is the job market for machine learning and AI in Australia? Is it comparable to those in EU major economies and Canada?
    Hello. I'm curious about what the job market for ML or AI is like in Australia. Are there a lot of opportunities there, in general? I've looked at some ML jobs in UK, Canada, and Germany, and while the market is obviously not as good as that in the US (that's to be expected tbh), it seems pretty good. Is Australia's ML job market comparable to those countries? Or is it much worse? I've heard conflicting experiences on the state of ML jobs in Australia so I was curious. Thanks! submitted by /u/masters-in-phd [link] [comments]  ( 1 min )
  • Open

    Help Support Women in AI
    Editor’s Note: I do not normally treat social media posts as articles, but this one is a bit special. Author Andrew Jones contacted me about helping to promote a new scholarship fund of $50,000 to help promote Women in Data Science, in conjunction with Women in AI. It is a worthwhile endeavor, and I hope… Read More »Help Support Women in AI The post Help Support Women in AI appeared first on Data Science Central.  ( 2 min )
    Can Starting with Waterfall Lead to Better Agile? Part II
    In my previous article, I discussed Waterfall, WaterScrumFall, big-A Agile, and business agility. Dissonance abounds among organizations struggling to transition their approaches to building solutions and maintaining their existing legacy infrastructures while remaking how they evolve themselves. In attempting to navigate this they often adopt approaches that are destined not to get them where they…  ( 7 min )
    Can Starting with Waterfall Lead to Better Agile? Part I
    Waterfall, WaterScrumFall, Agile and Agility. We should all be aware that business agility is the primary enabler for companies seeking sustainability. In the past, companies would evolve in chunks, a project at a time. The costs and risks of implementing change, whether system-related or otherwise, were large, so designing and planning upfront (the Waterfall approach)…  ( 5 min )
    Functional Testing in Agile Environment: All You Need to Know
    With today’s customers becoming more tech-savvy and sophisticated, the software businesses are becoming extremely competitive where quality is a critical factor for software projects/products and customers. Today, projects run behind schedule because of multiple factors that include requirements changing so rapidly. The quality of the project/product is determined by other factors like the skill set…  ( 3 min )
    Load Testing: Top Tools
    A subset of performance testing, load testing is just the concept of testing a given software’s ability to withstand the load, i.e., concurrent users. It refers to a kind of performance testing that determines the performance of the systems under real-life load conditions. And, this testing helps determine how the application behaves when accessed by…  ( 3 min )
    Taking Advantage of Good Press
    Getting press coverage for your work or brand is an excellent way to gain the public’s attention. It helps you make a strong impression in the public’s mind and gain popularity. Nowadays, everything revolves around bringing social media into play for your business. If your account is trending on… Read More »Taking Advantage of Good Press The post Taking Advantage of Good Press appeared first on Data Science Central.  ( 3 min )
    Data protection problems, principles and identity solutions
    CEO Dave McComb, President of Semantic Arts, noted during a talk in 2021 that one of his banking clients had customer US Social Security Numbers (unique government IDs banks typically use to authenticate the customer’s identity and control customer access to the system) stored in over 8,000 different places.  It was not unusual for this… Read More »Data protection problems, principles and identity solutions The post Data protection problems, principles and identity solutions appeared first on Data Science Central.  ( 4 min )
    What’s the Value of an AI Engineering Certificate?
    The answer is a resounding YES! Artificial Intelligence is a stream of work that requires high-level expertise in popular AI skills. To leverage maximum benefits from the AI industry, it becomes imperative to add that mettle to your educational qualifications with the world’s best AI engineer certification. The decades have gone by validating the rising demand for skilled… Read More »What’s the Value of an AI Engineering Certificate? The post What’s the Value of an AI Engineering Certificate? appeared first on Data Science Central.  ( 3 min )
    Five Technologies that Power the Metaverse
    The Metaverse is a platform that sounds like a sci-fi concept, but shockingly, it’s as real as the internet. However, this technology is new. Before it becomes the new normal, there are plenty of improvements, trends, and modifications it will go through. With time as everything is changing, the way we entertain ourselves, shop, watch… Read More »Five Technologies that Power the Metaverse The post Five Technologies that Power the Metaverse appeared first on Data Science Central.  ( 4 min )
    Omicron, COVID Variants, and Data Chaos
    Clinical research and data analytics teams need tools that will not just fix data chaos but properly store and analyze data in the first place.  Nothing proves this so well as the cascade of COVID-19 variants.   First, COVID-19 shut the world down for months. Then the Delta variant heralded a new wave of fear and… Read More »Omicron, COVID Variants, and Data Chaos  The post Omicron, COVID Variants, and Data Chaos  appeared first on Data Science Central.  ( 3 min )
    Reasons for the Cybersecurity Talent Gap
    According to a National Institute of Standards and Technology (NIST) project called Cyber Seek, there are around a million people employed in cybersecurity roles in the US. Cyber Seek estimates there are close to 600,000 US vacancies in the field. Moreover, vacancies will grow sharply through 2025.   Globally, (ISC)², a nonprofit dedicated to training and… Read More »Reasons for the Cybersecurity Talent Gap The post Reasons for the Cybersecurity Talent Gap appeared first on Data Science Central.  ( 4 min )
    A Brief Guide to Writing a Good Dissertation
    A dissertation is a piece of academic writing that aims to demonstrate the validity of a hypothesis. This phase focuses on collecting and analyzing data to prove that hypothesis. It is the most time-consuming part of the dissertation writing process. Evidence may be collected from a variety of sources. However, it must be relevant and… Read More »A Brief Guide to Writing a Good Dissertation The post A Brief Guide to Writing a Good Dissertation appeared first on Data Science Central.  ( 6 min )
    Understanding Causal AI Applications
    Most ML developers today are not familiar with causal models. Current ML models are based on correlation. In contrast, causal models deal with cause and effect. Furthermore, correlation-based models have limited explainability, do not handle novel situations well, and need a lot more data. Causal models overcome many of these limitations. Causal models can answer… Read More »Understanding Causal AI Applications The post Understanding Causal AI Applications appeared first on Data Science Central.  ( 2 min )
    The Four Golden Signals of Kubernetes
    As a globally-used system with millions of developers employing it as their primary building environment, Kubernetes is one of the most well-known tools for container management in the world. The post The Four Golden Signals of Kubernetes appeared first on Data Science Central.  ( 4 min )
    Data Trends in IoT Packaging
    For many years, brands have failed to deliver valuable packaging experiences due to a lack of technology implementation. But now, with the introduction of advanced technologies such as IoT in packaging, we can see that the future of this industry is intelligent, media-enhanced packaging. In recent years, the packaging industry has witnessed a tremendous increase… Read More »Data Trends in IoT Packaging The post Data Trends in IoT Packaging appeared first on Data Science Central.  ( 3 min )
  • Open

    Collin Stultz named co-director and MIT lead of the Harvard-MIT Program in Health Sciences and Technology
    MIT professor will leverage his research into machine learning and computer science, as well as his role as a practicing cardiologist, toward educating clinician-scientists and engineers.  ( 6 min )
  • Open

    I trained a NN to play a match 3 type of game
    submitted by /u/blazarious [link] [comments]  ( 1 min )
    Any experts with A2C graphs?
    Trying to improve my model, but I need to understand these graphs. Does anyone understand what they mean, please? I know what the rewards graphs indicate, but I'm confused by the rest. Any help would be appreciated. (Images: the rewards graph; other training graphs I don't understand.) submitted by /u/pssword123 [link] [comments]  ( 1 min )
    Procedure cloning
    submitted by /u/dwightschrute1905 [link] [comments]
    Pokemon Showdown AI - Policy Iteration Approach
    Hi Everyone, I have mocked together a self-play Pokemon Showdown AI that utilises many of the techniques employed in AlphaStar. These include: a Transformer for team/moveset embeddings; encoding of field/terrain/weather; a layer-norm LSTM; an action-type head (move or switch) as well as move and switch heads; and V-trace and UPGO (unique to AlphaStar, cannot find much on it) for the policy loss, with TD(λ) for the value loss. However, I am confused about how to design a reward function for a Pokemon battle. The simple answer is to reward -1 for losing and +1 for winning, but this is too sparse and does not converge fast. I have a reward for fainting and HP % as well as for whether a Pokemon uses a move that the other is immune to or that fails. What other rewards could/should I consider? In the AlphaStar pseudocode, they calculate the loss on the policy and value networks separately for each reward signal. Is this also the right approach here? How should I weigh these rewards such that the agent does not simply favor fainting and lose sight of winning the game? In AlphaStar, they use a discount factor of 1. My understanding is that the longer the episode, the closer the discount factor should be to 1. This makes sense for a game like StarCraft, though what should it be for Pokemon (20-40 steps per battle)? My current parameters are very similar to AlphaStar's but adjusted to run on my personal computer: 12 actors (CPU for rollout, GPU for learning); trajectory length = 32; batch size = 128; learning rate = 3e-5; discount factor = 0.9; entropy discount = 1e-2. Any advice/literature on the issues above would be greatly appreciated. submitted by /u/atomicburn125 [link] [comments]  ( 2 min )
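    The discount-factor question above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original post: it uses the common rule of thumb that rewards matter over roughly 1 / (1 - gamma) steps, and the helper names are hypothetical. With the poster's discount factor of 0.9, a win/loss reward arriving 20 to 40 steps later is weighted by roughly 0.12 down to 0.015 at the first step.

```python
import math


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon 1 / (1 - gamma); infinite when gamma == 1."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return math.inf if gamma == 1.0 else 1.0 / (1.0 - gamma)


def terminal_reward_weight(gamma: float, steps: int) -> float:
    """Weight gamma**steps that a reward `steps` ahead receives today."""
    return gamma ** steps


if __name__ == "__main__":
    # Battle lengths of 20 and 40 steps are taken from the post.
    for gamma in (0.9, 0.99, 1.0):
        weights = [terminal_reward_weight(gamma, n) for n in (20, 40)]
        print(gamma, effective_horizon(gamma), weights)
```

    Under these assumptions, a discount factor of 0.9 makes the final win/loss signal nearly invisible early in a 40-step battle, which may be one reason shaped rewards such as fainting dominate.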
    Exploding Losses
    What does it mean when the actor/critic losses explode? Rewards at the end of each epoch look like: Epoch 1: +10, Epoch 2: +12, Epoch 3: +3, Epoch 4: -12, ..., Epoch 56: -3212. submitted by /u/XecutionStyle [link] [comments]  ( 1 min )
  • Open

    Sparse Neural Networks Optimize Efficiency with Neuroscience
    submitted by /u/aidev2040 [link] [comments]
  • Open

    End-to-end Generative Pre-training for Multimodal Video Captioning
    Posted by Paul Hongsuck Seo and Arsha Nagrani, Research Scientists, Google Research, Perception Team Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments through multimodal input streams. Unlike video understanding tasks (e.g., video classification and retrieval) where the key challenge lies in processing and understanding multimodal input videos, the task of multimodal video captioning includes the additional challenge of generating grounded captions. The most widely adopted approach for this task is to train an encoder-…  ( 7 min )
  • Open

    Create train, test, and validation splits on your data for machine learning with Amazon SageMaker Data Wrangler
    In this post, we talk about how to split a machine learning (ML) dataset into train, test, and validation datasets with Amazon SageMaker Data Wrangler so you can easily split your datasets with minimal to no code. Data used for ML is typically split into the following datasets: Training – Used to train an algorithm […]  ( 7 min )
    How InfoJobs (Adevinta) improves NLP model prediction performance with AWS Inferentia and Amazon SageMaker
    This is a guest post co-written by Juan Francisco Fernandez, ML Engineer in Adevinta Spain, and AWS AI/ML Specialist Solutions Architects Antonio Rodriguez and João Moura. InfoJobs, a subsidiary company of the Adevinta group, provides the perfect match between candidates looking for their next job position and employers looking for the best hire for the […]  ( 8 min )
  • Open

    Festo Develops With Isaac Sim to Drive Its Industrial Automation
    Dionysios Satikidis was playing FIFA 19 when he realized the simulated soccer game’s realism offered a glimpse into the future for training robots. An expert in AI and autonomous systems at Festo, a German industrial control and automation company, he believed the worlds of gaming and robotics would intersect. “I’ve always been passionate about technology Read article > The post Festo Develops With Isaac Sim to Drive Its Industrial Automation appeared first on NVIDIA Blog.  ( 3 min )
    What Is Zero Trust?
    For all its sophistication, the Internet age has brought on a digital plague of security breaches. The steady drumbeat of data and identity thefts spawned a new movement and a modern mantra that’s even been the subject of a U.S. presidential mandate — zero trust. So, What Is Zero Trust? Zero trust is a cybersecurity Read article > The post What Is Zero Trust? appeared first on NVIDIA Blog.  ( 6 min )
    Feel the Need … for Speed as ‘Top Goose’ Debuts In the NVIDIA Studio
    This week In the NVIDIA Studio takes off with the debut of Top Goose, a short animation created with Omniverse Machinima and inspired by one of the greatest fictional pilots to ever grace the big screen. The project was powered by PCs using the same breed of GPU that has produced every Best Visual Effects nominee at the Academy Awards for 14 years: multiple systems with NVIDIA RTX A6000 GPUs and an NVIDIA Studio laptop — the Razer Blade 15 with a GeForce RTX 3070 Laptop GPU. The post Feel the Need … for Speed as ‘Top Goose’ Debuts In the NVIDIA Studio appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Will Artificial Intelligence take over humanity?
    An AI wrote this article. The last words are very frightening!  ( 3 min )
    Mitigating AI Bias, with …Bias
    This article is part of my Data Trust series of talks. The purpose of these articles is to break down complex but important… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )

  • Open

    DISCO DIFFUSION 3D AI ART ANIMATION | ALFHEIM’S BRILLIANT LIGHT MAGIC
    submitted by /u/Available_Tadpole829 [link] [comments]
    AI Dream 53 - EPIC Cosmic Neural Exploration by AI
    submitted by /u/LordPewPew777 [link] [comments]
    Some images I generated with DALL-E Mini
    submitted by /u/Gengar218 [link] [comments]
    What would Mona Lisa look like with a body? DALL-E 2 has an answer
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    A framework for enterprise AI adoption
    submitted by /u/bendee983 [link] [comments]
    CMU Researchers Propose Deep Attentive VAE: The First Attention-Driven Framework For Variational Inference In Deep Probabilistic Models
    The expressivity of current deep probabilistic models can be improved by selectively prioritizing statistical dependencies between latent variables that are potentially distant from each other. Attention mechanisms can be leveraged to build more expressive variational distributions in deep probabilistic models by explicitly modeling both nearby and distant interactions in the latent space. Attentive inference reduces computational footprint by alleviating the need for deep hierarchies. 👉 It achieves state-of-the-art log-likelihoods while using fewer latent layers and requiring less training time than existing models. Continue reading | Check out the paper, GitHub and blog post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Understanding Sampling With and Without Replacement (Python)
    submitted by /u/mgalarny [link] [comments]
    Golden Ratio: The Divine Proportion - [4K] Neural-Art Visualization Experiment w/ [VQGAN+CLIP]
    submitted by /u/MLInsights [link] [comments]
    How to read more papers? Here's how to make the process more friendly, efficient, and healthy
    submitted by /u/OnlyProggingForFun [link] [comments]
    Kaleidoscope: What Do You See? - [4K60 Visualization] Neural-Art Experiment
    submitted by /u/MLInsights [link] [comments]
    The story behind OpenAI API actually resembles many of the stories of startups built on top of it. 🚀 I had the privilege to interview Peter Welinder, VP of Product and Partnerships at OpenAI, together with Shubham Saboo for our O'Reilly Media book.
    submitted by /u/techn0_cratic [link] [comments]  ( 1 min )
    Researchers From Columbia University Propose ‘Neural Voice Camouflage’: An Adversarial Attack-Based Approach That Disrupts Automatic Speech Recognition Systems In Real-Time
    Have you ever had the uneasy sense that someone is listening in on your every word? This is because it may be true. Companies have been employing “bossware” to listen to their employees while they are near their computers since the dawn of time. Several “spyware” apps are available that can record phone calls. Automatic Speech Recognition models like Amazon’s Echo and Apple’s Siri may record your daily conversation based on the voice commands. To address this critical problem, a group of researchers from Columbia University has devised a new method called Neural Voice Camouflage. The crux behind the technology is that it creates bespoke audio noise in the background as a person speaks, which confuses the artificial intelligence model that transcribes the recorded sounds. The new system utilizes an “adversarial attack” method, in which machine learning is used to change sounds in such a manner that other AI models misinterpret them as something else. In some ways, it uses a machine learning model to deceive another. This procedure, however, is not as simple as it may appear because the model must first process the entire sound clip before knowing how to change it, rendering it non-functional in real-time. Several research groups have attempted to construct robust models that can break neural networks by operating in real-time throughout the previous decade. However, they have failed to achieve both prerequisites. 👉 Under real-time constraints, the proposed method jams the established speech recognition system DeepSpeech 3.9x more than baselines as measured through word error rate, and 6.6x more as measured through character error rate. Continue reading | Paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Singapore-Based Researchers Launch ‘AI Verify’: The First AI Governance Testing Framework (MVP)
    Infocomm Media Development Authority of Singapore (IMDA) and the Personal Data Protection Commission (PDPC) have launched the first AI Governance Testing Framework and Toolkit, called AI Verify, for organizations looking to demonstrate responsible AI measurably. As an early-stage product, AI Verify attempts to increase trust between businesses and their stakeholders by performing technological testing and process audits in conjunction with each other. As more products and services use AI to personalize or make autonomous predictions, there is a constant need for the public to be assured that AI systems are fair, explainable, safe, and accountable. The objective is to increase public confidence in AI while encouraging its more comprehensive application. Voluntary AI governance frameworks and guidelines have been published to help system owners and developers implement trustworthy AI products and services. Developers and owners of AI systems who want to be more transparent about their systems’ performance through technical tests and process checks can get this as a Minimum Viable Product (MVP). Understanding how AI models make judgments, and whether the predictions they make carry any unintentional bias, is a vital part of transparency. AI systems should be held accountable and subject to scrutiny. To test the MVP, companies are invited to join the trial. Continue reading | 'AI Verify' Paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    An idea to test AI safety is "a virtual world honey pot" (not fleshed out)
    I doubt this idea is unique, and I'm unsure it is workable, but hear me out. When an AI becomes smarter than humans, I (and others) see a potential safety problem. I propose we sandbox the AI in a small virtual world and somehow give the AI the impression that it is reality. I also think the AI should become smarter than the virtual humans (probably another AI) over time and be given more and more decision-making power, like the scenario that could happen in the real world as AI produces more results/profit and is more trusted. This virtual world is a honeypot of sorts. You could potentially test different safety measures such as shut-off functionality, optimization functions, and adherence to ethical systems. You could put the AI you want to use in there without problems, but I see the fake humans, and how realistic the small world is, as a problem for the realism of the results. Does this concept have a name? Any ideas to make it more workable? Feel free to point out why it won't work. submitted by /u/TransracialAsian [link] [comments]  ( 1 min )
    Magic Spells - Neural-Art Experiment [4K60]
    submitted by /u/MLInsights [link] [comments]
  • Open

    [N] Ensemble Method Reasoning
    I recently conducted experiments on ensemble methods for DL models. The weighted ensemble gets the lowest error compared to stacking and bagging. However, I'm unable to explain why, since I can find no research paper or thread on this topic. Can someone help me out? Maybe someone knows of a research paper on this? submitted by /u/AppyFizz93 [link] [comments]
    6D pose estimation of a known 3D CAD object 2022 [D]
    A follow-up to this question from 2020: I'm looking for a codebase for 6DOF pose estimation of a known 3D CAD object with RGB or RGBD. It must be: - Usable commercially (licensed under BSD, MIT, BOOST, etc.), not GPL. - Easy to set up and use (having a running Colab example would be great). - The training time required for a new CAD object should be on the order of hours, not days. - State-of-the-art or near state-of-the-art results. (See https://bop.felk.cvut.cz/home/ for benchmarks.) What would you suggest? submitted by /u/lorepieri [link] [comments]  ( 1 min )
    [P] Releasing 🤗 Evaluate - an evaluation library for ML
    Evaluation is one of the most important aspects of ML, but today’s evaluation landscape is scattered and undocumented, which makes evaluation unnecessarily hard. For that reason we are excited to release 🤗 Evaluate! https://github.com/huggingface/evaluate TL;DR 🤗 Evaluate is a Python library that lets you evaluate models and datasets with a wide range of tools! The core principles are: reproducibility: run evaluation and save results in a reproducible manner; documentation: each evaluation module comes with a metric card documenting its use and limitations; ease-of-use: same interface for a wide range of evaluation tools; coverage: in addition to model metrics we also include comparisons and measurements (more about this below); multimodal: the library covers a wide range of eval…  ( 4 min )
    [N] Graphsignal profiler now supports distributed training, automatic tracing and more frameworks.
    Hi everyone! We've recently introduced several new features to Graphsignal Profiler that I'm very excited to share here. In addition to TensorFlow, PyTorch and Keras, the profiler now natively supports Hugging Face and PyTorch Lightning. Built-in support for distributed training has been added. More info here. Trace information (using Chrome trace format) is now automatically available in the profiles. To try it out, simply follow the Quick Start guide. Any feedback is welcome! submitted by /u/l0g1cs [link] [comments]  ( 1 min )
    [D] Saving/managing the models developed by multiple people in a research group
    Hi, all. As a research group, we’re looking for a code management method that we can use to save our models/notebooks long term and reuse them when we need them. Our projects are mainly developed with deep learning methods. What is the best way to save and store the code developed in a research group? I’m wondering if there is a better solution than GitHub or a similar platform. submitted by /u/osedao [link] [comments]  ( 1 min )
    [D] Making sense of questionable transposed convolution decoder.
    The decoder in question comes from this article and is responsible for converting a set of features obtained from a sequence of frames into an audio sample of 40ms, or 640 samples total. https://preview.redd.it/6a702r0y5z391.png?width=901&format=png&auto=webp&s=67020e8e41fadaf1e22824430ab78d93556cdcc9 The problem arises when you begin doing calculations with the parameters presented by the authors. Even assuming that only the stride contributes to the increase in the length of the output sequence, the total length would come out to [batch, 1, 1280], twice the size presented in the image. Going further and accounting for kernel sizes nets a vector of size [batch, 1, 9380], which is several times larger. Looking into it further, I found the authors mention taking inspiration from WaveGAN, where, according to the structure presented, the increases in length seem to be in line with the stride used. They also seem to have an official implementation using TensorFlow v1. However, looking at the documentation for TF's implementation, they still count the kernel size towards the resulting output, so the math doesn't seem to add up. https://preview.redd.it/5xf4gc49az391.png?width=908&format=png&auto=webp&s=afad6527849b5e980cd7c7065ad35fa1489f4849 So my question is: how does one go about getting the correct output size in this situation? submitted by /u/ShujiMikami [link] [comments]  ( 1 min )
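    The output-length arithmetic in the question above can be checked with a short sketch. This is an illustration under stated assumptions, not the article's or WaveGAN's actual code: the first function follows the output-length formula documented for PyTorch's `ConvTranspose1d`, the second reflects the behaviour of "same" padding in Keras-style transposed convolutions (output length equals input length times stride), and the layer sizes in the demo (input 16, kernel 25, stride 4) are hypothetical.

```python
def conv_transpose_length(length_in: int, kernel_size: int, stride: int,
                          padding: int = 0, output_padding: int = 0,
                          dilation: int = 1) -> int:
    """Output length of a 1D transposed convolution with explicit padding."""
    return ((length_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


def same_padding_length(length_in: int, stride: int) -> int:
    """Output length when the framework uses 'same' padding."""
    return length_in * stride


if __name__ == "__main__":
    # Without padding, the kernel overhang is added to the output length.
    print(conv_transpose_length(16, kernel_size=25, stride=4))
    # Trimming the overhang via padding recovers exactly length_in * stride.
    print(conv_transpose_length(16, kernel_size=25, stride=4,
                                padding=11, output_padding=1))
    print(same_padding_length(16, stride=4))
```

    If the implementation in question uses "same" padding, which is an assumption here, the kernel size drops out of the length calculation and only the strides multiply the sequence length.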
    [D] My impression is that PhDs in this field do not necessarily confer financial benefits, and there appears to be fewer industry and academic R&D positions relative to the number of PhD graduates. If this is true, then why are PhD admissions so competitive in this field?
    As an outsider of this field who is considering learning more about the field and potentially working in R&D in the future, it appears to me like pursuing a PhD in the field right now is a relatively risky prospect. I read a lot of discussion posts on this and not only is there the fact that one has to accept the opportunity cost of the PhD, but a guarantee or a strongly reasonable expectation to land a tenure-track professor position or a well-compensated and engaging R&D position does not exist either. It seems that chances are, when all is said and done, one is likely to end up in a position that may ultimately not require the PhD credentials. If this is truly the case, then why are PhD admissions so competitive? It seems like I am missing an important piece of information that could explain the situation to me. submitted by /u/UniverseModulator [link] [comments]  ( 6 min )
    [D] Machine Learning Models for Longitudinal Data
    Recently, I had the following question about supervised classification models (e.g. random forest) for longitudinal data. Suppose I have the following data about students passing a fitness test - the students (each student has an "id") who enroll in a school take a fitness test each year and record their height and weight (at the start of each school year, before the fitness test). They can either pass (1) or fail (0) the fitness test each year. The school is interested in knowing which students are likely to fail the fitness test, so they can focus more attention on these students. Naturally, some students might have taken the fitness test more times than other students. I simulated some data (using the R programming language) to show how the historical data might look like: ​ score <…  ( 3 min )
  • Open

    Amazon SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks to boost developer productivity
    Amazon SageMaker comes with two options to spin up fully managed notebooks for exploring data and building machine learning (ML) models. The first option is fast start, collaborative notebooks accessible within Amazon SageMaker Studio – a fully integrated development environment (IDE) for machine learning. You can quickly launch notebooks in Studio, easily dial up or […]  ( 6 min )
    Reinventing retail with no-code machine learning: Sales forecasting using Amazon SageMaker Canvas
    Retail businesses are data-driven—they analyze data to get insights about consumer behavior, understand shopping trends, make product recommendations, optimize websites, plan for inventory, and forecast sales. A common approach for sales forecasting is to use historical sales data to predict future demand. Forecasting future demand is critical for planning and impacts inventory, logistics, and even […]  ( 9 min )
  • Open

    Building Value-driven Data Strategy and Economies of Learning – Part 1
    A big problem with the “Data Strategy” conversation is that many organizations think of a “Data Strategy” as a deliverable, not a journey. A Data Strategy, like a Business Strategy, should ebb and flow depending upon what is “valuable” to the organization given the current business environment. And the current business environment is constantly changing. … Read More »Building Value-driven Data Strategy and Economies of Learning – Part 1 The post Building Value-driven Data Strategy and Economies of Learning – Part 1 appeared first on Data Science Central.  ( 5 min )
    The Data Product ABCs – A Framework for Bringing Product Thinking to Data
    Let’s be honest: The way we’ve been managing data for the past 30 years hasn’t fundamentally changed. Yes, the shift to the cloud and the Modern Data Stack is making the life of data engineers easier because you don’t have to worry as much about infrastructure. Want data in a warehouse? Click, click, click… You… Read More »The Data Product ABCs – A Framework for Bringing Product Thinking to Data The post The Data Product ABCs – A Framework for Bringing Product Thinking to Data appeared first on Data Science Central.  ( 7 min )
  • Open

    Hallucinating to better text translation
    A machine-learning method imagines what a sentence visually looks like, to situate and ground its semantics in the real world, improving translation, like humans can.  ( 7 min )
  • Open

    Dealing with Delayed Actions/Rewards
    I am currently dealing with an RL issue and could use some advice. I have an environment where actions selected by the agent take time before they produce a reward. Additionally, I know the exact reward when the action is taken. 1) Is there anything in the literature that handles situations like this? 2) Is it best to accurately model my environment and obey the delayed rewards or is "cheating" acceptable if it makes training easier? Thanks for your time! submitted by /u/knightmare9114 [link] [comments]  ( 1 min )
    problem of precision in continuous actions of an environment (different magnitude and precision depending on the states)
    I'm interested to know how you can solve problems where the magnitude and precision of the actions differ depending on the state the agent is in, and where both are key to solving the environment successfully. For example, let's suppose that my actions are limited to (-1, +1). I have a robotic environment where at the start of the episode the "correct" actions are near the limit, but later on must be near 0 (-0.05, 0.05). The problem in my case is that -0.05 is very different from 0.05 in those states (it moves to a very different region of the state space), so ideally I need more precision for that range of actions (and states). I could define a magnitude conversion of the actions depending on the state the agent is in, but are there any other cleaner ways to tackle this problem? submitted by /u/NavirAur [link] [comments]  ( 1 min )
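    One way to picture the "magnitude conversion" mentioned above is a fixed, state-independent power transform between the policy output and the environment action. This is only an illustrative sketch, not a method from the post: the function names and the exponent of 3 are assumptions. It keeps the (-1, +1) limits but stretches the region around zero, so the action interval (-0.05, 0.05) corresponds to roughly (-0.37, 0.37) in policy-output space.

```python
import math


def squash_action(u: float, power: float = 3.0) -> float:
    """Map a policy output u in [-1, 1] to an action with finer steps near 0."""
    u = max(-1.0, min(1.0, u))  # clip to the allowed range
    return math.copysign(abs(u) ** power, u)


def unsquash_action(a: float, power: float = 3.0) -> float:
    """Inverse map: the policy output that produces action a."""
    return math.copysign(abs(a) ** (1.0 / power), a)


if __name__ == "__main__":
    for a in (-1.0, -0.05, 0.05, 1.0):
        print(a, unsquash_action(a))
```

    The limits are preserved, so large actions at the start of an episode stay reachable, while small actions near zero are spread over a wider band of policy outputs.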
    Easiest way to get an image of my environment
    Hi, I am working with simple spread (multi agent particle environments). I would like to modify the observation so that it sends an image to the agent. I thought about using the render function to generate an image, but the problem is that I am running the code from a remote server and I get a super long error. The fix seems to be involved (https://stackoverflow.com/questions/40195740/how-to-run-openai-gym-render-over-a-server), so I was wondering if you maybe have another way of achieving the same goal. Thanks! submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    PyTorch implementation of Dreamer-v1 and Dreamer-v2 algorithms for Dm_Control suite tasks
    Implementation Repo: https://github.com/adityabingi/Dreamer This work is a reproduction and comparison of Dreamer-v1 and Dreamer-v2 for continuous control tasks of the dm_control suite. Training of both algorithms is done entirely on single free GPUs of Google Colab for 100k timesteps due to Colab's strict timeouts. Performance comparison plots across 5 continuous control tasks from dm_control can be found in the repo. Hope this is useful for those trying to reproduce these algorithms. submitted by /u/aditya_bingi [link] [comments]  ( 1 min )
    Staying up to date in the field
    On average, how many papers do you look to read a day/week/ever? How do you go about staying up to date? So many papers get released that it’s often hard to know if one is worth reading. submitted by /u/Ok_Cartoonist_9279 [link] [comments]  ( 1 min )
    CQL - Evaluating the value of the next action at the current state ?
    Hello everyone, I'm currently trying to apply CQL to a custom domain where I need to reimplement the algorithm to fit my problem. However, I noticed something in most implementations that I found that doesn't make too much sense to me and that is not mentioned in the paper AFAIK. The importance sampling part is usually implemented as follow current_actions, current_log_pis = self.policy(observations) next_actions, next_log_pis = self.policy(next_observations) [...] q1_rand = self.qf1(observations, random_actions) q2_rand = self.qf2(observations, random_actions) q1_current_actions = self.qf1(observations, current_actions) q2_current_actions = self.qf2(observations, current_actions) q1_next_actions = self.qf1(observations, next_actions) q2_next_actions = self.qf2(observations, next_action…  ( 1 min )
    An interesting issue with REINFORCE (or vanilla policy gradient)
    Hey all, I'm working on implementing a few RL algorithms to play Mario bros. I faced a few issues in my REINFORCE implementation and took the time to document them. One of the issues that comes up is a loss explosion. I use the following code to train my Policy:

```python
import tensorflow as tf


def train_step(self, data):
    """train_step runs via `model.fit()`.

    It accepts x in the form of observations, and y in the form of a tuple of
    the actions and advantages
    """
    observations, (actions, advantages) = data
    with tf.GradientTape() as tape:
        log_probs = self.action_distribution(observations).log_prob(actions)
        loss = log_probs * advantages
        loss = -tf.math.reduce_sum(loss, axis=-1)
        # Make sure to add regularization losses
        loss += sum(self.network.losses)
    grads = tape.gradient(loss, self.trainable_weights)
    self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
    return {"loss": loss}
```

    As you can see, I take the log_probs using a tensorflow_probability.distributions.Categorical.log_prob(). These values seem to explode to -infinity, causing the loss to eventually tend towards -infinity when any of the actions consistently has a negative reward and a 0 probability. For further reading, I also documented this issue here: https://github.com/LukeWood/luig-io/tree/master/policy_gradient#loss-explosion Is this a common issue in the REINFORCE algorithm? From what I can tell, if the model learns to make the probability for a specific action 0, and the reward for that action is negative, the loss will over-prioritize getting this action close to zero, as the gradient at that point in the log function is massive. Thanks for any help figuring out this issue. submitted by /u/puppet_pals [link] [comments]  ( 1 min )
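    The effect described in the post can be reproduced without TensorFlow. The NumPy sketch below assumes a two-action softmax policy and a fixed negative advantage; `log_prob_floor` is a hypothetical guard added for illustration, not something taken from the poster's code. It shows the loss value falling without bound as the action's logit decreases, while the gradient with respect to the logits stays bounded.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax for a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def reinforce_loss(logits, action, advantage, log_prob_floor=None):
    """Per-sample REINFORCE loss: -advantage * log pi(action)."""
    log_p = log_softmax(logits)[action]
    if log_prob_floor is not None:
        # Hypothetical guard: stop rewarding ever-smaller probabilities.
        log_p = max(log_p, log_prob_floor)
    return -advantage * log_p


def reinforce_grad(logits, action, advantage):
    """Gradient of the unclipped loss with respect to the logits."""
    p = np.exp(log_softmax(logits))
    one_hot = np.zeros_like(p)
    one_hot[action] = 1.0
    return -advantage * (one_hot - p)


advantage = -1.0  # the action is consistently penalised
losses = [reinforce_loss(np.array([0.0, z]), 1, advantage) for z in (0.0, -10.0, -100.0)]
grad = reinforce_grad(np.array([0.0, -100.0]), 1, advantage)
clipped = reinforce_loss(np.array([0.0, -100.0]), 1, advantage, log_prob_floor=-20.0)
```

    In this toy setting the loss is simply unbounded below, so a steadily decreasing loss curve need not mean the gradients themselves are exploding. Whether a floor on the log-probability, advantage normalisation, or an entropy bonus is the right remedy depends on the task.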
  • Open

    How do I improve the model from here? This is a basic model trying to classify sentences/tweets into 0 or 1.
    submitted by /u/NFSL2001 [link] [comments]  ( 1 min )
    How does a cnn decide the multiplying factors inside a filter matrix
    So I'm learning about CNNs and how they help with "seeing" some defining features. What I don't understand is, once I've declared tf.keras.layers.Conv2D(64, (3,3), activation='relu', padding="same", input_shape=input_shape[1:])(x) or whatever, how does Keras know which is the optimal 3x3 filter matrix to apply? And if it learns, how does it learn? Does it apply a random filter, see how it performs on the DNN, and update accordingly, much like a NN updates its correlation between neurons? I need help please! Thank you xx submitted by /u/sexandwallstreet [link] [comments]  ( 1 min )
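    The guess in the question is close to how it usually works: filter values start out random and are adjusted by gradient descent like any other weights. The NumPy sketch below is a toy illustration of that idea, not Keras internals; the Sobel target, learning rate, and step count are assumptions chosen so that a single 3x3 filter can be recovered from input/output pairs.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation, the operation a conv layer applies."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + oh, j:j + ow]
    return out


def fit_filter(image, target, steps=500, lr=0.1, seed=0):
    """Learn a 3x3 filter by gradient descent on 0.5 * mean squared error."""
    rng = np.random.default_rng(seed)
    kernel = rng.normal(scale=0.1, size=(3, 3))  # random initial filter
    oh, ow = target.shape
    for _ in range(steps):
        error = conv2d_valid(image, kernel) - target
        # d(loss)/d(kernel[i, j]) = mean(error * matching input patch)
        grad = np.array([[np.mean(error * image[i:i + oh, j:j + ow])
                          for j in range(3)] for i in range(3)])
        kernel -= lr * grad
    return kernel


rng = np.random.default_rng(1)
image = rng.normal(size=(16, 16))
target_kernel = np.array([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]])
target = conv2d_valid(image, target_kernel)
kernel = fit_filter(image, target)
```

    In a real network the target is not a known filter; the error signal comes from the final loss and is backpropagated through all later layers, but each filter entry is updated by the same kind of gradient step.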
  • Open

    Vision in the Making: Andrew Ng’s Startup Automates Factory Inspection
    Computer vision specialist Landing AI has a unique calling card: Its co-founder and CEO is a tech rock star. At Google Brain, Andrew Ng became famous for showing how deep learning could recognize cats in a sea of images with uncanny speed and accuracy. Later, he founded Coursera, where his machine learning courses have attracted Read article > The post Vision in the Making: Andrew Ng’s Startup Automates Factory Inspection appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    A footnote to year share
    A couple weeks ago I wrote a post about the year share component of calculating the day of the week. To calculate the day of the week, you need to add the day of the month, a constant for the month, and the year share. Calculating year share is not that hard, but it’s the […] A footnote to year share first appeared on John D. Cook.  ( 2 min )
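    As a rough illustration of the sum described in the excerpt, the sketch below uses one common set of conventions, which may differ from those in the original post: 0 means Sunday, the month constants are cumulative day counts modulo 7, the year share of a two-digit year yy is (yy + yy // 4) mod 7, and a Gregorian century constant plus a leap-year correction for January and February complete the calculation. All names are illustrative.

```python
import datetime

# Assumed conventions (not necessarily those of the original post): 0 = Sunday.
MONTH_CONSTANTS = [0, 3, 3, 6, 1, 4, 6, 2, 5, 0, 3, 5]
CENTURY_CONSTANTS = {0: 6, 1: 4, 2: 2, 3: 0}  # keyed by (year // 100) % 4
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def year_share(two_digit_year):
    """Each year shifts the weekday by one, each leap year by one more."""
    return (two_digit_year + two_digit_year // 4) % 7


def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def day_of_week(year, month, day):
    """Gregorian day of the week as an index into DAY_NAMES."""
    total = (day
             + MONTH_CONSTANTS[month - 1]
             + year_share(year % 100)
             + CENTURY_CONSTANTS[(year // 100) % 4])
    if month <= 2 and is_leap(year):
        total -= 1  # the leap day has not happened yet in January/February
    return total % 7


print(DAY_NAMES[day_of_week(2022, 7, 4)])  # prints "Monday"
```

    For 4 July 2022 the terms are 4 (day) + 6 (July) + 6 (year share of 22) + 6 (century), and 22 mod 7 = 1, which is Monday under these conventions.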
  • Open

    The Top 4 AI Projects Everyone Is Talking About
    Artificial intelligence is one of the most talked-about technologies today. And for good reason — it has the potential to change the world…  ( 6 min )
  • Open

    Deep Learning Prediction of Severe Health Risks for Pediatric COVID-19 Patients with a Large Feature Set in 2021 BARDA Data Challenge. (arXiv:2206.01696v1 [cs.LG])
    Most children infected with COVID-19 have no or mild symptoms and recover on their own, but some pediatric COVID-19 patients need to be hospitalized or even to receive intensive medical care (e.g., invasive mechanical ventilation or cardiovascular support) to recover from the illness. Therefore, it is critical to predict the severe health risk that COVID-19 infection poses to children to provide precise and timely medical care for vulnerable pediatric COVID-19 patients. However, predicting the severe health risk for COVID-19 patients including children remains a significant challenge because many underlying medical factors affecting the risk are still largely unknown. In this work, instead of searching for a small number of most useful features to make predictions, we design a novel large-scale bag-of-words-like method to represent various medical conditions and measurements of COVID-19 patients. After some simple feature filtering based on logistic regression, the large set of features is used with a deep learning method to predict both the hospitalization risk for COVID-19 infected children and the severe complication risk for the hospitalized pediatric COVID-19 patients. The method was trained and tested on the datasets of the Biomedical Advanced Research and Development Authority (BARDA) Pediatric COVID-19 Data Challenge held from Sept. 15 to Dec. 17, 2021. The results show that the approach can rather accurately predict the risk of hospitalization and severe complication for pediatric COVID-19 patients and that deep learning is more accurate than other machine learning methods.  ( 2 min )
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v2 [cs.LG] UPDATED)
    Recently, a parametric family of fairness metrics for quantifying algorithmic fairness has been proposed based on generalized entropy, which was originally used in economics and public welfare. Since these metrics have several advantages, such as quantifying unfairness at both the individual level and the group level and exposing the trade-off between individual fairness and group-level fairness, an algorithmic fairness requirement may be given in terms of generalized entropy for a fair classification problem. We consider fair empirical risk minimization with a fairness constraint specified by generalized entropy. We theoretically investigate whether the fair classification problem is learnable and how to find an approximately optimal classifier for it.  ( 2 min )
    Zero-Shot Bird Species Recognition by Learning from Field Guides. (arXiv:2206.01466v1 [cs.CV])
    We exploit field guides to learn bird species recognition, in particular zero-shot recognition of unseen species. The illustrations contained in field guides deliberately focus on discriminative properties of a species, and can serve as side information to transfer knowledge from seen to unseen classes. We study two approaches: (1) a contrastive encoding of illustrations that can be fed into zero-shot learning schemes; and (2) a novel method that leverages the fact that illustrations are also images and as such structurally more similar to photographs than other kinds of side information. Our results show that illustrations from field guides, which are readily available for a wide range of species, are indeed a competitive source of side information. On the iNaturalist2021 subset, we obtain a harmonic mean from 749 seen and 739 unseen classes greater than $45\%$ (@top-10) and $15\%$ (@top-1), which shows that field guides are a valuable option for challenging real-world scenarios with many species.  ( 2 min )
    Revisiting the "Video" in Video-Language Understanding. (arXiv:2206.01720v1 [cs.CV])
    What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.  ( 2 min )
    Rashomon Capacity: A Metric for Predictive Multiplicity in Probabilistic Classification. (arXiv:2206.01295v1 [cs.LG])
    Predictive multiplicity occurs when classification models with nearly indistinguishable average performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new measure of predictive multiplicity in probabilistic classification called Rashomon Capacity. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure, report, and ultimately resolve predictive multiplicity prior to model deployment.  ( 2 min )
    Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post hoc Explanations. (arXiv:2206.01254v1 [cs.LG])
    Despite the plethora of post hoc model explanation methods, the basic properties and behavior of these methods and the conditions under which each one is effective are not well understood. In this work, we bridge these gaps and address a fundamental question: Which explanation method should one use in a given situation? To this end, we adopt a function approximation perspective and formalize the local function approximation (LFA) framework. We show that popular explanation methods are instances of this framework, performing function approximations of the underlying model in different neighborhoods using different loss functions. We introduce a no free lunch theorem for explanation methods which demonstrates that no single method can perform optimally across all neighbourhoods and calls for choosing among methods. To choose among methods, we set forth a guiding principle based on the function approximation perspective, considering a method to be effective if it recovers the underlying model when the model is a member of the explanation function class. Then, we analyze the conditions under which popular explanation methods are effective and provide recommendations for choosing among explanation methods and creating new ones. Lastly, we empirically validate our theoretical results using various real world datasets, model classes, and prediction tasks. By providing a principled mathematical framework which unifies diverse explanation methods, our work characterizes the behaviour of these methods and their relation to one another, guides the choice of explanation methods, and paves the way for the creation of new ones.  ( 2 min )
    Reinforcement Learning with Fast Stabilization in Linear Dynamical Systems. (arXiv:2007.12291v2 [cs.LG] UPDATED)
    In this work, we study model-based reinforcement learning (RL) in unknown stabilizable linear dynamical systems. When learning a dynamical system, one needs to stabilize the unknown dynamics in order to avoid system blow-ups. We propose an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment with an improved exploration strategy. We show that the proposed algorithm attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret after $T$ time steps of agent-environment interaction. We also show that the regret of the proposed algorithm has only a polynomial dependence in the problem dimensions, which gives an exponential improvement over the prior methods. Our improved exploration method is simple, yet efficient, and it combines a sophisticated exploration policy in RL with an isotropic exploration strategy to achieve fast stabilization and improved regret. We empirically demonstrate that the proposed algorithm outperforms other popular methods in several adaptive control tasks.  ( 2 min )
    Instance-dependent Label-noise Learning under a Structural Causal Model. (arXiv:2109.02986v3 [stat.ML] UPDATED)
    Label noise degrades the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful to learn the classifier and thus reduce the side effect of label noise. However, how to exploit the causal information to handle the label noise problem remains elusive. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.  ( 2 min )
    Compositional Visual Generation with Composable Diffusion Models. (arXiv:2206.01714v1 [cs.CV])
    Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation.  ( 2 min )
    Three-dimensional microstructure generation using generative adversarial neural networks in the context of continuum micromechanics. (arXiv:2206.01693v1 [cond-mat.mtrl-sci])
    Multiscale simulations are demanding in terms of computational resources. In the context of continuum micromechanics, the multiscale problem arises from the need to infer macroscopic material parameters from the microscale. If the underlying microstructure is explicitly given by means of microCT-scans, convolutional neural networks can be used to learn the microstructure-property mapping, which is usually obtained from computational homogenization. The CNN approach provides a significant speedup, especially in the context of heterogeneous or functionally graded materials. Another application is uncertainty quantification, where many expensive evaluations are required. However, one bottleneck of this approach is the large number of training microstructures needed. This work closes this gap by proposing a generative adversarial network tailored towards three-dimensional microstructure generation. The lightweight algorithm is able to learn the underlying properties of the material from a single microCT-scan without the need for explicit descriptors. During prediction time, the network can produce unique three-dimensional microstructures with the same properties as the original data in a fraction of a second and at consistently high quality.  ( 2 min )
    Hydra: A System for Large Multi-Model Deep Learning. (arXiv:2110.08633v6 [cs.DC] UPDATED)
    Scaling up model depth and size is now a common approach to raise accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion or even trillion parameter models in natural language processing (NLP) research. Despite success in DL research and at major technology companies, broader practical adoption of such large models among domain scientists and businesses is still bottlenecked by GPU memory limits, high training costs, and low GPU availability, even on public clouds. Model selection needs further compound these resource challenges: users often need to compare dozens of models with different hyper-parameters or neural architectures to suit their specific task and dataset. In this paper, we present Hydra, a system designed to tackle such challenges by enabling out-of-the-box scaling for multi-large-model DL workloads on even commodity GPUs in a resource-efficient manner. Hydra is the first approach to holistically optimize the execution of multi-model workloads for large DL models. We do this by adapting prior "model-parallel" execution schemes to work with scalable parameter offloading across the memory hierarchy and further hybridizing this approach with task-parallel job scheduling techniques. Hydra decouples scalability of model parameters from parallelism of execution, thus enabling DL users to train even a 6-billion parameter model on a single commodity GPU. It also fully exploits the speedup potential of task parallelism in multi-GPU setups, yielding near-linear strong scaling and making rigorous model selection perhaps more practical for such models. We evaluate end-to-end performance by fine-tuning GPT-2 for language modeling. We find that Hydra offers between 50% and 100% higher training throughput than even the best settings of state-of-the-art industrial frameworks such as DeepSpeed and GPipe for multi-large-model training.  ( 3 min )
    Learning Soft Constraints From Constrained Expert Demonstrations. (arXiv:2206.01311v1 [cs.LG])
    Inverse reinforcement learning (IRL) methods assume that the expert data is generated by an agent optimizing some reward function. However, in many settings, the agent may optimize a reward function subject to some constraints, where the constraints induce behaviors that may be otherwise difficult to express with just a reward function. We consider the setting where the reward function is given, and the constraints are unknown, and propose a method that is able to recover these constraints satisfactorily from the expert data. While previous work has focused on recovering hard constraints, our method can recover cumulative soft constraints that the agent satisfies on average per episode. In IRL fashion, our method solves this problem by adjusting the constraint function iteratively through a constrained optimization procedure, until the agent behavior matches the expert behavior. Despite the simplicity of the formulation, our method is able to obtain good results. We demonstrate our approach on synthetic environments and real world highway driving data.  ( 2 min )
    NanoBatch Privacy: Enabling fast Differentially Private learning on the IPU. (arXiv:2109.12191v2 [cs.LG] UPDATED)
    Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm imposes computational overheads that can undo the benefit of batching in GPUs. Micro-batching is a common method to alleviate this and is fully supported in the TensorFlow Privacy library (TFDP). However, it degrades accuracy. We propose NanoBatch Privacy, a lightweight add-on to TFDP to be used on Graphcore IPUs by leveraging batch size of 1 (without microbatching) and gradient accumulation. This allows us to achieve large total batch sizes with minimal impact on throughput. Second, we illustrate using CIFAR-10 how larger batch sizes are not necessarily optimal from a privacy versus utility perspective. On ImageNet, we achieve more than 15x speedup over TFDP versus 8x A100s and significant speedups even across libraries such as Opacus. We also provide two extensions: 1) DPSGD for pipelined models and 2) per-layer clipping that is 15x faster than the Opacus implementation on 8x A100s. Finally, as an application case study, we apply NanoBatch training for use on private Covid-19 chest CT prediction.  ( 2 min )
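    For readers unfamiliar with DPSGD, the NumPy sketch below shows the generic recipe the abstract builds on: process one example at a time, clip each per-example gradient, accumulate, then add Gaussian noise once per update. It is a textbook-style illustration on a linear model with a squared-error loss, not the authors' IPU implementation or the TensorFlow Privacy API; all function and parameter names are assumptions.

```python
import numpy as np


def per_example_grad(w, x, y):
    """Gradient of 0.5 * (w @ x - y) ** 2 with respect to w."""
    return (w @ x - y) * x


def dp_sgd_step(w, X, y, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update using batch size 1 plus gradient accumulation."""
    accumulated = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        g = per_example_grad(w, x_i, y_i)
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        accumulated += g * scale
    # Gaussian noise calibrated to the clipping norm, added once per update.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    return w - lr * (accumulated + noise) / len(X)


rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = X @ np.arange(1.0, 6.0)
w0 = np.zeros(5)
w1 = dp_sgd_step(w0, X, y, clip_norm=1.0, noise_multiplier=0.0, lr=0.5, rng=rng)
w_noisy = dp_sgd_step(w0, X, y, clip_norm=1.0, noise_multiplier=1.1, lr=0.5, rng=rng)
```

    With the noise switched off, the update can move the weights by at most lr * clip_norm, which is the bounded sensitivity that the added noise is calibrated against.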
    Supernet Training for Federated Image Classification under System Heterogeneity. (arXiv:2206.01366v1 [cs.LG])
    Efficient deployment of deep neural networks across many devices and resource constraints, especially on edge devices, is one of the most challenging problems in the presence of data-privacy preservation issues. Conventional approaches have evolved to either improve a single global model while keeping local training data decentralized (i.e., data-heterogeneity) or to train a once-for-all network that supports diverse architectural settings to address heterogeneous systems equipped with different computational capabilities (i.e., model-heterogeneity). However, little research has considered both directions simultaneously. In this work, we propose a novel framework to consider both scenarios, namely Federation of Supernet Training (FedSup), where clients send and receive a supernet that contains all possible architectures sampled from itself. It is inspired by how averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training. Specifically, in the FedSup framework, a weight-sharing approach widely used in training single-shot models is combined with the averaging of Federated Learning (FedAvg). Under our framework, we present an efficient algorithm (E-FedSup) by sending the sub-model to clients in the broadcast stage for reducing communication costs and training overhead. We demonstrate several strategies to enhance supernet training in the FL environment and conduct extensive empirical evaluations. The resulting framework is shown to pave the way for the robustness of both data- and model-heterogeneity on several standard benchmarks.  ( 2 min )
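    The parameter averaging that the abstract refers to as FedAvg can be sketched in a few lines. The example below is plain sample-size-weighted averaging of client parameters, with hypothetical names and a dictionary layout chosen for illustration; it does not implement FedSup or E-FedSup.

```python
import numpy as np


def fed_avg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count.

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   list of local dataset sizes, one per client
    """
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_weights[0]:
        averaged[name] = sum(
            weights[name] * (size / total)
            for weights, size in zip(client_weights, client_sizes)
        )
    return averaged


clients = [
    {"w": np.array([1.0, 1.0]), "b": np.array([0.0])},
    {"w": np.array([4.0, 4.0]), "b": np.array([2.0])},
]
global_model = fed_avg(clients, client_sizes=[1, 3])
```

    With sizes 1 and 3, the second client contributes three quarters of the average, so `global_model["w"]` is 0.25 * 1 + 0.75 * 4 = 3.25 in each coordinate.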
    Compressive Fourier collocation methods for high-dimensional diffusion equations with periodic boundary conditions. (arXiv:2206.01255v1 [math.NA])
    High-dimensional Partial Differential Equations (PDEs) are a popular mathematical modelling tool, with applications ranging from finance to computational chemistry. However, standard numerical techniques for solving these PDEs are typically affected by the curse of dimensionality. In this work, we tackle this challenge while focusing on stationary diffusion equations defined over a high-dimensional domain with periodic boundary conditions. Inspired by recent progress in sparse function approximation in high dimensions, we propose a new method called compressive Fourier collocation. Combining ideas from compressive sensing and spectral collocation, our method replaces the use of structured collocation grids with Monte Carlo sampling and employs sparse recovery techniques, such as orthogonal matching pursuit and $\ell^1$ minimization, to approximate the Fourier coefficients of the PDE solution. We conduct a rigorous theoretical analysis showing that the approximation error of the proposed method is comparable with the best $s$-term approximation (with respect to the Fourier basis) to the solution. Using the recently introduced framework of random sampling in bounded Riesz systems, our analysis shows that the compressive Fourier collocation method mitigates the curse of dimensionality with respect to the number of collocation points under sufficient conditions on the regularity of the diffusion coefficient. We also present numerical experiments that illustrate the accuracy and stability of the method for the approximation of sparse and compressible solutions.  ( 2 min )
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. (arXiv:1902.04522v5 [cs.AI] UPDATED)
    The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available at https://ai.facebook.com/tools/elf-opengo/.
    Learning programs by combining programs. (arXiv:2206.01614v1 [cs.LG])
    The goal of inductive logic programming is to induce a set of rules (a logic program) that generalises examples. Inducing programs with many rules and literals is a major challenge. To tackle this challenge, we decompose programs into non-separable fragments, learn fragments separately, and then combine them. We implement our approach in a generate, test, combine, and constrain loop. Our anytime approach can learn optimal, recursive, and large programs and supports predicate invention. Our experiments on multiple domains (including program synthesis and inductive general game playing) show that our approach can increase predictive accuracies and reduce learning times compared to existing approaches.
    What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods. (arXiv:2112.04417v2 [cs.CV] UPDATED)
    A multitude of explainability methods and associated fidelity performance metrics have been proposed to help better understand how modern AI systems make decisions. However, much of the current work has remained theoretical -- without much consideration for the human end-user. In particular, it is not yet known (1) how useful current explainability methods are in practice for more real-world scenarios and (2) how well associated performance metrics accurately predict how much knowledge individual explanations contribute to a human end-user trying to understand the inner-workings of the system. To fill this gap, we conducted psychophysics experiments at scale to evaluate the ability of human participants to leverage representative attribution methods for understanding the behavior of different image classifiers representing three real-world scenarios: identifying bias in an AI system, characterizing the visual strategy it uses for tasks that are too difficult for an untrained non-expert human observer as well as understanding its failure cases. Our results demonstrate that the degree to which individual attribution methods help human participants better understand an AI system varied widely across these scenarios. This suggests a critical need for the field to move past quantitative improvements of current attribution methods towards the development of complementary approaches that provide qualitatively different sources of information to human end-users.
    A Fast and Convergent Proximal Algorithm for Regularized Nonconvex and Nonsmooth Bi-level Optimization. (arXiv:2203.16615v2 [cs.LG] UPDATED)
    Many important machine learning applications involve regularized nonconvex bi-level optimization. However, the existing gradient-based bi-level optimization algorithms cannot handle nonconvex or nonsmooth regularizers, and they suffer from a high computation complexity in nonconvex bi-level optimization. In this work, we study a proximal gradient-type algorithm that adopts the approximate implicit differentiation (AID) scheme for nonconvex bi-level optimization with possibly nonconvex and nonsmooth regularizers. In particular, the algorithm applies Nesterov's momentum to accelerate the computation of the implicit gradient involved in AID. We provide a comprehensive analysis of the global convergence properties of this algorithm through identifying its intrinsic potential function. In particular, we formally establish the convergence of the model parameters to a critical point of the bi-level problem, and obtain an improved computation complexity $\mathcal{O}(\kappa^{3.5}\epsilon^{-2})$ over the state-of-the-art result. Moreover, we analyze the asymptotic convergence rates of this algorithm under a class of local nonconvex geometries characterized by a Łojasiewicz-type gradient inequality. Experiments on hyper-parameter optimization demonstrate the effectiveness of our algorithm.
    Causal Transformer for Estimating Counterfactual Outcomes. (arXiv:2204.07258v2 [cs.LG] UPDATED)
    Estimating counterfactual outcomes over time from observational data is relevant for many applications (e.g., personalized medicine). Yet, state-of-the-art methods build upon simple long short-term memory (LSTM) networks, thus rendering inferences for complex, long-range dependencies challenging. In this paper, we develop a novel Causal Transformer for estimating counterfactual outcomes over time. Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders. For this, we combine three transformer subnetworks with separate inputs for time-varying covariates, previous treatments, and previous outcomes into a joint network with in-between cross-attentions. We further develop a custom, end-to-end training procedure for our Causal Transformer. Specifically, we propose a novel counterfactual domain confusion loss to address confounding bias: it aims to learn adversarial balanced representations, so that they are predictive of the next outcome but non-predictive of the current treatment assignment. We evaluate our Causal Transformer based on synthetic and real-world datasets, where it achieves superior performance over current baselines. To the best of our knowledge, this is the first work proposing transformer-based architecture for estimating counterfactual outcomes from longitudinal data.
    An alternative approach to train neural networks using monotone variational inequality. (arXiv:2202.08876v2 [stat.ML] UPDATED)
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially regarding guarantees on test performance, due to the non-convex nature of the optimization problem. The current paper investigates an alternative approach to neural network training that reduces it to another problem with convex structure, namely solving a monotone variational inequality (MVI), inspired by recent work of Juditsky & Nemirovsky (2019). The solution to the MVI can be found by computationally efficient procedures and, importantly, this leads to performance guarantees in the form of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery and prediction accuracy under the theoretical setting of training a single-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called stochastic variational inequality (SVI), and demonstrate its applicability in training fully-connected neural networks and graph neural networks (GNNs); SVI is completely general and can be used to train other types of neural networks. We demonstrate the competitive or better performance of SVI compared to widely-used stochastic gradient descent methods on both synthetic and real network data prediction tasks with respect to various performance metrics, especially the improved efficiency in the early stage of training.
    Slot Order Matters for Compositional Scene Understanding. (arXiv:2206.01370v1 [cs.CV])
    Empowering agents with a compositional understanding of their environment is a promising next step toward solving long-horizon planning problems. On the one hand, we have seen encouraging progress on variational inference algorithms for obtaining sets of object-centric latent representations ("slots") from unstructured scene observations. On the other hand, generating scenes from slots has received less attention, in part because it is complicated by the lack of a canonical object order. A canonical object order is useful for learning the object correlations necessary to generate physically plausible scenes similar to how raster scan order facilitates learning pixel correlations for pixel-level autoregressive image generation. In this work, we address this lack by learning a fixed object order for a hierarchical variational autoencoder with a single level of autoregressive slots and a global scene prior. We cast autoregressive slot inference as a set-to-sequence modeling problem. We introduce an auxiliary loss to train the slot prior to generate objects in a fixed order. During inference, we align a set of inferred slots to the object order obtained from a slot prior rollout. To ensure the rolled out objects are meaningful for the given scene, we condition the prior on an inferred global summary of the input. Experiments on compositional environments and ablations demonstrate that our model with global prior, inference with aligned slot order, and auxiliary loss achieves state-of-the-art sample quality.
    It's DONE: Direct ONE-shot learning with Hebbian weight imprinting. (arXiv:2204.13361v2 [cs.LG] UPDATED)
    Learning a new concept from a single example is a remarkable ability of the human brain, and it is drawing attention in machine learning as the one-shot learning task. In this paper, we propose the simplest method for this task, a nonparametric weight imprinting named Direct ONE-shot learning (DONE). DONE adds new classes to a pretrained deep neural network (DNN) classifier with neither training optimization nor modification of the pretrained DNN. Inspired by Hebbian theory, DONE directly uses the neural activity input to the final dense layer, obtained from a data sample belonging to the new class, as the connectivity weight (synaptic strength) to a newly provided output neuron for that class, after transforming all statistical properties of the neural activity into those of synaptic strength. DONE requires just one inference pass to learn a new concept, and its procedure is simple, deterministic, and free of parameter tuning and hyperparameters. The performance of DONE depends entirely on the pretrained DNN used as a backbone, and we confirmed that DONE with a well-trained backbone achieves practical-level accuracy. The advantages of DONE include practical use of DNNs where costly training is not affordable, evaluation of existing DNN models, and understanding of the brain. DONE may suggest that one-shot learning is an easy task that can be achieved by a simple principle, not only for humans but also for current well-trained DNN models.
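    The imprinting step described above can be sketched in a few lines. This is only an illustration: the abstract does not say how the statistics of the activity are mapped onto those of the weights, so quantile normalization against the rank-wise mean of the existing weight rows is an assumption here, and `quantile_normalize` and `imprint_new_class` are hypothetical names.

```python
def quantile_normalize(activity, reference):
    """Give `activity` the value distribution of `reference`, keeping its rank order."""
    if len(activity) != len(reference):
        raise ValueError("activity and reference must have the same length")
    # Indices of `activity` from smallest to largest value.
    order = sorted(range(len(activity)), key=lambda i: activity[i])
    ref_sorted = sorted(reference)
    out = [0.0] * len(activity)
    for rank, idx in enumerate(order):
        out[idx] = ref_sorted[rank]
    return out


def imprint_new_class(weights, activity):
    """Return a new weight matrix with one extra row for the new class.

    `weights` holds one row per existing class of the final dense layer;
    `activity` is the input to that layer for the single new-class example.
    """
    # Assumed reference distribution: rank-wise mean of the existing rows.
    sorted_rows = [sorted(row) for row in weights]
    reference = [sum(col) / len(col) for col in zip(*sorted_rows)]
    return weights + [quantile_normalize(activity, reference)]


# Toy example: two existing classes, three features in the final layer.
weights = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
activity = [3.0, 0.0, 7.5]
new_weights = imprint_new_class(weights, activity)
new_row = new_weights[-1]
```

    The new row keeps the ordering of the activity vector (its largest feature still gets the largest weight) while its values lie on the same scale as the existing weights, and no gradient step is involved.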
    JARVix at SemEval-2022 Task 2: It Takes One to Know One? Idiomaticity Detection using Zero and One Shot Learning. (arXiv:2202.02394v4 [cs.CL] UPDATED)
    Large Language Models have been successful in a wide variety of Natural Language Processing tasks by capturing the compositionality of the text representations. In spite of their great success, these vector representations fail to capture the meaning of idiomatic multi-word expressions (MWEs). In this paper, we focus on the detection of idiomatic expressions using binary classification. We use a dataset consisting of the literal and idiomatic usage of MWEs in English and Portuguese. We then perform the classification in two different settings, zero shot and one shot, to determine whether a given sentence contains an idiom. N-shot classification for this task is defined by the number N of common idioms between the training and testing sets. We train multiple Large Language Models in both settings and achieve an F1 score (macro) of 0.73 for the zero shot setting and an F1 score (macro) of 0.85 for the one shot setting. An implementation of our work can be found at https://github.com/ashwinpathak20/Idiomaticity_Detection_Using_Few_Shot_Learning.
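    For readers unfamiliar with the metric quoted above, macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. A minimal sketch with made-up labels (not data from the paper):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 1 = idiomatic, 0 = literal.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

    Here the idiomatic class has F1 = 2/3 and the literal class has F1 = 0.8, so the macro score is their plain average.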
    HierAttn: Effectively Learn Representations from Stage Attention and Branch Attention for Skin Lesions Diagnosis. (arXiv:2205.04326v5 [eess.IV] UPDATED)
    Accurate and unbiased examinations of skin lesions are critical for the early diagnosis and treatment of skin conditions and disorders. Visual features of skin lesions vary significantly because the images are collected from patients with different lesion colours and morphologies by using dissimilar imaging equipment. Recent studies have reported ensembled convolutional neural networks (CNNs) to classify the images for early diagnosis of skin disorders. However, the practical use of these ensembled CNNs is limited because they are heavyweight and inadequate for using contextual information. Although lightweight networks (e.g., MobileNetV3 and EfficientNet) were developed to achieve parameters reduction for implementing deep neural networks on mobile devices, insufficient depth of feature representation restricts the performance. To address the existing limitations, we introduce a new lite and effective neural network, namely HierAttn. The HierAttn applies a novel strategy to learn the local and global features by using multi-stage and multi-branch attention mechanisms. The efficacy of HierAttn was evaluated by using the dermoscopy images dataset ISIC2019 and smartphone photos dataset PAD-UFES-20 (PAD20). The experimental results show that HierAttn achieves the best accuracy and AUC among the state-of-the-art lightweight networks. The code is available at https://github.com/anthonyweidai/HierAttn.
    On the Generalization of Wasserstein Robust Federated Learning. (arXiv:2206.01432v1 [cs.LG])
    In federated learning, participating clients typically possess non-i.i.d. data, posing a significant challenge to generalization to unseen distributions. To address this, we propose a Wasserstein distributionally robust optimization scheme called WAFL. Leveraging its duality, we frame WAFL as an empirical surrogate risk minimization problem, and solve it using a local SGD-based algorithm with convergence guarantees. We show that the robustness of WAFL is more general than related approaches, and the generalization bound is robust to all adversarial distributions inside the Wasserstein ball (ambiguity set). Since the center location and radius of the Wasserstein ball can be suitably modified, WAFL shows its applicability not only in robustness but also in domain adaptation. Through empirical evaluation, we demonstrate that WAFL generalizes better than the vanilla FedAvg in non-i.i.d. settings, and is more robust than other related methods in distribution shift settings. Further, using benchmark datasets we show that WAFL is capable of generalizing to unseen target domains.
    On the Benefits of Large Learning Rates for Kernel Methods. (arXiv:2202.13733v2 [stat.ML] UPDATED)
    This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. The phenomenon was first observed in the deep learning literature; we show that it can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this benefit already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.
    Non-Intrusive Reduced Models based on Operator Inference for Chaotic Systems. (arXiv:2206.01604v1 [cs.LG])
    This work explores the physics-driven machine learning technique Operator Inference (OpInf) for predicting the state of chaotic dynamical systems. OpInf provides a non-intrusive approach to infer approximations of polynomial operators in reduced space without having access to the full order operators appearing in discretized models. Datasets for the physics systems are generated using conventional numerical solvers and then projected to a low-dimensional space via Principal Component Analysis (PCA). In latent space, a least-squares problem is set up to fit a quadratic polynomial operator, which is subsequently employed in a time-integration scheme to produce extrapolations in the same space. Once solved, the inverse PCA operation is applied to reconstruct the extrapolations in the original space. The quality of the OpInf predictions is assessed via the Normalized Root Mean Squared Error (NRMSE) metric, from which the Valid Prediction Time (VPT) is computed. Numerical experiments considering the chaotic Lorenz 96 system and the Kuramoto-Sivashinsky (KS) equation show promising forecasting capabilities of the OpInf reduced order models, with VPT ranges that outperform state-of-the-art machine learning methods such as backpropagation and reservoir computing recurrent neural networks [1]. The best results based on randomized initial conditions show that the Lorenz 96 system can be forecast up to 6.66 or 3.19 Lyapunov time units for the forcing terms F=8 and F=10, respectively, while the KS system achieved a remarkable 794 Lyapunov time units.
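    The evaluation metric mentioned above can be illustrated as follows. The abstract does not spell out the definitions, so this sketch assumes a common convention: NRMSE at each step is the root mean squared error over state components after dividing by a per-component standard deviation, and VPT is the time before NRMSE first exceeds a threshold, expressed in Lyapunov time units. The threshold of 0.5 and the toy numbers are assumptions.

```python
import math


def nrmse(pred, true, sigma):
    """Normalized RMSE over the state components at one time step."""
    total = sum(((p - t) / s) ** 2 for p, t, s in zip(pred, true, sigma))
    return math.sqrt(total / len(true))


def valid_prediction_time(preds, trues, sigma, dt, lyapunov_exponent, threshold=0.5):
    """Time (in Lyapunov units) before NRMSE first exceeds `threshold`."""
    steps = 0
    for pred, true in zip(preds, trues):
        if nrmse(pred, true, sigma) > threshold:
            break
        steps += 1
    # Physical time multiplied by the largest Lyapunov exponent.
    return steps * dt * lyapunov_exponent


# Toy trajectory: the forecast error doubles at every step.
trues = [[0.0, 0.0] for _ in range(5)]
preds = [[0.1 * 2 ** k, 0.1 * 2 ** k] for k in range(5)]
sigma = [1.0, 1.0]
vpt = valid_prediction_time(preds, trues, sigma, dt=0.1, lyapunov_exponent=2.0)
```

    In this toy case the per-step errors are 0.1, 0.2, 0.4, 0.8, 1.6, so three steps stay under the threshold and the VPT is 3 × 0.1 × 2.0 = 0.6 Lyapunov time units.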
    Alternating Synthetic and Real Gradients for Neural Language Modeling. (arXiv:1902.10630v2 [cs.LG] UPDATED)
    Training recurrent neural networks (RNNs) with backpropagation through time (BPTT) has known drawbacks, such as difficulty in capturing long-term dependencies in sequences. Successful alternatives to BPTT have not yet been discovered. Recently, BP with synthetic gradients from a decoupled neural interface module has been proposed to replace BPTT for training RNNs. On the other hand, it has been shown that the representations learned with synthetic and real gradients are different even though they are functionally identical. In this project, we explore ways of combining synthetic and real gradients, with application to neural language modeling tasks. Empirically, we demonstrate the effectiveness of alternating training with synthetic and real gradients after periodic warm restarts on language modeling tasks.
    Measuring Unintended Memorisation of Unique Private Features in Neural Networks. (arXiv:2202.08099v1 [cs.LG] CROSS LISTED)
    Neural networks pose a privacy risk to training data due to their propensity to memorise and leak information. Focusing on image classification, we show that neural networks also unintentionally memorise unique features even when they occur only once in training data. An example of a unique feature is a person's name that is accidentally present on a training image. Assuming access to the inputs and outputs of a trained model, the domain of the training data, and knowledge of unique features, we develop a score estimating the model's sensitivity to a unique feature by comparing the KL divergences of the model's output distributions given modified out-of-distribution images. Our results suggest that unique features are memorised by multi-layer perceptrons and convolutional neural networks trained on benchmark datasets, such as MNIST, Fashion-MNIST and CIFAR-10. We find that strategies to prevent overfitting (e.g., early stopping, regularisation, batch normalisation) do not prevent memorisation of unique features. These results imply that neural networks pose a privacy risk to rarely occurring private information. These risks can be more pronounced in healthcare applications if patient information is present in the training data.
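    The building block of such a score is the KL divergence between two output distributions. The sketch below shows only that building block plus one hypothetical way of comparing two divergences; the probabilities are invented, and the actual score in the paper may be constructed differently.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical softmax outputs of a trained model on an out-of-distribution image:
p_base = [0.25, 0.25, 0.25, 0.25]      # image with no inserted feature
p_unique = [0.70, 0.10, 0.10, 0.10]    # same image with the unique feature inserted
p_control = [0.28, 0.24, 0.24, 0.24]   # same image with an unseen control feature

shift_unique = kl_divergence(p_base, p_unique)
shift_control = kl_divergence(p_base, p_control)

# A much larger shift for the unique feature than for the control would hint
# that the model reacts specifically to the feature it saw during training.
is_sensitive = shift_unique > shift_control
```

    Comparing against a control feature is meant to separate memorisation from a generic reaction to any pasted-in patch; that design choice is an assumption of this sketch.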
    Safety Certification for Stochastic Systems via Neural Barrier Functions. (arXiv:2206.01463v1 [eess.SY])
    Providing non-trivial certificates of safety for non-linear stochastic systems is an important open problem that limits the wider adoption of autonomous systems in safety-critical applications. One promising solution to address this problem is barrier functions. The composition of a barrier function with a stochastic system forms a supermartingale, thus enabling the computation of the probability that the system stays in a safe set over a finite time horizon via martingale inequalities. However, existing approaches to find barrier functions for stochastic systems generally rely on convex optimization programs that restrict the search of a barrier to a small class of functions such as low degree SoS polynomials and can be computationally expensive. In this paper, we parameterize a barrier function as a neural network and show that techniques for robust training of neural networks can be successfully employed to find neural barrier functions. Specifically, we leverage bound propagation techniques to certify that a neural network satisfies the conditions to be a barrier function via linear programming and then employ the resulting bounds at training time to enforce the satisfaction of these conditions. We also present a branch-and-bound scheme that makes the certification framework scalable. We show that our approach outperforms existing methods in several case studies and often returns certificates of safety that are orders of magnitude larger.
    Meta-Auto-Decoder for Solving Parametric Partial Differential Equations. (arXiv:2111.08823v2 [cs.LG] UPDATED)
    Partial Differential Equations (PDEs) are ubiquitous in many disciplines of science and engineering and notoriously difficult to solve. In general, closed-form solutions of PDEs are unavailable and numerical approximation methods are computationally expensive. The parameters of PDEs are variable in many applications, such as inverse problems, control and optimization, risk assessment, and uncertainty quantification. In these applications, our goal is to solve parametric PDEs rather than one instance of them. Our proposed approach, called Meta-Auto-Decoder (MAD), treats solving parametric PDEs as a meta-learning problem and utilizes the Auto-Decoder structure of DeepSDF (Park et al., 2019) to deal with different tasks/PDEs. Physics-informed losses induced from the PDE governing equations and boundary conditions are used as the training losses for different tasks. The goal of MAD is to learn a good model initialization that can generalize across different tasks and eventually enable an unseen task to be learned faster. The inspiration for MAD comes from the (conjectured) low-dimensional structure of parametric PDE solutions, and we explain our approach from the perspective of manifold learning. Finally, we demonstrate the power of MAD through extensive numerical studies, including Burgers' equation, Laplace's equation and time-domain Maxwell's equations. MAD exhibits faster convergence without losing accuracy compared with other deep learning methods.
    PAC Statistical Model Checking of Mean Payoff in Discrete- and Continuous-Time MDP. (arXiv:2206.01465v1 [eess.SY])
    Markov decision processes (MDP) and continuous-time MDP (CTMDP) are the fundamental models for non-deterministic systems with probabilistic uncertainty. Mean payoff (a.k.a. long-run average reward) is one of the most classic objectives considered in their context. We provide the first algorithm to compute mean payoff probably approximately correctly in unknown MDP; further, we extend it to unknown CTMDP. We do not require any knowledge of the state space, only a lower bound on the minimum transition probability, which has been advocated in literature. In addition to providing probably approximately correct (PAC) bounds for our algorithm, we also demonstrate its practical nature by running experiments on standard benchmarks.
    Learning with convolution and pooling operations in kernel methods. (arXiv:2111.08308v2 [stat.ML] UPDATED)
    Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for their success is that these architectures encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge. In this paper, we consider the stylized setting of covariates (image pixels) uniformly distributed on the hypercube, and characterize exactly the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We use this characterization to compute sharp asymptotics of the generalization error for any given function in high-dimension. In particular, we quantify the gain in sample complexity brought by enforcing locality with the convolution operation and approximate translation invariance with average pooling. Notably, these results provide a precise description of how convolution and pooling operations trade off approximation with generalization power in one layer convolutional kernels.
    BioADAPT-MRC: Adversarial Learning-based Domain Adaptation Improves Biomedical Machine Reading Comprehension Task. (arXiv:2202.13174v2 [cs.CL] UPDATED)
    Biomedical machine reading comprehension (biomedical-MRC) aims to comprehend complex biomedical narratives and assist healthcare professionals in retrieving information from them. The high performance of modern neural network-based MRC systems depends on high-quality, large-scale, human-annotated training datasets. In the biomedical domain, a crucial challenge in creating such datasets is the requirement for domain knowledge, inducing the scarcity of labeled data and the need for transfer learning from the labeled general-purpose (source) domain to the biomedical (target) domain. However, there is a discrepancy in marginal distributions between the general-purpose and biomedical domains due to the variances in topics. Therefore, direct-transferring of learned representations from a model trained on a general-purpose domain to the biomedical domain can hurt the model's performance. We present an adversarial learning-based domain adaptation framework for the biomedical machine reading comprehension task (BioADAPT-MRC), a neural network-based method to address the discrepancies in the marginal distributions between the general and biomedical domain datasets. BioADAPT-MRC relaxes the need for generating pseudo labels for training a well-performing biomedical-MRC model. We extensively evaluate the performance of BioADAPT-MRC by comparing it with the best existing methods on three widely used benchmark biomedical-MRC datasets -- BioASQ-7b, BioASQ-8b, and BioASQ-9b. Our results suggest that without using any synthetic or human-annotated data from the biomedical domain, BioADAPT-MRC can achieve state-of-the-art performance on these datasets. Availability: BioADAPT-MRC is freely available as an open-source project at https://github.com/mmahbub/BioADAPT-MRC.
    Improving Fairness in Large-Scale Object Recognition by CrowdSourced Demographic Information. (arXiv:2206.01326v1 [cs.CV])
    There has been increasing awareness of ethical issues in machine learning, and fairness has become an important research topic. Most fairness efforts in computer vision have been focused on human sensing applications and preventing discrimination by people's physical attributes such as race, skin color or age by increasing visual representation for particular demographic groups. We argue that ML fairness efforts should extend to object recognition as well. Buildings, artwork, food and clothing are examples of the objects that define human culture. Representing these objects fairly in machine learning datasets will lead to models that are less biased towards a particular culture and more inclusive of different traditions and values. There exist many research datasets for object recognition, but they have not carefully considered which classes should be included, or how much training data should be collected per class. To address this, we propose a simple and general approach, based on crowdsourcing the demographic composition of the contributors: we define fair relevance scores, estimate them, and assign them to each class. We showcase its application to the landmark recognition domain, presenting a detailed analysis and the final fairer landmark rankings. We present analysis which leads to a much fairer coverage of the world compared to existing datasets. The evaluation dataset was used for the 2021 Google Landmark Challenges, which was the first of a kind with an emphasis on fairness in generic object recognition.
    Instant Graph Neural Networks for Dynamic Graphs. (arXiv:2206.01379v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely used for modeling graph-structured data. With the development of numerous GNN variants, recent years have witnessed groundbreaking results in improving the scalability of GNNs to work on static graphs with millions of nodes. However, how to instantly represent continuous changes of large-scale dynamic graphs with GNNs is still an open problem. Existing dynamic GNNs focus on modeling the periodic evolution of graphs, often on a snapshot basis. Such methods suffer from two drawbacks: first, there is a substantial delay for the changes in the graph to be reflected in the graph representations, resulting in losses on the model's accuracy; second, repeatedly calculating the representation matrix on the entire graph in each snapshot is predominantly time-consuming and severely limits the scalability. In this paper, we propose Instant Graph Neural Network (InstantGNN), an incremental computation approach for the graph representation matrix of dynamic graphs. Set to work with dynamic graphs with the edge-arrival model, our method avoids time-consuming, repetitive computations and allows instant updates on the representation and instant predictions. Graphs with dynamic structures and dynamic attributes are both supported. The upper bounds of time complexity of those updates are also provided. Furthermore, our method provides an adaptive training strategy, which guides the model to retrain at moments when it can make the greatest performance gains. We conduct extensive experiments on several real-world and synthetic datasets. Empirical results demonstrate that our model achieves state-of-the-art accuracy while having orders-of-magnitude higher efficiency than existing methods.
    Scalable Multirobot Planning for Informed Spatial Sampling. (arXiv:2105.10018v3 [cs.RO] UPDATED)
    This paper presents a distributed scalable multi-robot planning algorithm for informed sampling of quasistatic spatial fields. We address the problem of efficient data collection using multiple autonomous vehicles and consider the effects of communication between multiple robots, acting independently, on the overall sampling performance of the team. We focus on the distributed sampling problem where the robots operate independent of their teammates, but have the ability to communicate their current state to other neighbors within a fixed communication range. Our proposed approach is scalable and adaptive to various environmental scenarios, changing robot team configurations, and runs in real-time, which are important features for many real-world applications. We compare the performance of our proposed algorithm to baseline strategies through simulated experiments that utilize models derived from both synthetic and field deployment data. The results show that our sampling algorithm is efficient even when robots in the team are operating with a limited communication range, thus demonstrating the scalability of our method in sampling large-scale environments.
    Robust Multi-Objective Bayesian Optimization Under Input Noise. (arXiv:2202.07549v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) is a sample-efficient approach for tuning design parameters to optimize expensive-to-evaluate, black-box performance metrics. In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected. Although BO methods have been proposed for optimizing a single objective under input noise, no existing method addresses the practical scenario where there are multiple objectives that are sensitive to input perturbations. In this work, we propose the first multi-objective BO method that is robust to input noise. We formalize our goal as optimizing the multivariate value-at-risk (MVaR), a risk measure of the uncertain objectives. Since directly optimizing MVaR is computationally infeasible in many settings, we propose a scalable, theoretically-grounded approach for optimizing MVaR using random scalarizations. Empirically, we find that our approach significantly outperforms alternative methods and efficiently identifies optimal robust designs that will satisfy specifications across multiple metrics with high probability.
    Adversarial Unlearning: Reducing Confidence Along Adversarial Directions. (arXiv:2206.01367v1 [cs.LG])
    Supervised learning methods trained with maximum likelihood objectives often overfit on training data. Most regularizers that prevent overfitting look to increase confidence on additional examples (e.g., data augmentation, adversarial training), or reduce it on training data (e.g., label smoothing). In this work we propose a complementary regularization strategy that reduces confidence on self-generated examples. The method, which we call RCAD (Reducing Confidence along Adversarial Directions), aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss. In contrast to adversarial training, RCAD does not try to robustify the model to output the original label, but rather regularizes it to have reduced confidence on points generated using much larger perturbations than in conventional adversarial training. RCAD can be easily integrated into training pipelines with a few lines of code. Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques (e.g., label smoothing, MixUp training) to increase test accuracy by 1-3% in absolute value, with more significant gains in the low data regime. We also provide a theoretical analysis that helps to explain these benefits in simplified settings, showing that RCAD can provably help the model unlearn spurious features in the training data.
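    To make the idea concrete, the sketch below evaluates an RCAD-style objective for a tiny linear softmax classifier: take one large step along the input gradient of the training loss, then reward high predictive entropy at the resulting point. This is one plausible reading of the abstract, not the authors' code; the step size, the entropy weight `lam`, and the use of an unnormalized gradient step are assumptions.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, x):
    """Class probabilities of a linear softmax model with weight rows W."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def cross_entropy(W, x, y):
    return -math.log(predict(W, x)[y])


def input_gradient(W, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(W, x)
    diff = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    return [sum(W[k][j] * diff[k] for k in range(len(W))) for j in range(len(x))]


def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


def rcad_objective(W, x, y, step=5.0, lam=0.1):
    """Cross-entropy on x minus lam times the entropy at the perturbed point."""
    grad = input_gradient(W, x, y)
    # One large step in the direction that increases the training loss.
    x_adv = [xi + step * gi for xi, gi in zip(x, grad)]
    loss = cross_entropy(W, x, y) - lam * entropy(predict(W, x_adv))
    return loss, x_adv


W = [[1.0, 0.0], [0.0, 1.0]]
x, y = [2.0, 0.0], 0
objective, x_adv = rcad_objective(W, x, y)
```

    Minimizing this objective lowers the loss on the original example while pushing the prediction at the perturbed point toward uniform, which is the reduced-confidence behaviour the abstract describes.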
    Can Requirements Engineering Support Explainable Artificial Intelligence? Towards a User-Centric Approach for Explainability Requirements. (arXiv:2206.01507v1 [cs.SE])
    With the recent proliferation of artificial intelligence systems, there has been a surge in the demand for explainability of these systems. Explanations help to reduce system opacity, support transparency, and increase stakeholder trust. In this position paper, we discuss synergies between requirements engineering (RE) and Explainable AI (XAI). We highlight challenges in the field of XAI, and propose a framework and research directions on how RE practices can help to mitigate these challenges.
    Generalization for multiclass classification with overparameterized linear models. (arXiv:2206.01399v1 [cs.LG])
    Via an overparameterized linear model with Gaussian features, we provide conditions for good generalization for multiclass classification of minimum-norm interpolating solutions in an asymptotic setting where both the number of underlying features and the number of classes scale with the number of training points. The survival/contamination analysis framework for understanding the behavior of overparameterized learning problems is adapted to this setting, revealing that multiclass classification qualitatively behaves like binary classification in that, as long as there are not too many classes (made precise in the paper), it is possible to generalize well even in some settings where the corresponding regression tasks would not generalize. Besides various technical challenges, it turns out that the key difference from the binary classification setting is that there are relatively fewer positive training examples of each class in the multiclass setting as the number of classes increases, making the multiclass problem "harder" than the binary one.
    Evaluating Transfer-based Targeted Adversarial Perturbations against Real-World Computer Vision Systems based on Human Judgments. (arXiv:2206.01467v1 [cs.CV])
    Computer vision systems are remarkably vulnerable to adversarial perturbations. Transfer-based adversarial images are generated on one (source) system and used to attack another (target) system. In this paper, we take the first step to investigate transfer-based targeted adversarial images in a realistic scenario where the target system is trained on some private data with its inventory of semantic labels not publicly available. Our main contributions include an extensive human-judgment-based evaluation of attack success on the Google Cloud Vision API and additional analysis of the different behaviors of Google Cloud Vision in the face of original images vs. adversarial images. Resources are publicly available at https://github.com/ZhengyuZhao/Targeted-Tansfer/blob/main/google_results.zip.
    Game of Privacy: Towards Better Federated Platform Collaboration under Privacy Restriction. (arXiv:2202.05139v3 [cs.LG] UPDATED)
    Vertical federated learning (VFL) aims to train models from cross-silo data with different feature spaces stored on different platforms. Existing VFL methods usually assume all data on each platform can be used for model training. However, due to the intrinsic privacy risks of federated learning, the total amount of involved data may be constrained. In addition, existing VFL studies usually assume only one platform has task labels and can benefit from the collaboration, making it difficult to attract other platforms to join the collaborative learning. In this paper, we study the platform collaboration problem in VFL under privacy constraints. We propose to incentivize different platforms through a reciprocal collaboration, where all platforms can exploit multi-platform information in the VFL framework to benefit their own tasks. With limited privacy budgets, each platform needs to wisely allocate its data quotas for collaboration with other platforms, so the platforms naturally form a multi-party game. There are two core problems in this game: how to appraise other platforms' data value to compute game rewards, and how to optimize policies to solve the game. To evaluate the contributions of other platforms' data, each platform offers a small amount of "deposit" data to participate in the VFL. We propose a performance estimation method to predict the expected model performance for different combinations of inter-platform data amounts. To solve the game, we propose a platform negotiation method that simulates the bargaining among platforms and locally optimizes their policies via gradient descent. Extensive experiments on two real-world datasets show that our approach can effectively facilitate the collaborative exploitation of multi-platform data in VFL under privacy restrictions.
    Optimal Weak to Strong Learning. (arXiv:2206.01563v1 [cs.LG])
    The classic AdaBoost algorithm converts a weak learner, that is, an algorithm that produces a hypothesis slightly better than chance, into a strong learner that achieves arbitrarily high accuracy when given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak to strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.
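    The weak-to-strong conversion mentioned above can be illustrated with a minimal sketch of classic AdaBoost, which is the baseline the abstract refers to and not the new sample-optimal algorithm. The weak learner here is assumed to be a one-dimensional threshold stump, and the names `train_stump`, `adaboost`, and `predict` are hypothetical helpers chosen for this example.

```python
import math

def stump_output(x, threshold, polarity):
    # A stump predicts `polarity` on one side of the threshold and its negation on the other.
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Weak learner (assumed): the stump with the smallest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_output(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Classic AdaBoost: reweight examples and combine stumps by weighted vote."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_output(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_output(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
strong = adaboost(xs, ys, rounds=3)
print([predict(strong, x) for x in xs])
```

    On this toy dataset no single stump separates the labels, while the weighted vote of three stumps does.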
    Adaptive Learning for Discovery. (arXiv:2205.14829v2 [stat.ML] UPDATED)
    In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal to maximize the sum of responses. This problem has wide applications to real-world discovery problems, for example drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma. The algorithm needs to choose points that yield information to improve model estimates but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
    Constraining Gaussian processes for physics-informed acoustic emission mapping. (arXiv:2206.01495v1 [cs.LG])
    The automated localisation of damage in structures is a challenging but critical ingredient in the path towards predictive or condition-based maintenance of high value structures. The use of acoustic emission time of arrival mapping is a promising approach to this challenge, but is severely hindered by the need to collect a dense set of artificial acoustic emission measurements across the structure, resulting in a lengthy and often impractical data acquisition process. In this paper, we consider the use of physics-informed Gaussian processes for learning these maps to alleviate this problem. In the approach, the Gaussian process is constrained to the physical domain such that information relating to the geometry and boundary conditions of the structure is embedded directly into the learning process, returning a model that guarantees that any predictions made satisfy physically-consistent behaviour at the boundary. A number of scenarios that arise when training measurement acquisition is limited are considered, including cases where training data are sparse and cases where they have limited coverage over the structure of interest. Using a complex plate-like structure as an experimental case study, we show that our approach significantly reduces the burden of data collection: incorporating boundary condition knowledge significantly improves predictive accuracy as training observations are reduced, particularly when training measurements are not available across all parts of the structure.
    Accelerating hydrodynamic simulations of urban drainage systems with physics-guided machine learning. (arXiv:2206.01538v1 [cs.LG])
    We propose and demonstrate a new approach for fast and accurate surrogate modelling of urban drainage system hydraulics based on physics-guided machine learning. The surrogates are trained against a limited set of simulation results from a hydrodynamic (HiFi) model. Our approach reduces simulation times by one to two orders of magnitude compared to a HiFi model. It is thus slower than e.g. conceptual hydrological models, but it enables simulations of water levels, flows and surcharges in all nodes and links of a drainage network and thus largely preserves the level of detail provided by HiFi models. Comparing time series simulated by the surrogate and the HiFi model, $R^2$ values on the order of 0.9 are achieved. Surrogate training times are currently on the order of one hour. However, they can likely be reduced through the application of transfer learning and graph neural networks. Our surrogate approach will be useful for interactive workshops in initial design phases of urban drainage systems, as well as for real time applications. In addition, our model formulation is generic and future research should investigate its application for simulating other water systems.
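    As a small illustration of the goodness-of-fit score quoted above, the following sketch compares a surrogate time series against a reference series. It assumes the score is the coefficient of determination, 1 - SS_res / SS_tot; the abstract does not say which definition the authors use, and the function name and sample values are made up for this example.

```python
def r_squared(reference, predicted):
    """Coefficient of determination of `predicted` against `reference` (assumed definition)."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

# Hypothetical water levels from a HiFi model and from a surrogate.
hifi = [1.0, 2.0, 3.0, 4.0]
surrogate = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(hifi, surrogate), 3))
```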
    Reinforcement Learning with Neural Radiance Fields. (arXiv:2206.01634v1 [cs.LG])
    It is a long-standing problem to find effective representations for training reinforcement learning (RL) agents. This paper demonstrates that learning state representations with supervision from Neural Radiance Fields (NeRFs) can improve the performance of RL compared to other learned representations or even low-dimensional, hand-engineered state information. Specifically, we propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene. The decoder built from a latent-conditioned NeRF serves as the supervision signal to learn the latent space. An RL algorithm then operates on the learned latent space as its state representation. We call this NeRF-RL. Our experiments indicate that NeRF as supervision leads to a latent space better suited for the downstream RL tasks involving robotic object manipulations like hanging mugs on hooks, pushing objects, or opening doors. Video: https://dannydriess.github.io/nerf-rl
    UncertaINR: Uncertainty Quantification of End-to-End Implicit Neural Representations for Computed Tomography. (arXiv:2202.10847v2 [eess.IV] UPDATED)
    Implicit neural representations (INRs) have achieved impressive results for scene reconstruction and computer graphics, where their performance has primarily been assessed on reconstruction accuracy. As INRs make their way into other domains, where model predictions inform high-stakes decision-making, uncertainty quantification of INR inference is becoming critical. To that end, we study a Bayesian reformulation of INRs, UncertaINR, in the context of computed tomography, and evaluate several Bayesian deep learning implementations in terms of accuracy and calibration. We find that they achieve well-calibrated uncertainty, while retaining accuracy competitive with other classical, INR-based, and CNN-based reconstruction techniques. In contrast to the best-performing prior approaches, UncertaINR does not require a large training dataset, but only a handful of validation images.
    Central-Smoothing Hypergraph Neural Networks for Predicting Drug-Drug Interactions. (arXiv:2112.07837v3 [cs.LG] UPDATED)
    Predicting drug-drug interactions (DDI) is the problem of predicting side effects (unwanted outcomes) of a pair of drugs using drug information and known side effects of many pairs. This problem can be formulated as predicting labels (i.e. side effects) for each pair of nodes in a DDI graph, whose nodes are drugs and whose edges are interacting drugs with known labels. State-of-the-art methods for this problem are graph neural networks (GNNs), which leverage neighborhood information in the graph to learn node representations. For DDI, however, there are many labels with complicated relationships due to the nature of side effects. Usual GNNs often fix labels as one-hot vectors that do not reflect label relationships and potentially do not obtain the highest performance in the difficult cases of infrequent labels. In this paper, we formulate DDI as a hypergraph where each hyperedge is a triple: two nodes for drugs and one node for a label. We then present CentSmoothie, a hypergraph neural network that jointly learns representations of nodes and labels with a novel central-smoothing formulation. We empirically demonstrate the performance advantages of CentSmoothie in simulations as well as real datasets.
    Transformer-Based Self-Supervised Learning for Emotion Recognition. (arXiv:2204.05103v2 [q-bio.NC] CROSS LISTED)
    In order to exploit representations of time-series signals, such as physiological signals, it is essential that these representations capture relevant information from the whole signal. In this work, we propose to use a Transformer-based model to process electrocardiograms (ECG) for emotion recognition. Attention mechanisms of the Transformer can be used to build contextualized representations for a signal, giving more importance to relevant parts. These representations may then be processed with a fully-connected network to predict emotions. To overcome the relatively small size of datasets with emotional labels, we employ self-supervised learning. We gathered several ECG datasets with no labels of emotion to pre-train our model, which we then fine-tuned for emotion recognition on the AMIGOS dataset. We show that our approach reaches state-of-the-art performances for emotion recognition using ECG signals on AMIGOS. More generally, our experiments show that transformers and pre-training are promising strategies for emotion recognition with physiological signals.
    Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards. (arXiv:2206.01293v1 [cs.LG])
    Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem and standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.
    Completion Time Minimization of Fog-RAN-Assisted Federated Learning With Rate-Splitting Transmission. (arXiv:2206.01373v1 [eess.SP])
    This work studies federated learning (FL) over a fog radio access network, in which multiple internet-of-things (IoT) devices cooperatively learn a shared machine learning model by communicating with a cloud server (CS) through distributed access points (APs). Under the assumption that the fronthaul links connecting APs to CS have finite capacity, a rate-splitting transmission at IoT devices (IDs) is proposed which enables hybrid edge and cloud decoding of split uplink messages. The problem of completion time minimization for FL is tackled by optimizing the rate-splitting transmission and fronthaul quantization strategies along with training hyperparameters such as precision and iteration numbers. Numerical results show that the proposed rate-splitting transmission achieves notable gains over benchmark schemes which rely solely on edge or cloud decoding.
    A Survey on Surrogate-assisted Efficient Neural Architecture Search. (arXiv:2206.01520v1 [cs.LG])
    Neural architecture search (NAS) has become increasingly popular in the deep learning community recently, mainly because it allows users without rich expertise to benefit from the success of deep neural networks (DNNs). However, NAS is still laborious and time-consuming because a large number of performance estimations are required during the search process, and training DNNs is computationally intensive. Improving efficiency is therefore essential in the design of NAS. This paper begins with a brief introduction to the general framework of NAS. Then, the methods for evaluating network candidates under the proxy metrics are systematically discussed. This is followed by a description of surrogate-assisted NAS, which is divided into three different categories, namely Bayesian optimization for NAS, surrogate-assisted evolutionary algorithms for NAS, and MOP for NAS. Finally, remaining challenges and open research questions are discussed, and promising research topics are suggested in this emerging field.
    Decentralized Training of Foundation Models in Heterogeneous Environments. (arXiv:2206.01288v1 [cs.DC])
    Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).
    Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees. (arXiv:2206.01299v1 [cs.LG])
    Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AC-SGD, a novel activation compression algorithm for communication-efficient pipeline parallelism training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AC-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AC-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AC-SGD to fine-tune language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AC-SGD provides up to 4.3X end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end communication compression": all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9X end-to-end speed-up, without sacrificing model quality.
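    The idea of compressing the change of an activation rather than the activation itself can be sketched as follows. This is a toy illustration under stated assumptions, not the AC-SGD implementation: the uniform quantizer, the per-sample cache, the full-precision first transmission, and the names `quantize` and `DeltaCompressor` are all hypothetical choices made for this example. For simplicity the code returns dequantized values; a real system would transmit integer codes plus the range.

```python
def quantize(values, bits):
    """Uniform quantizer (assumed): snap each value to one of 2**bits levels in [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [lo] * len(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]

class DeltaCompressor:
    """Sends the quantized difference to the last reconstructed activation of each sample."""

    def __init__(self, bits):
        self.bits = bits
        self.cache = {}  # sample id -> last reconstructed activation (kept on both ends)

    def send(self, sample_id, activation):
        previous = self.cache.get(sample_id)
        if previous is None:
            # Assumption: the first visit of a sample is sent uncompressed.
            reconstructed = list(activation)
        else:
            delta = [a - p for a, p in zip(activation, previous)]
            reconstructed = [p + d for p, d in zip(previous, quantize(delta, self.bits))]
        self.cache[sample_id] = reconstructed
        return reconstructed

compressor = DeltaCompressor(bits=2)
first = [0.3, 9.1, -4.2, 2.7]      # activation of a sample in one epoch
second = [0.31, 9.08, -4.17, 2.7]  # slightly changed activation in the next epoch
compressor.send("sample-0", first)
recon = compressor.send("sample-0", second)
print(max(abs(r - s) for r, s in zip(recon, second)))            # delta compression error
print(max(abs(q - s) for q, s in zip(quantize(second, 2), second)))  # direct compression error
```

    When activations change slowly between visits, the deltas span a much narrower range than the activations, so the same number of bits gives a far smaller reconstruction error.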
    Rethinking Class-Prior Estimation for Positive-Unlabeled Learning. (arXiv:2002.03673v2 [cs.LG] UPDATED)
    Given only positive (P) and unlabeled (U) data, PU learning can train a binary classifier without any negative data. It has two building blocks: PU class-prior estimation (CPE) and PU classification; the latter has been well studied while the former has received less attention. Hitherto, the distributional-assumption-free CPE methods rely on a critical assumption that the support of the positive data distribution cannot be contained in the support of the negative data distribution. If this is violated, those CPE methods will systematically overestimate the class prior; worse still, we cannot verify the assumption based on the data. In this paper, we rethink CPE for PU learning: can we remove the assumption to make CPE always valid? We show an affirmative answer by proposing Regrouping CPE (ReCPE) that builds an auxiliary probability distribution such that the support of the positive data distribution is never contained in the support of the negative data distribution. ReCPE can work with any CPE method by treating it as the base method. Theoretically, ReCPE does not affect its base if the assumption already holds for the original probability distribution; otherwise, it reduces the positive bias of its base. Empirically, ReCPE improves all state-of-the-art CPE methods on various datasets, implying that the assumption has indeed been violated here.
    Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra. (arXiv:2202.00450v7 [cs.LG] UPDATED)
    Low-rank approximation of images via singular value decomposition is well-received in the era of big data. However, singular value decomposition (SVD) is only for order-two data, i.e., matrices. It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD. Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components. We consider the problem of generalizing HOSVD over a finite dimensional commutative algebra. This algebra, referred to as a t-algebra, generalizes the field of complex numbers. The elements of the algebra, called t-scalars, are fixed-size arrays of complex numbers. One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions. The generalization of HOSVD is called THOSVD. Its performance in approximating multi-way data can be further improved by an alternating algorithm. THOSVD also unifies a wide range of principal component analysis algorithms. To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to a "deeper-order" t-scalar. Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts.
    Disentangling Epistemic and Aleatoric Uncertainty in Reinforcement Learning. (arXiv:2206.01558v1 [cs.LG])
    Characterizing aleatoric and epistemic uncertainty on the predicted rewards can help in building reliable reinforcement learning (RL) systems. Aleatoric uncertainty results from the irreducible environment stochasticity leading to inherently risky states and actions. Epistemic uncertainty results from the limited information accumulated during learning to make informed decisions. Characterizing aleatoric and epistemic uncertainty can be used to speed up learning in a training environment, improve generalization to similar testing environments, and flag unfamiliar behavior in anomalous testing environments. In this work, we introduce a framework for disentangling aleatoric and epistemic uncertainty in RL. (1) We first define four desiderata that capture the desired behavior for aleatoric and epistemic uncertainty estimation in RL at both training and testing time. (2) We then present four RL models inspired by supervised learning (i.e. Monte Carlo dropout, ensemble, deep kernel learning models, and evidential networks) to instantiate aleatoric and epistemic uncertainty. Finally, (3) we propose a practical evaluation method to evaluate uncertainty estimation in model-free RL based on detection of out-of-distribution environments and generalization to perturbed environments. We present theoretical and experimental evidence to validate that carefully equipping model-free RL agents with supervised learning uncertainty methods can fulfill our desiderata.
    Compositional Scene Representation Learning via Reconstruction: A Survey. (arXiv:2202.07135v2 [cs.LG] UPDATED)
    Visual scene representation learning is an important research problem in the field of computer vision. The performance of artificial intelligence systems on vision tasks could be improved if more suitable representations are learned for visual scenes. Complex visual scenes are composed of relatively simple visual concepts, and have the property of combinatorial explosion. Compared with directly representing the entire visual scene, extracting compositional scene representations can better cope with the diverse combinations of background and objects. Because compositional scene representations abstract the concept of objects, performing visual scene analysis and understanding based on these representations could be easier and more interpretable. Moreover, learning via reconstruction can greatly reduce the need for training data annotations. Therefore, reconstruction-based compositional scene representation learning has important research significance. In this survey, we first outline the current progress on this research topic, including development history and categorizations of existing methods from the perspectives of modeling of visual scenes and inference of scene representations; then provide benchmarks, including an open source toolbox to reproduce the benchmark experiments, of representative methods that consider the most extensively studied problem setting and form the foundation for other methods; and finally discuss the future directions of this research topic.
    Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. (arXiv:2201.12023v2 [cs.LG] UPDATED)
    Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa
    Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning. (arXiv:2206.01342v1 [cs.LG])
    While the empirical success of self-supervised learning (SSL) heavily relies on the usage of deep nonlinear models, many theoretical works proposed to understand SSL still focus on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We theoretically demonstrate that (1) the presence of nonlinearity leads to many local optima even in the 1-layer setting, each corresponding to certain patterns from the data distribution, while with linear activation, only one major pattern can be learned; and (2) nonlinearity leads to weights that specialize to diverse patterns, a behavior that linear activation is provably incapable of. These findings suggest that models with lots of parameters can be regarded as a \emph{brute-force} way to find these local optima induced by nonlinearity, a possible underlying reason why empirical observations such as the lottery ticket hypothesis hold. In addition, for the 2-layer setting, we also discover \emph{global modulation}: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn, further characterizing the learning process. Simulation verifies our theoretical findings.
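    The homogeneity condition $h(x) = h'(x)x$ can be checked numerically for a few common activations. The sketch below is only an illustration: the choice of ReLU, leaky ReLU, and tanh, the leaky slope 0.01, and the helper name `is_homogeneous` are assumptions for this example and do not come from the paper.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def is_homogeneous(h, dh, points, tol=1e-12):
    """True if h(x) == h'(x) * x holds (up to tol) at every test point."""
    return all(abs(h(x) - dh(x) * x) <= tol for x in points)

points = [-2.0, -0.5, 0.0, 0.5, 2.0]
print(is_homogeneous(relu, relu_grad, points))              # piecewise-linear through the origin
print(is_homogeneous(leaky_relu, leaky_relu_grad, points))  # piecewise-linear through the origin
print(is_homogeneous(math.tanh, tanh_grad, points))         # saturating, so the identity fails
```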
    OntoProtein: Protein Pretraining With Gene Ontology Embedding. (arXiv:2201.11147v6 [q-bio.BM] UPDATED)
    Self-supervised protein language models have proved their effectiveness in learning protein representations. With increasing computational power, current protein language models pre-trained with millions of diverse sequences can advance the parameter scale from million-level to billion-level and achieve remarkable improvement. However, those prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that informative biology knowledge in KGs can enhance protein representation with external knowledge. In this work, we propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, in which gene annotation texts or protein sequences describe all nodes in the graph. We propose novel contrastive learning with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embedding during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in the TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction. Code and datasets are available at https://github.com/zjunlp/OntoProtein.
    XPASC: Measuring Generalization in Weak Supervision. (arXiv:2206.01444v1 [cs.LG])
    Weak supervision is leveraged in a wide range of domains and tasks due to its ability to create massive amounts of labeled data, requiring little manual effort. Standard approaches use labeling functions to specify signals that are relevant for the labeling. It has been conjectured that weakly supervised models over-rely on those signals and as a result suffer from overfitting. To verify this assumption, we introduce a novel method, XPASC (eXPlainability-Association SCore), for measuring the generalization of a model trained with a weakly supervised dataset. Considering the occurrences of features, classes and labeling functions in a dataset, XPASC takes into account the relevance of each feature for the predictions of the model as well as the associations of the feature with the class and the labeling function, respectively. The association in XPASC can be measured in two variants: XPASC-CHI SQUARE measures associations relative to their statistical significance, while XPASC-PPMI measures association strength more generally. We use XPASC to analyze KnowMAN, an adversarial architecture intended to control the degree of generalization from the labeling functions and thus to mitigate the problem of overfitting. On the one hand, we show that KnowMAN is able to control the degree of generalization through a hyperparameter. On the other hand, results and qualitative analysis show that generalization and performance do not relate one-to-one, and that the highest degree of generalization does not necessarily imply the best performance. Therefore methods that allow for controlling the amount of generalization can achieve the right degree of benign overfitting. Our contributions in this study are i) the XPASC score to measure generalization in weakly-supervised models, ii) evaluation of XPASC across datasets and models and iii) the release of the XPASC implementation.
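    For readers unfamiliar with PPMI, the association measure named in the second variant, here is a minimal sketch of the standard positive pointwise mutual information computed from co-occurrence counts. It shows only the textbook definition; how XPASC combines it with feature relevance is not specified in the abstract, and the function name and counts are hypothetical.

```python
import math

def ppmi(joint_count, count_x, count_y, total):
    """Positive PMI: max(0, log2(P(x, y) / (P(x) * P(y)))), estimated from counts."""
    if joint_count == 0:
        return 0.0  # PMI would be -infinity; PPMI clips it to zero
    pmi = math.log2((joint_count * total) / (count_x * count_y))
    return max(0.0, pmi)

# Hypothetical counts: a feature seen 20 times, a class seen 25 times, 100 examples in total.
print(ppmi(joint_count=10, count_x=20, count_y=25, total=100))  # co-occur more than chance
print(ppmi(joint_count=1, count_x=20, count_y=25, total=100))   # co-occur less than chance
```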
    Equipping Black-Box Policies with Model-Based Advice for Stable Nonlinear Control. (arXiv:2206.01341v1 [cs.LG])
    Machine-learned black-box policies are ubiquitous for nonlinear control problems. Meanwhile, crude model information is often available for these problems from, e.g., linear approximations of nonlinear dynamics. We study the problem of equipping a black-box control policy with model-based advice for nonlinear control on a single trajectory. We first show a general negative result that a naive convex combination of a black-box policy and a linear model-based policy can lead to instability, even if the two policies are both stabilizing. We then propose an adaptive $\lambda$-confident policy, with a coefficient $\lambda$ indicating the confidence in a black-box policy, and prove its stability. With bounded nonlinearity, in addition, we show that the adaptive $\lambda$-confident policy achieves a bounded competitive ratio when a black-box policy is near-optimal. Finally, we propose an online learning approach to implement the adaptive $\lambda$-confident policy and verify its efficacy in case studies about the CartPole problem and a real-world electric vehicle (EV) charging problem with data bias due to COVID-19.
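    The naive convex combination discussed above can be written down in a few lines. This sketch uses a scalar linear system and a fixed confidence coefficient, both assumptions made for illustration; it does not implement the adaptive rule for $\lambda$ proposed in the paper, and in this scalar toy case the combination happens to be stable. The names `combined_action` and `rollout`, the system constants, and the stand-in black-box policy are hypothetical.

```python
def combined_action(state, black_box_policy, gain, confidence):
    """Blend a black-box action with a linear model-based action u = -gain * state."""
    model_based = -gain * state
    return confidence * black_box_policy(state) + (1.0 - confidence) * model_based

def rollout(a, b, x0, steps, black_box_policy, gain, confidence):
    """Simulate x_{t+1} = a * x_t + b * u_t under the blended policy."""
    x = x0
    for _ in range(steps):
        x = a * x + b * combined_action(x, black_box_policy, gain, confidence)
    return x

black_box = lambda x: -1.0 * x  # stand-in for a learned policy (hypothetical)
final_state = rollout(a=1.2, b=1.0, x0=1.0, steps=10,
                      black_box_policy=black_box, gain=1.2, confidence=0.5)
print(abs(final_state))
```

    Setting `confidence` to 0 recovers the purely model-based controller and setting it to 1 recovers the black-box policy.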
    Modeling electronic health record data using a knowledge-graph-embedded topic model. (arXiv:2206.01436v1 [cs.LG])
    The rapid growth of electronic health record (EHR) datasets opens up promising opportunities to understand human diseases in a systematic way. However, effective extraction of clinical knowledge from the EHR data has been hindered by its sparsity and noisy information. We present KG-ETM, an end-to-end knowledge graph-based multimodal embedded topic model. KG-ETM distills latent disease topics from EHR data by learning the embedding from the medical knowledge graphs. We applied KG-ETM to a large-scale EHR dataset consisting of over 1 million patients. We evaluated its performance based on EHR reconstruction and drug imputation. KG-ETM demonstrated superior performance over the alternative methods on both tasks. Moreover, our model learned clinically meaningful graph-informed embedding of the EHR codes. In addition, our model is able to discover interpretable and accurate patient representations for patient stratification and drug recommendations.
    Regularization-wise double descent: Why it occurs and how to eliminate it. (arXiv:2206.01378v1 [cs.LG])
    The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a super-position of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
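    The remedy described for linear regression, giving each part of the model its own regularization strength, can be sketched with a two-coefficient ridge problem solved in closed form. This is a toy example in which each "part" is a single coefficient; the function name, data, and penalty values are assumptions for illustration and are not taken from the paper.

```python
def ridge_two_parts(xs, ys, lam1, lam2):
    """Minimize sum (y - w1*x1 - w2*x2)^2 + lam1*w1^2 + lam2*w2^2 via the normal equations."""
    a11 = sum(x1 * x1 for x1, _ in xs) + lam1
    a12 = sum(x1 * x2 for x1, x2 in xs)
    a22 = sum(x2 * x2 for _, x2 in xs) + lam2
    b1 = sum(x1 * y for (x1, _), y in zip(xs, ys))
    b2 = sum(x2 * y for (_, x2), y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    # Solve the 2x2 system (X^T X + diag(lam1, lam2)) w = X^T y by Cramer's rule.
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [2.0, 3.0, 5.0]
print(ridge_two_parts(xs, ys, lam1=0.0, lam2=0.0))   # unregularized least squares
print(ridge_two_parts(xs, ys, lam1=0.1, lam2=10.0))  # second part shrunk much harder
```

    A single shared strength corresponds to `lam1 == lam2`; separate strengths let one part be shrunk without over-shrinking the other.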
    A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features. (arXiv:2206.01717v1 [cs.LG])
    An important characteristic of neural networks is their ability to learn representations of the input data with effective features for prediction, which is believed to be a key factor in their superior empirical performance. To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class-relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution). In contrast, no linear model on data-independent features of polynomial size can achieve comparably small error. Furthermore, if the specific input structure is removed, then no polynomial algorithm in the Statistical Query model can learn even weakly. These results provide theoretical evidence showing that feature learning in neural networks depends strongly on the input structure and leads to superior performance. Our preliminary experimental results on synthetic and real data also provide positive support.
    MetaLR: Layer-wise Learning Rate based on Meta-Learning for Adaptively Fine-tuning Medical Pre-trained Models. (arXiv:2206.01408v1 [cs.CV])
    When applying transfer learning for medical image analysis, downstream tasks often have significant gaps with the pre-training tasks. Previous methods mainly focus on improving the transferabilities of the pre-trained models to bridge the gaps. In fact, model fine-tuning can also play a very important role in tackling this problem. A conventional fine-tuning method updates all deep neural network (DNN) layers with a single learning rate (LR), which ignores the unique transferabilities of different layers. In this work, we explore the behaviors of different layers in the fine-tuning stage. More precisely, we first hypothesize that lower-level layers are more domain-specific while higher-level layers are more task-specific, which is verified by a simple bi-directional fine-tuning scheme. It is harder for the pre-trained specific layers to transfer to new tasks than general layers. On this basis, to make different layers better co-adapt to the downstream tasks according to their transferabilities, a meta-learning-based LR learner, namely MetaLR, is proposed to assign LRs for each layer automatically. Extensive experiments on various medical applications (i.e., POCUS, BUSI, Chest X-ray, and LiTS) confirm our hypothesis and show the superior performance of the proposed methods over previous state-of-the-art fine-tuning methods.
    Detecting Pulmonary Embolism from Computed Tomography Using Convolutional Neural Network. (arXiv:2206.01344v1 [eess.IV])
    The clinical symptoms of pulmonary embolism (PE) are very diverse and non-specific, which makes it difficult to diagnose. In addition, pulmonary embolism has multiple triggers and is one of the major causes of vascular death. Therefore, if it can be detected and treated quickly, the risk of death in hospitalized patients can be significantly reduced. In the detection process, the cost of computed tomography pulmonary angiography (CTPA) is high, and angiography requires the injection of contrast agents, which increases the risk of harm to the patient. Therefore, this study uses a convolutional neural network to detect pulmonary embolism in all patients who undergo a chest CT scan. With the proposed pulmonary embolism detection system, we can detect the possibility of pulmonary embolism at the same time as the patient's first CT image, and schedule the CTPA test immediately, saving more than a week of CT image screening time and providing timely diagnosis and treatment to the patient.
    CodedPaddedFL and CodedSecAgg: Straggler Mitigation and Secure Aggregation in Federated Learning. (arXiv:2112.08909v2 [cs.LG] UPDATED)
    We present two novel federated learning (FL) schemes that mitigate the effect of straggling devices by introducing redundancy on the devices' data across the network. Compared to other schemes in the literature, which deal with stragglers or device dropouts by ignoring their contribution, the proposed schemes do not suffer from the client drift problem. The first scheme, CodedPaddedFL, mitigates the effect of stragglers while retaining the privacy level of conventional FL. It combines one-time padding for user data privacy with gradient codes to yield straggler resiliency. The second scheme, CodedSecAgg, provides straggler resiliency and robustness against model inversion attacks and is based on Shamir's secret sharing. We apply CodedPaddedFL and CodedSecAgg to a classification problem. For a scenario with 120 devices, CodedPaddedFL achieves a speed-up factor of 18 for an accuracy of 95% on the MNIST dataset compared to conventional FL. Furthermore, it yields similar performance in terms of latency compared to a recently proposed scheme by Prakash et al. without the shortcoming of additional leakage of private data. CodedSecAgg outperforms the state-of-the-art secure aggregation scheme LightSecAgg by a speed-up factor of 6.6-18.7 for the MNIST dataset for an accuracy of 95%.
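    Shamir's secret sharing, on which CodedSecAgg is said to be based, can be sketched as follows. This is only a minimal illustration of the textbook primitive, not the CodedSecAgg scheme itself. The field size PRIME and the function names make_shares and reconstruct are assumptions chosen for the example.

```python
import secrets

# Illustrative field size (an assumption): the Mersenne prime 2^61 - 1.
PRIME = 2**61 - 1


def make_shares(secret, threshold, num_shares, prime=PRIME):
    """Split `secret` into `num_shares` points on a random polynomial of
    degree `threshold - 1`; any `threshold` shares recover the secret."""
    coeffs = [secret % prime]
    coeffs += [secrets.randbelow(prime) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def reconstruct(shares, prime=PRIME):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


if __name__ == "__main__":
    shares = make_shares(1234, threshold=3, num_shares=5)
    print(reconstruct(shares[:3]))
```

    Shares are additively homomorphic: adding two devices' shares point by point yields valid shares of the sum of their secrets, which is the property that secure aggregation schemes typically build on.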
    A New Security Boundary of Component Differentially Challenged XOR PUFs Against Machine Learning Modeling Attacks. (arXiv:2206.01314v1 [cs.CR])
    Physical Unclonable Functions (PUFs) are promising security primitives for resource-constrained network nodes. The XOR Arbiter PUF (XOR PUF or XPUF) is an intensively studied PUF invented to improve the security of the Arbiter PUF, probably the most lightweight delay-based PUF. Recently, highly powerful machine learning attack methods were discovered and were able to easily break large-sized XPUFs, which were highly secure against earlier machine learning attack methods. Component-differentially-challenged XPUFs (CDC-XPUFs) are XPUFs with different component PUFs receiving different challenges. Studies showed they were much more secure against machine learning attacks than the conventional XPUFs, whose component PUFs receive the same challenge. But these studies were all based on earlier machine learning attack methods, and hence it is not clear if CDC-XPUFs can remain secure under the recently discovered powerful attack methods. In this paper, the two currently most powerful machine learning methods for attacking XPUFs are adapted to CDC-XPUFs by fine-tuning their parameters. Attack experiments using both simulated PUF data and silicon data generated from PUFs implemented on field-programmable gate arrays (FPGAs) were carried out, and the experimental results showed that some previously secure CDC-XPUFs of certain circuit parameter values are no longer secure under the adapted new attack methods, while many more CDC-XPUFs of other circuit parameter values remain secure. Thus, our experimental attack study has re-defined the boundary between the secure region and the insecure region of the PUF circuit parameter space, providing PUF manufacturers and IoT security application developers with valuable information in choosing PUFs with secure parameter values.
    MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data. (arXiv:2206.01347v1 [cs.AI])
    Numerical reasoning over hybrid data containing both textual and tabular content (e.g., financial reports) has recently attracted much attention in the NLP community. However, existing question answering (QA) benchmarks over hybrid data only include a single flat table in each document and thus lack examples of multi-step numerical reasoning across multiple hierarchical tables. To facilitate data analytical progress, we construct a new large-scale benchmark, MultiHiertt, with QA pairs over Multi Hierarchical Tabular and Textual data. MultiHiertt is built from a wealth of financial reports and has the following unique characteristics: 1) each document contains multiple tables and longer unstructured texts; 2) most of the tables are hierarchical; 3) the reasoning process required for each question is more complex and challenging than in existing benchmarks; and 4) fine-grained annotations of reasoning processes and supporting facts are provided to reveal complex numerical reasoning. We further introduce a novel QA model termed MT2Net, which first applies fact retrieval to extract relevant supporting facts from both tables and text and then uses a reasoning module to perform symbolic reasoning over the retrieved facts. We conduct comprehensive experiments on various baselines. The experimental results show that MultiHiertt presents a strong challenge for existing baselines whose results lag far behind the performance of human experts. The dataset and code are publicly available at https://github.com/psunlpgroup/MultiHiertt.
    Fair Classification via Transformer Neural Networks: Case Study of an Educational Domain. (arXiv:2206.01410v1 [cs.LG])
    Educational technologies nowadays increasingly use data and Machine Learning (ML) models. This gives students, instructors, and administrators support and insights for the optimum policy. However, it is well acknowledged that ML models are subject to bias, which raises concerns about the fairness and potential discrimination of using these automated ML algorithms in education, and about their unintended and unforeseen negative consequences. Bias in decision-making arises from the datasets used to train ML models and from the model architecture. This paper presents a preliminary investigation of fairness constraints in transformer neural networks on the Law School and Student-Mathematics datasets. The transformer models used transform these raw datasets into a richer representation space of natural language processing (NLP) while solving fairness classification. We employ fairness metrics for evaluation and examine the trade-off between fairness and accuracy. We report F1, SPD, EOD, and accuracy for different architectures from the transformer model class.
    Global Self-Attention as a Replacement for Graph Convolution. (arXiv:2108.03348v3 [cs.LG] UPDATED)
    We propose an extension to the transformer neural network architecture for general-purpose graph learning by adding a dedicated pathway for pairwise structural information, called edge channels. The resultant framework - which we call Edge-augmented Graph Transformer (EGT) - can directly accept, process and output structural information of arbitrary form, which is important for effective learning on graph-structured data. Our model exclusively uses global self-attention as an aggregation mechanism rather than static localized convolutional aggregation. This allows for unconstrained long-range dynamic interactions between nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. We verify the performance of EGT in a wide range of graph-learning experiments on benchmark datasets, in which it outperforms Convolutional/Message-Passing Graph Neural Networks. EGT sets a new state-of-the-art for the quantum-chemical regression task on the OGB-LSC PCQM4Mv2 dataset containing 3.8 million molecular graphs. Our findings indicate that global self-attention based aggregation can serve as a flexible, adaptive and effective replacement of graph convolution for general-purpose graph learning. Therefore, convolutional local neighborhood aggregation is not an essential inductive bias.
    Learning a Restricted Boltzmann Machine using biased Monte Carlo sampling. (arXiv:2206.01310v1 [cs.LG])
    Restricted Boltzmann Machines are simple and powerful generative models capable of encoding any complex dataset. Despite all their advantages, in practice, training is often unstable, and it is hard to assess model quality because dynamics are hampered by extremely slow time-dependencies. This situation becomes critical when dealing with low-dimensional clustered datasets, where the time needed to sample the trained models ergodically becomes computationally prohibitive. In this work, we show that this divergence of Monte Carlo mixing times is related to a phase coexistence phenomenon, similar to that encountered in Physics in the vicinity of a first order phase transition. We show that sampling the equilibrium distribution via Markov Chain Monte Carlo can be dramatically accelerated using biased sampling techniques, in particular, the Tethered Monte Carlo method (TMC). This sampling technique efficiently solves the problem of evaluating the quality of a given trained model and the generation of new samples in reasonable times. In addition, we show that this sampling technique can be exploited to improve the computation of the log-likelihood gradient during training too, which produces dramatic improvements when training RBMs with artificial clustered datasets. When dealing with real low-dimensional datasets, this new training procedure fits RBM models with significantly faster relaxational dynamics than those obtained with standard PCD recipes. We also show that TMC sampling can be used to recover the free-energy profile of the RBM, which turns out to be extremely useful to compute the probability distribution of a given model and to improve the generation of new decorrelated samples on slow PCD trained models.
    Sample-Efficient Reinforcement Learning of Partially Observable Markov Games. (arXiv:2206.01315v1 [cs.LG])
    This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
    The geometry of integration in text classification RNNs. (arXiv:2010.15114v2 [cs.LG] UPDATED)
    Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.
    Distributional Reinforcement Learning with Unconstrained Monotonic Neural Networks. (arXiv:2106.03228v2 [cs.LG] UPDATED)
    The distributional reinforcement learning (RL) approach advocates for representing the complete probability distribution of the random return instead of only modelling its expectation. A distributional RL algorithm may be characterised by two main components, namely the representation of the distribution together with its parameterisation and the probability metric defining the loss. The present research work considers the unconstrained monotonic neural network (UMNN) architecture, a universal approximator of continuous monotonic functions which is particularly well suited for modelling different representations of a distribution (PDF, CDF, QF). This property enables the efficient decoupling of the effect of the function approximator class from that of the probability metric. The research paper firstly introduces a methodology for learning different representations of the random return distribution. Secondly, a novel distributional RL algorithm named unconstrained monotonic deep Q-network (UMDQN) is presented. Lastly, in light of this new algorithm, an empirical comparison is performed between three probability quasimetrics, namely the Kullback-Leibler divergence, Cramer distance, and Wasserstein distance. The results highlight the main strengths and weaknesses associated with each probability metric together with an important limitation of the Wasserstein distance. This research concludes by calling for a reconsideration of all probability metrics in distributional RL, contrasting with the clear dominance of the Wasserstein distance in recent publications.
    SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG. (arXiv:2206.01323v1 [cs.LG])
    Electroencephalography (EEG) provides access to neuronal dynamics non-invasively with millisecond resolution, rendering it a viable method in neuroscience and healthcare. However, its utility is limited as current EEG technology does not generalize well across domains (i.e., sessions and subjects) without expensive supervised re-calibration. Contemporary methods cast this transfer learning (TL) problem as a multi-source/-target unsupervised domain adaptation (UDA) problem and address it with deep learning or shallow, Riemannian geometry aware alignment methods. Both directions have, so far, failed to consistently close the performance gap to state-of-the-art domain-specific methods based on tangent space mapping (TSM) on the symmetric positive definite (SPD) manifold. Here, we propose a theory-based machine learning framework that enables, for the first time, learning domain-invariant TSM models in an end-to-end fashion. To achieve this, we propose a new building block for geometric deep learning, which we denote SPD domain-specific momentum batch normalization (SPDDSMBN). A SPDDSMBN layer can transform domain-specific SPD inputs into domain-invariant SPD outputs, and can be readily applied to multi-source/-target and online UDA scenarios. In extensive experiments with 6 diverse EEG brain-computer interface (BCI) datasets, we obtain state-of-the-art performance in inter-session and -subject TL with a simple, intrinsically interpretable network architecture, which we denote TSMNet.
    Hybrid Models for Mixed Variables in Bayesian Optimization. (arXiv:2206.01409v1 [cs.LG])
    We systematically describe the problem of simultaneous surrogate modeling of mixed variables (i.e., continuous, integer and categorical variables) in the Bayesian optimization (BO) context. We provide a unified hybrid model using both Monte-Carlo tree search (MCTS) and Gaussian processes (GP) that encompasses and generalizes multiple state-of-the-art mixed BO surrogates. Based on the architecture, we propose applying a new dynamic model selection criterion among novel candidate families of covariance kernels, including non-stationary kernels and associated families. Different benchmark problems are studied and presented to support the superiority of our model, along with results highlighting the effectiveness of our method compared to most state-of-the-art mixed-variable methods in BO.
    Sequential Permutation Testing of Random Forest Variable Importance Measures. (arXiv:2206.01284v1 [stat.ME])
    Hypothesis testing of random forest (RF) variable importance measures (VIMP) remains the subject of ongoing research. Among recent developments, heuristic approaches to parametric testing have been proposed whose distributional assumptions are based on empirical evidence. Other formal tests under regularity conditions were derived analytically. However, these approaches can be computationally expensive or even practically infeasible. This problem also occurs with non-parametric permutation tests, which are, however, distribution-free and can generically be applied to any type of RF and VIMP. Embracing this advantage, it is proposed here to use sequential permutation tests and sequential p-value estimation to reduce the high computational costs associated with conventional permutation tests. The popular and widely used permutation VIMP serves as a practical and relevant application example. The results of simulation studies confirm that the theoretical properties of the sequential tests apply, that is, the type-I error probability is controlled at a nominal level and a high power is maintained with considerably fewer permutations needed in comparison to conventional permutation testing. The numerical stability of the methods is investigated in two additional application studies. In summary, theoretically sound sequential permutation testing of VIMP is possible at greatly reduced computational costs. Recommendations for application are given. A respective implementation is provided through the accompanying R package rfvimptest. The approach can also be easily applied to any kind of prediction model.
    Incremental Learning Meets Transfer Learning: Application to Multi-site Prostate MRI Segmentation. (arXiv:2206.01369v1 [cs.CV])
    Many medical datasets have recently been created for medical image segmentation tasks, and it is natural to question whether we can use them to sequentially train a single model that (1) performs better on all these datasets, and (2) generalizes well and transfers better to the unknown target site domain. Prior works have achieved this goal by jointly training one model on multi-site datasets, which achieves competitive performance on average, but such methods rely on the assumption that all training data are available, thus limiting their effectiveness in practical deployment. In this paper, we propose a novel multi-site segmentation framework called incremental-transfer learning (ITL), which learns a model from multi-site datasets in an end-to-end sequential fashion. Specifically, "incremental" refers to training on sequentially constructed datasets, and "transfer" is achieved by leveraging useful information from the linear combination of embedding features on each dataset. In addition, we introduce our ITL framework, where we train the network including a site-agnostic encoder with pre-trained weights and at most two segmentation decoder heads. We also design a novel site-level incremental loss in order to generalize well on the target domain. Furthermore, we show for the first time that our ITL training scheme is able to alleviate challenging catastrophic forgetting problems in incremental learning. We conduct experiments using five challenging benchmark datasets to validate the effectiveness of our incremental-transfer learning approach. Our approach makes minimal assumptions on computation resources and domain-specific expertise, and hence constitutes a strong starting point in multi-site medical image segmentation.
    GASP, a generalized framework for agglomerative clustering of signed graphs and its application to Instance Segmentation. (arXiv:1906.11713v2 [cs.CV] UPDATED)
    We propose a theoretical framework that generalizes simple and fast algorithms for hierarchical agglomerative clustering to weighted graphs with both attractive and repulsive interactions between the nodes. This framework defines GASP, a Generalized Algorithm for Signed graph Partitioning, and allows us to explore many combinations of different linkage criteria and cannot-link constraints. We prove the equivalence of existing clustering methods to some of those combinations and introduce new algorithms for combinations that have not been studied before. We study both theoretical and empirical properties of these combinations and prove that some of these define an ultrametric on the graph. We conduct a systematic comparison of various instantiations of GASP on a large variety of both synthetic and existing signed clustering problems, in terms of accuracy but also efficiency and robustness to noise. Lastly, we show that some of the algorithms included in our framework, when combined with the predictions from a CNN model, result in a simple bottom-up instance segmentation pipeline. Going all the way from pixels to final segments with a simple procedure, we achieve state-of-the-art accuracy on the CREMI 2016 EM segmentation benchmark without requiring domain-specific superpixels.
    Equivariant Reinforcement Learning for Quadrotor UAV. (arXiv:2206.01233v1 [cs.LG])
    This paper presents an equivariant reinforcement learning framework for quadrotor unmanned aerial vehicles. Successful training of reinforcement learning often requires numerous interactions with the environments, which hinders its applicability especially when the available computational resources are limited, or when there is no reliable simulation model. We identified an equivariance property of the quadrotor dynamics such that the dimension of the state required in the training is reduced by one, thereby improving the sampling efficiency of reinforcement learning substantially. This is illustrated by numerical examples with popular reinforcement learning techniques of TD3 and SAC.
    Continuous Control with Action Quantization from Demonstrations. (arXiv:2110.10149v2 [cs.LG] UPDATED)
    In this paper, we propose a novel Reinforcement Learning (RL) framework for problems with continuous action spaces: Action Quantization from Demonstrations (AQuaDem). The proposed approach consists in learning a discretization of continuous action spaces from human demonstrations. This discretization returns a set of plausible actions (in light of the demonstrations) for each input state, thus capturing the priors of the demonstrator and their multimodal behavior. By discretizing the action space, any discrete action deep RL technique can be readily applied to the continuous control problem. Experiments show that the proposed approach outperforms state-of-the-art methods such as SAC in the RL setup, and GAIL in the Imitation Learning setup. We provide a website with interactive videos: https://google-research.github.io/aquadem/ and make the code available: https://github.com/google-research/google-research/tree/master/aquadem.
    Multiband VAE: Latent Space Alignment for Knowledge Consolidation in Continual Learning. (arXiv:2106.12196v2 [cs.LG] UPDATED)
    We propose a new method for unsupervised generative continual learning through realignment of Variational Autoencoder's latent space. Deep generative models suffer from catastrophic forgetting in the same way as other neural structures. Recent generative continual learning works approach this problem and try to learn from new data without forgetting previous knowledge. However, those methods usually focus on artificial scenarios where examples share almost no similarity between subsequent portions of data - an assumption that is unrealistic in real-life applications of continual learning. In this work, we identify this limitation and posit the goal of generative continual learning as a knowledge accumulation task. We solve it by continuously aligning latent representations of new data, which we call bands, in an additional latent space where examples are encoded independently of their source task. In addition, we introduce a method for controlled forgetting of past data that simplifies this process. On top of the standard continual learning benchmarks, we propose a novel challenging knowledge consolidation scenario and show that the proposed approach outperforms state-of-the-art by up to twofold across all experiments and the additional real-life evaluation. To our knowledge, Multiband VAE is the first method to show forward and backward knowledge transfer in generative continual learning.
    Towards Evading the Limits of Randomized Smoothing: A Theoretical Analysis. (arXiv:2206.01715v1 [cs.LG])
    Randomized smoothing is the dominant standard for provable defenses against adversarial examples. Nevertheless, this method has recently been proven to suffer from important information theoretic limitations. In this paper, we argue that these limitations are not intrinsic, but merely a byproduct of current certification methods. We first show that these certificates use too little information about the classifier, and are in particular blind to the local curvature of the decision boundary. This leads to severely sub-optimal robustness guarantees as the dimension of the problem increases. We then show that it is theoretically possible to bypass this issue by collecting more information about the classifier. More precisely, we show that it is possible to approximate the optimal certificate with arbitrary precision, by probing the decision boundary with several noise distributions. Since this process is executed at certification time rather than at test time, it entails no loss in natural accuracy while enhancing the quality of the certificates. This result fosters further research on classifier-specific certification and demonstrates that randomized smoothing is still worth investigating. Although classifier-specific certification may induce more computational cost, we also provide some theoretical insight on how to mitigate it.
    Scalar is Not Enough: Vectorization-based Unbiased Learning to Rank. (arXiv:2206.01702v1 [cs.IR])
    Unbiased learning to rank (ULTR) aims to train an unbiased ranking model from biased user click logs. Most of the current ULTR methods are based on the examination hypothesis (EH), which assumes that the click probability can be factorized into two scalar functions, one related to ranking features and the other related to bias factors. Unfortunately, the interactions among features, bias factors and clicks are complicated in practice, and usually cannot be factorized in this independent way. Fitting click data with EH could lead to model misspecification and introduce approximation error. In this paper, we propose a vector-based EH and formulate the click probability as a dot product of two vector functions. This solution is complete due to its universality in fitting arbitrary click functions. Based on it, we propose a novel model named Vectorization to adaptively learn the relevance embeddings and sort documents by projecting embeddings onto a base vector. Extensive experiments show that our method significantly outperforms the state-of-the-art ULTR methods on complex real clicks as well as simple simulated clicks.
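    The difference between the scalar and the vector-based examination hypothesis can be made concrete with a small sketch. All function names here are hypothetical, and the ranking step is only one plausible reading of "projecting embeddings onto a base vector"; the paper's actual model may differ.

```python
def scalar_eh_click_prob(relevance, observation):
    """Scalar EH: click probability factorizes into two scalars."""
    return relevance * observation


def vector_eh_click_prob(relevance_vec, observation_vec):
    """Vector-based EH: click probability is a dot product of two vectors."""
    if len(relevance_vec) != len(observation_vec):
        raise ValueError("vectors must have the same dimension")
    return sum(r * o for r, o in zip(relevance_vec, observation_vec))


def rank_by_projection(embeddings, base_vector):
    """Return document indices sorted by descending projection score.

    Illustrative assumption: the score is the plain dot product of each
    relevance embedding with a fixed base vector.
    """
    scores = [vector_eh_click_prob(e, base_vector) for e in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # With one-dimensional vectors the vector form reduces to the scalar EH.
    print(scalar_eh_click_prob(0.5, 0.4), vector_eh_click_prob([0.5], [0.4]))
    print(rank_by_projection([[1.0, 0.0], [0.0, 3.0], [1.0, 1.0]], [1.0, 1.0]))
```

    The one-dimensional case shows that the scalar EH is a special case of the vector form, which is why the vector form can only widen the class of click functions that can be fitted.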
    HEX: Human-in-the-loop Explainability via Deep Reinforcement Learning. (arXiv:2206.01343v1 [cs.LG])
    The use of machine learning (ML) models in decision-making contexts, particularly high-stakes ones, is fraught with issues and peril, since a person, not a machine, must ultimately be held accountable for the consequences of the decisions made using such systems. Machine learning explainability (MLX) promises to provide decision-makers with prediction-specific rationale, assuring them that the model-elicited predictions are made for the right reasons and are thus reliable. Few works explicitly consider this key human-in-the-loop (HITL) component, however. In this work we propose HEX, a human-in-the-loop deep reinforcement learning approach to MLX. HEX incorporates 0-distrust projection to synthesize decider-specific explanation-providing policies from any arbitrary classification model. HEX is also constructed to operate in limited or reduced training data scenarios, such as those employing federated learning. Our formulation explicitly considers the decision boundary of the ML model in question, rather than the underlying training data, which is a shortcoming of many model-agnostic MLX methods. Our proposed methods thus synthesize HITL MLX policies that explicitly capture the decision boundary of the model in question for use in limited data scenarios.
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v2 [cs.LG] UPDATED)
    Although Thompson sampling is widely used in stationary environments, it does not effectively account for nonstationarities. To address this limitation, we propose predictive sampling, a policy that balances exploration and exploitation in nonstationary bandit environments. It is equivalent to Thompson sampling when specialized to stationary environments, but much more effective across a range of nonstationary environments because it deprioritizes investment in acquiring information that will quickly lose relevance. To offer insight into the efficacy of predictive sampling, we establish a regret bound. This bound highlights dependence on the rate at which new information arrives to alter the environment. In addition, we conduct experiments on bandit environments with varying rates of information arrival and observe that predictive sampling outperforms Thompson sampling.
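For reference, the stationary baseline mentioned above can be sketched as follows. This is a minimal Thompson sampling loop for a Bernoulli bandit with Beta(1, 1) priors; the function name, the arm means, and the horizon are assumptions made for illustration. Predictive sampling itself is not specified in the abstract and is not implemented here.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Thompson sampling on a stationary Bernoulli bandit.

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (uniform prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose posterior sample is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(thompson_sampling([0.1, 0.9], horizon=500))
```

Because the posterior counts never forget old observations, this baseline has no mechanism for discounting information that has lost relevance, which is the limitation the abstract points to.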
    PNODE: A memory-efficient neural ODE framework based on high-level adjoint differentiation. (arXiv:2206.01298v1 [cs.LG])
    Neural ordinary differential equations (neural ODEs) have emerged as a novel network architecture that bridges dynamical systems and deep learning. However, the gradient obtained with the continuous adjoint method in the vanilla neural ODE is not reverse-accurate. Other approaches suffer either from excessive memory requirement due to deep computational graphs or from limited choices for the time integration scheme, hampering their application to large-scale complex dynamical systems. To achieve accurate gradients without compromising memory efficiency and flexibility, we present a new neural ODE framework, PNODE, based on high-level discrete adjoint algorithmic differentiation. By leveraging discrete adjoint time integrators and advanced checkpointing strategies tailored for these integrators, PNODE can provide a balance between memory and computational costs, while computing the gradients consistently and accurately. We provide an open-source implementation based on PyTorch and PETSc, one of the most commonly used portable, scalable scientific computing libraries. We demonstrate the performance through extensive numerical experiments on image classification and continuous normalizing flow problems. We show that PNODE achieves the highest memory efficiency when compared with other reverse-accurate methods. On the image classification problems, PNODE is up to two times faster than the vanilla neural ODE and up to 2.3 times faster than the best existing reverse-accurate method. We also show that PNODE enables the use of the implicit time integration methods that are needed for stiff dynamical systems.
    One-Bit Matrix Completion with Differential Privacy. (arXiv:2110.00719v3 [cs.CR] UPDATED)
    As a prevailing collaborative filtering method for recommendation systems, one-bit matrix completion requires data collected by users to provide personalized service. Due to insidious attacks and unexpected inference, the release of users' data often raises serious privacy concerns. To address this issue, differential privacy (DP) has been widely used in standard matrix completion models. To date, however, little has been known about how to apply DP to achieve privacy protection in one-bit matrix completion. In this paper, we propose a unified framework for ensuring a strong privacy guarantee of one-bit matrix completion with DP. In our framework, we develop four different private perturbation mechanisms corresponding to different stages of one-bit matrix completion. For each mechanism, we design a privacy-preserving algorithm and provide a theoretical recovery error bound under the proper conditions. Numerical experiments on synthetic and real-world datasets demonstrate the effectiveness of our proposal. Compared to the one-bit matrix completion without privacy protection, our proposed mechanisms can maintain high-level privacy protection with marginal loss of completion accuracy.
    Understanding Deep Contrastive Learning via Coordinate-wise Optimization. (arXiv:2201.12680v4 [cs.LG] UPDATED)
    We show that Contrastive Learning (CL) under a broad family of loss functions (including InfoNCE) has a unified formulation of coordinate-wise optimization on the network parameter $\boldsymbol{\theta}$ and pairwise importance $\alpha$, where the \emph{max player} $\boldsymbol{\theta}$ learns representation for contrastiveness, and the \emph{min player} $\alpha$ puts more weight on pairs of distinct samples that share similar representations. The resulting formulation, called $\alpha$-CL, not only unifies various existing contrastive losses, which differ by how sample-pair importance $\alpha$ is constructed, but is also able to extrapolate to give novel contrastive losses beyond popular ones, opening a new avenue of contrastive loss design. These novel losses yield comparable (or better) performance on CIFAR10 and STL-10 than classic InfoNCE. Furthermore, we also analyze the max player in detail: we prove that with fixed $\alpha$, the max player is equivalent to Principal Component Analysis (PCA) for deep linear networks, and almost all local minima are global and rank-1, recovering optimal PCA solutions. Finally, we extend our analysis on the max player to 2-layer ReLU networks, showing that its fixed points can have higher ranks.
    Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code. (arXiv:2206.01335v1 [cs.SE])
    Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question of whether they could serve as a basis for building a wide range of code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow one to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par with (test oracle generation), or even outperform (test case generation) their respective traditionally built tools, while requiring far less effort to develop. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input ("prompt") to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.
    KCRL: Krasovskii-Constrained Reinforcement Learning with Guaranteed Stability in Nonlinear Dynamical Systems. (arXiv:2206.01704v1 [cs.LG])
    Learning a dynamical system requires stabilizing the unknown dynamics to avoid state blow-ups. However, current reinforcement learning (RL) methods lack stabilization guarantees, which limits their applicability for the control of safety-critical systems. We propose a model-based RL framework with formal stability guarantees, Krasovskii Constrained RL (KCRL), that adopts Krasovskii's family of Lyapunov functions as a stability constraint. The proposed method learns the system dynamics up to a confidence interval using feature representation, e.g. Random Fourier Features. It then solves a constrained policy optimization problem with a stability constraint based on Krasovskii's method using a primal-dual approach to recover a stabilizing policy. We show that KCRL is guaranteed to learn a stabilizing policy in a finite number of interactions with the underlying unknown system. We also derive the sample complexity upper bound for stabilization of unknown nonlinear dynamical systems via the KCRL framework.
    Differentially Private Multivariate Time Series Forecasting of Aggregated Human Mobility With Deep Learning: Input or Gradient Perturbation?. (arXiv:2205.00436v2 [cs.LG] UPDATED)
    This paper investigates the problem of forecasting multivariate aggregated human mobility while preserving the privacy of the individuals concerned. Differential privacy, a state-of-the-art formal notion, has been used as the privacy guarantee in two different and independent steps when training deep learning models. On one hand, we considered \textit{gradient perturbation}, which uses the differentially private stochastic gradient descent algorithm to guarantee the privacy of each time series sample in the learning stage. On the other hand, we considered \textit{input perturbation}, which adds differential privacy guarantees in each sample of the series before applying any learning. We compared four state-of-the-art recurrent neural networks: Long Short-Term Memory, Gated Recurrent Unit, and their Bidirectional architectures, i.e., Bidirectional-LSTM and Bidirectional-GRU. Extensive experiments were conducted with a real-world multivariate mobility dataset, which we published openly along with this paper. As shown in the results, differentially private deep learning models trained under gradient or input perturbation achieve nearly the same performance as non-private deep learning models, with loss in performance varying between $0.57\%$ and $2.8\%$. The contribution of this paper is significant for those involved in urban planning and decision-making, providing a solution to the human mobility multivariate forecast problem through differentially private deep learning models.
    Impact of the composition of feature extraction and class sampling in medicare fraud detection. (arXiv:2206.01413v1 [cs.LG])
    With healthcare being a critical aspect of life, health insurance has become an important scheme for minimizing medical expenses. Following this, the healthcare industry has seen a significant increase in fraudulent activities owing to increased insurance, and fraud has become a significant contributor to rising medical care expenses, although its impact can be mitigated using fraud detection techniques. To detect fraud, machine learning techniques are used. The "Medicare Part D" insurance claims data released by the Centers for Medicare & Medicaid Services (CMS) of the United States federal government is used in this study to develop a fraud detection system. Employing machine learning algorithms on a class-imbalanced and high-dimensional Medicare dataset is a challenging task. To combat such challenges, the present work aims to perform feature extraction following data sampling, afterward applying various classification algorithms, to get better performance. Feature extraction is a dimensionality reduction approach that converts attributes into linear or non-linear combinations of the actual attributes, generating a smaller and more diversified set of attributes and thus reducing the dimensions. Data sampling is commonly used to address class imbalance, either by increasing the frequency of the minority class or by reducing the frequency of the majority class, to obtain approximately equal numbers of occurrences for both classes. The proposed approach is evaluated through standard performance metrics. Thus, to detect fraud efficiently, this study applies an autoencoder as the feature extraction technique, the synthetic minority oversampling technique (SMOTE) as the data sampling technique, and various gradient boosted decision tree-based classifiers as the classification algorithms. The experimental results show that the combination of autoencoders followed by SMOTE with the LightGBM classifier achieved the best results.
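The oversampling step named above can be sketched in a few lines. This is a simplified, standard-library-only version of the SMOTE interpolation idea (a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors); the function name and the toy data are hypothetical, and the autoencoder and LightGBM stages of the study are not reproduced here.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sqdist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Interpolate at a random position between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for point in smote_oversample(toy_minority, n_new=5, k=2):
        print(point)
```

Every synthetic point lies between two existing minority samples, so the minority class grows without duplicating records exactly.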
    BaCaDI: Bayesian Causal Discovery with Unknown Interventions. (arXiv:2206.01665v1 [cs.LG])
    Learning causal structures from observation and experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, a key challenge is that often the targets of the interventions are uncertain or unknown. Thus, standard causal discovery methods can no longer be used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering the causal structure that underlies data generated under various unknown experimental/interventional conditions. BaCaDI is fully differentiable and operates in the continuous space of latent probabilistic representations of both causal structures and interventions. This enables us to approximate complex posteriors via gradient-based variational inference and to reason about the epistemic uncertainty in the predicted structure. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets. Finally, we demonstrate that, thanks to its rigorous Bayesian approach, our method provides well-calibrated uncertainty estimates.
    Offline Reinforcement Learning with Causal Structured World Models. (arXiv:2206.01474v1 [cs.LG])
    Model-based methods have recently shown promise for offline reinforcement learning (RL), aiming to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn fully connected nets as world-models that map the states and actions to the next-step states. However, it is sensible that a world-model should adhere to the underlying causal effect such that it will support learning an effective policy generalizing well in unseen states. In this paper, we first provide theoretical results that causal world-models can outperform plain world-models for offline RL by incorporating the causal structure into the generalization error bound. We then propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structure (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly. Consequently, it performs better than the plain model-based offline RL algorithms and other causal model-based RL algorithms.
    Towards Accelerating Training of Batch Normalization: A Manifold Perspective. (arXiv:2101.02916v2 [cs.LG] UPDATED)
    Batch normalization (BN) has become a critical component across diverse deep neural networks. A network with BN is invariant to positive linear re-scaling of its weights, so there exist infinitely many functionally equivalent networks with different weight scales. However, optimizing these equivalent networks with a first-order method such as stochastic gradient descent yields a series of iterates converging to different local optima owing to their different gradients across training. To obviate this, we propose a quotient manifold, the \emph{PSI manifold}, in which all the equivalent weights of the network with BN are regarded as the same element. Next, we construct gradient descent and stochastic gradient descent on the proposed PSI manifold to train the network with BN. The two algorithms guarantee that every group of equivalent weights (caused by positive re-scaling) converges to the equivalent optima. Besides that, we give convergence rates of the proposed algorithms on the PSI manifold. The results show that our methods accelerate training compared with the algorithms on the Euclidean weight space. Finally, empirical results verify that our algorithms consistently improve on the existing methods in both convergence rate and generalization ability under various experimental settings.
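The scale invariance that motivates the quotient manifold can be checked numerically. The sketch below uses a single linear unit followed by batch normalization without learnable scale or shift; the helper names and toy values are assumptions for illustration, and the stabilizing epsilon is set to zero so that the invariance is exact up to floating-point error (with a positive epsilon it holds only approximately).

```python
import math


def linear_layer(weights, batch):
    """Pre-activations of one linear unit for every row in the batch."""
    return [sum(w * x for w, x in zip(weights, row)) for row in batch]


def batch_norm(values, eps=0.0):
    """Normalize values to zero mean and unit variance over the batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


if __name__ == "__main__":
    batch = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
    weights = [0.3, -0.7]
    scaled = [5.0 * w for w in weights]  # positive re-scaling of the weights
    print(batch_norm(linear_layer(weights, batch)))
    print(batch_norm(linear_layer(scaled, batch)))  # same up to rounding
```

Multiplying the weights by a positive constant multiplies both the centered pre-activations and their standard deviation by that constant, so the normalized output is unchanged.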
    Indirect Active Learning. (arXiv:2206.01454v1 [math.ST])
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
    On the Privacy Properties of GAN-generated Samples. (arXiv:2206.01349v1 [cs.LG])
    The privacy implications of generative adversarial networks (GANs) are a topic of great interest, leading to several recent algorithms for training GANs with privacy guarantees. By drawing connections to the generalization properties of GANs, we prove that under some assumptions, GAN-generated samples inherently satisfy some (weak) privacy guarantees. First, we show that if a GAN is trained on m samples and used to generate n samples, the generated samples are (epsilon, delta)-differentially-private for (epsilon, delta) pairs where delta scales as O(n/m). We show that under some special conditions, this upper bound is tight. Next, we study the robustness of GAN-generated samples to membership inference attacks. We model membership inference as a hypothesis test in which the adversary must determine whether a given sample was drawn from the training dataset or from the underlying data distribution. We show that this adversary can achieve an area under the ROC curve that scales no better than O(m^{-1/4}).
    Expressiveness and Learnability: A Unifying View for Evaluating Self-Supervised Learning. (arXiv:2206.01251v1 [cs.LG])
    We propose a unifying view to analyze the representation quality of self-supervised learning (SSL) models without access to supervised labels, while being agnostic to the architecture, learning algorithm or data manipulation used during training. We argue that representations can be evaluated through the lens of expressiveness and learnability. We propose to use the Intrinsic Dimension (ID) to assess expressiveness and introduce Cluster Learnability (CL) to assess learnability. CL is measured as the learning speed of a KNN classifier trained to predict labels obtained by clustering the representations with K-means. We thus combine CL and ID into a single predictor: CLID. Through a large-scale empirical study with a diverse family of SSL algorithms, we find that CLID better correlates with in-distribution model performance than other competing recent evaluation schemes. We also benchmark CLID on out-of-domain generalization, where CLID serves as a predictor of the transfer performance of SSL models on several classification tasks, yielding improvements with respect to the competing baselines.
    Optimal Activation Functions for the Random Features Regression Model. (arXiv:2206.01332v1 [stat.ML])
    The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.
    A Learning-Based Method for Automatic Operator Selection in the Fanoos XAI System. (arXiv:2206.01722v1 [cs.LG])
    We describe an extension of the Fanoos XAI system [Bayani et al 2022] which enables the system to learn the appropriate action to take in order to satisfy a user's request for description to be made more or less abstract. Specifically, descriptions of systems under analysis are stored in states, and in order to make a description more or less abstract, Fanoos selects an operator from a large library to apply to the state and generate a new description. Prior work on Fanoos predominately used hand-written methods for operator-selection; this current work allows Fanoos to leverage experience to learn the best operator to apply in a particular situation, balancing exploration and exploitation, leveraging expert insights when available, and utilizing similarity between the current state and past states. Additionally, in order to bootstrap the learning process (i.e., like in curriculum learning), we describe a simulated user which we implemented; this simulation allows Fanoos to gain general insights that enable reasonable courses of action, insights which later can be refined by experience with real users, as opposed to interacting with humans completely from scratch. Code implementing the methods described in the paper can be found at https://github/DBay-ani/Operator_Selection_Learning_Extensions_For_Fanoos.
    Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (arXiv:2104.05097v5 [cs.LG] UPDATED)
    Lipschitz constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they remain commonly considered as less accurate, and their properties in learning are still not fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric which reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties to pave the way toward provably accurate, and provably robust neural networks.
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v1 [stat.ML])
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
    Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning. (arXiv:2206.01690v1 [cs.LG])
    Gradient based meta-learning methods are prone to overfit on the meta-training set, and this behaviour is more prominent with large and complex networks. Moreover, large networks restrict the application of meta-learning models on low-power edge devices. While choosing smaller networks avoids these issues to a certain extent, it affects the overall generalization, leading to reduced performance. Clearly, there is an approximately optimal choice of network architecture that is best suited for every meta-learning problem; however, identifying it beforehand is not straightforward. In this paper, we present MetaDOCK, a task-specific dynamic kernel selection strategy for designing compressed CNN models that generalize well on unseen tasks in meta-learning. Our method is based on the hypothesis that for a given set of similar tasks, not all kernels of the network are needed by each individual task. Rather, each task uses only a fraction of the kernels, and the selection of the kernels per task can be learnt dynamically as a part of the inner update steps. MetaDOCK compresses the meta-model as well as the task-specific inner models, thus providing significant reduction in model size for each task, and through constraining the number of active kernels for every task, it implicitly mitigates the issue of meta-overfitting. We show that for the same inference budget, pruned versions of large CNN models obtained using our approach consistently outperform the conventional choices of CNN models. MetaDOCK couples well with popular meta-learning approaches such as iMAML. The efficacy of our method is validated on CIFAR-fs and mini-ImageNet datasets, and we have observed that our approach can provide improvements in model accuracy of up to 2% on standard meta-learning benchmarks, while reducing the model size by more than 75%.
    On Calibration of Graph Neural Networks for Node Classification. (arXiv:2206.01570v1 [cs.LG])
    Graphs can model real-world, complex systems by representing entities and their interactions in terms of nodes and edges. To better exploit the graph structure, graph neural networks have been developed, which learn entity and edge embeddings for tasks such as node classification and link prediction. These models achieve good performance with respect to accuracy, but the confidence scores associated with the predictions might not be calibrated. That means that the scores might not reflect the ground-truth probabilities of the predicted events, which would be especially important for safety-critical applications. Even though graph neural networks are used for a wide range of tasks, the calibration thereof has not been sufficiently explored yet. We investigate the calibration of graph neural networks for node classification, study the effect of existing post-processing calibration methods, and analyze the influence of model capacity, graph density, and a new loss function on calibration. Further, we propose a topology-aware calibration method that takes the neighboring nodes into account and yields improved calibration compared to baseline methods.
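What it means for confidence scores to be calibrated can be made concrete with a small example. The sketch below computes the expected calibration error (ECE), a widely used calibration metric with equal-width confidence bins; whether the paper reports exactly this metric is an assumption, and the function name and toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Ten predictions made with 90% confidence, only half of them correct.
    confidences = [0.9] * 10
    correct = [True] * 5 + [False] * 5
    print(expected_calibration_error(confidences, correct))
```

A model that is right only half the time while reporting 90% confidence is overconfident, and the metric grows with the size of that gap.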
    Excess risk analysis for epistemic uncertainty with application to variational inference. (arXiv:2206.01606v1 [stat.ML])
    We analyze the epistemic uncertainty (EU) of supervised learning in Bayesian inference by focusing on the excess risk. Existing analysis is limited to the Bayesian setting, which assumes a correct model and exact Bayesian posterior distribution. Thus we cannot apply the existing theory to modern Bayesian algorithms, such as variational inference. To address this, we present a novel EU analysis in the frequentist setting, where data is generated from an unknown distribution. We show a relation between the generalization ability and the widely used EU measurements, such as the variance and entropy of the predictive distribution. Then we show their convergence behaviors theoretically. Finally, we propose new variational inference that directly controls the prediction and EU evaluation performances based on the PAC-Bayesian theory. Numerical experiments show that our algorithm significantly improves the EU evaluation over the existing methods.
    PROMISSING: Pruning Missing Values in Neural Networks. (arXiv:2206.01640v1 [cs.LG])
    While data are the primary fuel for machine learning models, they often suffer from missing values, especially when collected in real-world scenarios. However, many off-the-shelf machine learning models, including artificial neural network models, are unable to handle these missing values directly. Therefore, extra data preprocessing and curation steps, such as data imputation, are inevitable before learning and prediction processes. In this study, we propose a simple and intuitive yet effective method for pruning missing values (PROMISSING) during learning and inference steps in neural networks. In this method, there is no need to remove or impute the missing values; instead, the missing values are treated as a new source of information (representing what we do not know). Our experiments on simulated data, several classification and regression benchmarks, and a multi-modal clinical dataset show that PROMISSING results in similar prediction performance compared to various imputation techniques. In addition, our experiments show that models trained using PROMISSING techniques become less decisive in their predictions when facing incomplete samples with many unknowns. This finding hopefully advances machine learning models from being pure predicting machines to more realistic thinkers that can also say "I do not know" when facing incomplete sources of information.
    Rate-Optimal Online Convex Optimization in Adaptive Linear Control. (arXiv:2206.01426v1 [cs.LG])
    We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs and full feedback of both the state and cost function. We present the first computationally-efficient algorithm that attains an optimal $\smash{\sqrt{T}}$-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs such as strong convexity. Our approach is based on a careful design of non-convex lower confidence bounds for the online costs, and uses a novel technique for computationally-efficient regret minimization of these bounds that leverages their particular non-convex structure.
    Lossy Gradient Compression: How Much Accuracy Can One Bit Buy?. (arXiv:2202.02812v2 [cs.LG] UPDATED)
    In federated learning (FL), a global model is trained at a Parameter Server (PS) by aggregating model updates obtained from multiple remote learners. Generally, the communication between the remote users and the PS is rate-limited, while the transmission from the PS to the remote users is unconstrained. The FL setting gives rise to the distributed learning scenario in which the updates from the remote learners have to be compressed so as to meet communication rate constraints in the uplink transmission toward the PS. For this problem, one wishes to compress the model updates so as to minimize the loss in accuracy resulting from the compression error. In this paper, we take a rate-distortion approach to address the compressor design problem for the distributed training of deep neural networks (DNNs). In particular, we define a measure of the compression performance under communication-rate constraints -- the per-bit accuracy -- which addresses the ultimate improvement of accuracy that a bit of communication brings to the centralized model. In order to maximize the per-bit accuracy, we consider modeling the DNN gradient updates at remote learners as a generalized normal distribution. Under this assumption on the DNN gradient distribution, we propose a class of distortion measures to aid the design of quantizers for the compression of the model updates. We argue that this family of distortion measures, which we refer to as the "$M$-magnitude weighted $L_2$" norm, captures the practitioner's intuition in the choice of gradient compressor. Numerical simulations are provided to validate the proposed approach for the CIFAR-10 dataset.
    Functional Connectivity Methods for EEG-based Biometrics on a Large, Heterogeneous Dataset. (arXiv:2206.01475v1 [eess.SP])
    This study examines the utility of functional connectivity (FC) and graph-based (GB) measures with a support vector machine classifier for use in electroencephalogram (EEG) based biometrics. Although FC-based features have been used in biometric applications, studies assessing the identification algorithms on heterogeneous and large datasets are scarce. This work investigates the performance of FC and GB metrics on a dataset of 184 subjects formed by pooling three datasets recorded under different protocols and acquisition systems. The results demonstrate the higher discriminatory power of FC than GB metrics. The identification accuracy increases with higher frequency EEG bands, indicating the enhanced uniqueness of the neural signatures in beta and gamma bands. Using all the 56 EEG channels common to the three databases, the best identification accuracy of 97.4% is obtained using phase-locking value (PLV) based measures extracted from the gamma frequency band. Further, we investigate the effect of the length of the analysis epoch to determine the data acquisition time required to obtain satisfactory identification accuracy. When the number of channels is reduced to 21 from 56, there is a marginal reduction of 2.4% only in the identification accuracy using PLV features in the gamma band. Additional experiments have been conducted to study the effect of the cognitive state of the subject and mismatched train/test conditions on the performance of the system.
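    The phase-locking value mentioned above is commonly defined as the magnitude of the time-averaged unit phasor of the phase difference between two channels. The sketch below illustrates that standard definition only; it is not taken from the paper, the function name is hypothetical, and the instantaneous phases are assumed to have been extracted already (for instance from band-passed signals).

```python
import cmath
import math


def phase_locking_value(phases_a, phases_b):
    """Return |mean(exp(i * (phi_a - phi_b)))| over paired phase samples (radians)."""
    if len(phases_a) != len(phases_b) or len(phases_a) == 0:
        raise ValueError("phase sequences must be non-empty and of equal length")
    total = sum(cmath.exp(1j * (pa - pb)) for pa, pb in zip(phases_a, phases_b))
    return abs(total) / len(phases_a)


# Toy phases (assumed values, not EEG data).
n = 200
locked_a = [2 * math.pi * 0.05 * t for t in range(n)]
locked_b = [p - 0.7 for p in locked_a]                   # constant phase lag
drifting_b = [2 * math.pi * 0.06 * t for t in range(n)]  # different frequency

plv_locked = phase_locking_value(locked_a, locked_b)      # close to 1
plv_drifting = phase_locking_value(locked_a, drifting_b)  # close to 0
```

    A constant lag gives a value near 1, while a phase difference that sweeps whole cycles averages out to a value near 0.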
    Prescriptive maintenance with causal machine learning. (arXiv:2206.01562v1 [econ.GN])
    Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on strong assumptions regarding the effect of maintenance on the machine's condition, assuming the effect is (1) deterministic or governed by a known probability distribution, and (2) machine-independent. This work proposes to relax both assumptions by learning the effect of maintenance conditional on a machine's characteristics from observational data on similar machines using existing methodologies for causal inference. By predicting the maintenance effect, we can estimate the number of overhauls and failures for different levels of maintenance and, consequently, optimize the preventive maintenance frequency to minimize the total estimated cost. We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner. Empirical results show that our novel, causal approach accurately predicts the maintenance effect and results in individualized maintenance schedules that are more accurate and cost-effective than supervised or non-individualized approaches.  ( 2 min )
    Algorithm for Constrained Markov Decision Process with Linear Convergence. (arXiv:2206.01666v1 [math.OC])
    The problem of constrained Markov decision process is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed with the integration of two ingredients: entropy regularized policy optimizer and Vaidya's dual optimizer, both of which are critical to achieve faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity expressed in terms of the optimality gap and the constraint violation significantly improves upon the existing primal-dual approaches.  ( 2 min )
    Measuring Gender Bias in Word Embeddings of Gendered Languages Requires Disentangling Grammatical Gender Signals. (arXiv:2206.01691v1 [cs.CY])
    Does the grammatical gender of a language interfere when measuring the semantic gender information captured by its word embeddings? A number of anomalous gender bias measurements in the embeddings of gendered languages suggest this possibility. We demonstrate that word embeddings learn the association between a noun and its grammatical gender in grammatically gendered languages, which can skew social gender bias measurements. Consequently, word embedding post-processing methods are introduced to quantify, disentangle, and evaluate grammatical gender signals. The evaluation is performed on five gendered languages from the Germanic, Romance, and Slavic branches of the Indo-European language family. Our method reduces the strength of grammatical gender signals, which is measured in terms of effect size (Cohen's d), by a significant average of d = 1.3 for French, German, and Italian, and d = 0.56 for Polish and Spanish. Once grammatical gender is disentangled, the association between over 90% of 10,000 inanimate nouns and their assigned grammatical gender weakens, and cross-lingual bias results from the Word Embedding Association Test (WEAT) become more congruent with country-level implicit bias measurements. The results further suggest that disentangling grammatical gender signals from word embeddings may lead to improvement in semantic machine learning tasks.  ( 2 min )
    Joint Energy Dispatch and Unit Commitment in Microgrids Based on Deep Reinforcement Learning. (arXiv:2206.01663v1 [cs.LG])
    Nowadays, the application of microgrids (MG) with renewable energy is becoming more and more extensive, which creates a strong need for dynamic energy management. In this paper, deep reinforcement learning (DRL) is applied to learn an optimal policy for making joint energy dispatch (ED) and unit commitment (UC) decisions in an isolated MG, with the aim of reducing the total power generation cost on the premise of ensuring the supply-demand balance. In order to overcome the challenge of discrete-continuous hybrid action space due to joint ED and UC, we propose a DRL algorithm, i.e., the hybrid action finite-horizon DDPG (HAFH-DDPG), that seamlessly integrates two classical DRL algorithms, i.e., deep Q-network (DQN) and deep deterministic policy gradient (DDPG), based on a finite-horizon dynamic programming (DP) framework. Moreover, a diesel generator (DG) selection strategy is presented to support a simplified action space for reducing the computational complexity of this algorithm. Finally, the effectiveness of our proposed algorithm is verified through comparison with several baseline algorithms by experiments with a real-world data set.  ( 2 min )
    Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games. (arXiv:2206.01588v1 [cs.LG])
    We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result \citep{liu2022learning}, we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.  ( 2 min )
    OmniXAI: A Library for Explainable AI. (arXiv:2206.01612v1 [cs.LG])
    We introduce OmniXAI, an open-source Python library of eXplainable AI (XAI), which offers omni-way explainable AI capabilities and various interpretable machine learning techniques to address the pain points of understanding and interpreting the decisions made by machine learning (ML) in practice. OmniXAI aims to be a one-stop comprehensive library that makes explainable AI easy for data scientists, ML researchers and practitioners who need explanation for various types of data, models and explanation methods at different stages of the ML process (data exploration, feature engineering, model development, evaluation, and decision-making, etc). In particular, our library includes a rich family of explanation methods integrated in a unified interface, which supports multiple data types (tabular data, images, texts, time-series), multiple types of ML models (traditional ML in Scikit-learn and deep learning models in PyTorch/TensorFlow), and a range of diverse explanation methods including "model-specific" and "model-agnostic" ones (such as feature-attribution explanation, counterfactual explanation, gradient-based explanation, etc). For practitioners, the library provides an easy-to-use unified interface to generate the explanations for their applications by only writing a few lines of code, and also a GUI dashboard for visualization of different explanations for more insights about decisions. In this technical report, we present OmniXAI's design principles, system architectures, and major functionalities, and also demonstrate several example use cases across different types of data, tasks, and models.  ( 2 min )
    MCD: Marginal Contrastive Discrimination for conditional density estimation. (arXiv:2206.01592v1 [stat.ML])
    We consider the problem of conditional density estimation, which is a major topic of interest in the fields of statistical and machine learning. Our method, called Marginal Contrastive Discrimination (MCD), reformulates the conditional density function into two factors: the marginal density function of the target variable and a ratio of density functions which can be estimated through binary classification. Like noise-contrastive methods, MCD can leverage state-of-the-art supervised learning techniques, including neural networks, to perform conditional density estimation. Our benchmark reveals that our method significantly outperforms existing methods in practice on most density models and regression datasets.  ( 2 min )
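    The two-factor idea can be illustrated with the identity p(y|x) = p(y) * p(x,y) / (p(x) p(y)), where the ratio is recovered from a balanced binary classifier as q / (1 - q). The sketch below is an assumption-laden illustration, not the paper's procedure: it uses a bivariate Gaussian toy model and an analytic "oracle" classifier in place of a trained one, and all function names are hypothetical.

```python
import math

RHO = 0.6  # assumed correlation of the toy bivariate standard normal


def normal_pdf(z, mean=0.0, var=1.0):
    return math.exp(-((z - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def joint_pdf(x, y):
    det = 1 - RHO ** 2
    quad = (x * x - 2 * RHO * x * y + y * y) / det
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))


def oracle_classifier(x, y):
    """Probability that (x, y) came from the joint rather than from the
    product of marginals, assuming balanced classes (stand-in for a trained model)."""
    p_joint = joint_pdf(x, y)
    p_product = normal_pdf(x) * normal_pdf(y)
    return p_joint / (p_joint + p_product)


def conditional_density(x, y, classifier, marginal_y):
    q = classifier(x, y)
    ratio = q / (1 - q)          # density ratio p(x,y) / (p(x) p(y))
    return marginal_y(y) * ratio  # marginal factor times ratio factor


estimate = conditional_density(0.5, 1.0, oracle_classifier, normal_pdf)
exact = normal_pdf(1.0, mean=RHO * 0.5, var=1 - RHO ** 2)  # y | x ~ N(rho*x, 1 - rho^2)
```

    With the oracle classifier the estimate matches the closed-form conditional density; with a learned classifier the quality of the estimate would depend on how well it approximates that oracle.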
    A Comparative Study on Energy Consumption Models for Drones. (arXiv:2206.01609v1 [cs.RO])
    Creating an appropriate energy consumption prediction model is becoming an important topic for drone-related research in the literature. However, a general consensus on the energy consumption model is yet to be reached at present. As a result, there are many variations that attempt to create models that range in complexity with a focus on different aspects. In this paper, we benchmark the five most popular energy consumption models for drones derived from their physical behaviours and point to the difficulties in matching with a realistic energy dataset collected from a delivery drone in flight under different testing conditions. Moreover, we propose a novel data-driven energy model using the Long Short-Term Memory (LSTM) based deep learning architecture and the accuracy is compared based on the dataset. Our experimental results have shown that the LSTM based approach can easily outperform other mathematical models for the dataset under study. Finally, sensitivity analysis has been carried out in order to interpret the model.  ( 2 min )
    Pruning for Interpretable, Feature-Preserving Circuits in CNNs. (arXiv:2206.01627v1 [cs.CV])
    Deep convolutional neural networks are a powerful model class for a range of computer vision problems, but it is difficult to interpret the image filtering process they implement, given their sheer size. In this work, we introduce a method for extracting 'feature-preserving circuits' from deep CNNs, leveraging methods from saliency-based neural network pruning. These circuits are modular sub-functions, embedded within the network, containing only a subset of convolutional kernels relevant to a target feature. We compare the efficacy of 3 saliency-criteria for extracting these sparse circuits. Further, we show how 'sub-feature' circuits can be extracted, that preserve a feature's responses to particular images, dividing the feature into even sparser filtering processes. We also develop a tool for visualizing 'circuit diagrams', which render the entire image filtering process implemented by circuits in a parsable format.  ( 2 min )
    Truly Mesh-free Physics-Informed Neural Networks. (arXiv:2206.01545v1 [cs.LG])
    Physics-informed Neural Networks (PINNs) have recently emerged as a principled way to include prior physical knowledge in the form of partial differential equations (PDEs) into neural networks. Although generally viewed as being mesh-free, current approaches still rely on collocation points obtained within a bounded region, even in settings with spatially sparse signals. Furthermore, if the boundaries are not known, the selection of such a region may be arbitrary, resulting in a large proportion of collocation points being selected in areas of low relevance. To resolve this, we present a mesh-free and adaptive approach termed particle-density PINN (pdPINN), which is inspired by the microscopic viewpoint of fluid dynamics. Instead of sampling from a bounded region, we propose to sample directly from the distribution over the (fluid) particle positions, eliminating the need to introduce boundaries while adaptively focusing on the most relevant regions. This is achieved by reformulating the modeled fluid density as an unnormalized probability distribution from which we sample with dynamic Monte Carlo methods. We further generalize pdPINNs to different settings that allow interpreting a positive scalar quantity as a particle density, such as the evolution of the temperature in the heat equation. The utility of our approach is demonstrated on experiments for modeling (non-steady) compressible fluids in up to three dimensions and a two-dimensional diffusion problem, illustrating the high flexibility and sample efficiency compared to existing refinement methods for PINNs.  ( 2 min )
    Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules. (arXiv:2206.01649v1 [cs.LG])
    Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are typically expressed as additive iterative update processes which have straightforward ODE counterparts. Here we introduce a novel combination of learning rules and Neural ODEs to build continuous-time sequence processing nets that learn to manipulate short-term memory in rapidly changing synaptic connections of other nets. This yields continuous-time counterparts of Fast Weight Programmers and linear Transformers. Our novel models outperform the best existing Neural Controlled Differential Equation based models on various time series classification tasks, while also addressing their scalability limitations. Our code is public.  ( 2 min )
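    The link between additive learning rules and ODEs can be made concrete with Oja's rule: the discrete update w <- w + lr * y * (x - y * w), with y = w . x, is a forward-Euler step of dw/dt = y * (x - y * w). The sketch below is a generic textbook illustration on assumed toy data, not code from the paper, and the function name is hypothetical.

```python
import math


def oja_step(w, x, lr):
    """One Oja update: a forward-Euler step of dw/dt = y * (x - y * w)."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


# Toy zero-mean data whose dominant variance lies along the (1, 1) direction.
data = [(2.0, 2.0), (-2.0, -2.0), (0.5, -0.5), (-0.5, 0.5)]

w = [1.0, 0.0]
for _ in range(500):
    for x in data:
        w = oja_step(w, x, lr=0.01)

norm = math.sqrt(sum(wi * wi for wi in w))
```

    On this data the weight vector settles on the unit-norm principal direction (1, 1) / sqrt(2), which is the connection to principal component analysis that the abstract refers to.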
    Beyond Tabula Rasa: Reincarnating Reinforcement Learning. (arXiv:2206.01626v1 [cs.LG])
    Learning tabula rasa, that is without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally-demanding problems. To address these issues, we present reincarnating RL as an alternative workflow, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further.  ( 2 min )
    Beyond Opinion Mining: Summarizing Opinions of Customer Reviews. (arXiv:2206.01543v1 [cs.CL])
    Customer reviews are vital for making purchasing decisions in the Information Age. Such reviews can be automatically summarized to provide the user with an overview of opinions. In this tutorial, we present various aspects of opinion summarization that are useful for researchers and practitioners. First, we will introduce the task and major challenges. Then, we will present existing opinion summarization solutions, both pre-neural and neural. We will discuss how summarizers can be trained in the unsupervised, few-shot, and supervised regimes. Each regime has roots in different machine learning methods, such as auto-encoding, controllable text generation, and variational inference. Finally, we will discuss resources and evaluation methods and conclude with future directions. This three-hour tutorial will provide a comprehensive overview of major advances in opinion summarization. The listeners will be well-equipped with knowledge that is useful for both research and practical applications.  ( 2 min )
    RACA: Relation-Aware Credit Assignment for Ad-Hoc Cooperation in Multi-Agent Deep Reinforcement Learning. (arXiv:2206.01207v1 [cs.LG])
    In recent years, reinforcement learning has faced several challenges in the multi-agent domain, such as the credit assignment issue. Value function factorization emerges as a promising way to handle the credit assignment issue under the centralized training with decentralized execution (CTDE) paradigm. However, existing value function factorization methods cannot deal with ad-hoc cooperation, that is, adapting to new configurations of teammates at test time. Specifically, these methods do not explicitly utilize the relationship between agents and cannot adapt to different sizes of inputs. To address these limitations, we propose a novel method, called Relation-Aware Credit Assignment (RACA), which achieves zero-shot generalization in ad-hoc cooperation scenarios. RACA takes advantage of a graph-based relation encoder to encode the topological structure between agents. Furthermore, RACA utilizes an attention-based observation abstraction mechanism that can generalize to an arbitrary number of teammates with a fixed number of parameters. Experiments demonstrate that our method outperforms baseline methods on the StarCraftII micromanagement benchmark and ad-hoc cooperation scenarios.  ( 2 min )
    Accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient. (arXiv:2206.01209v1 [math.OC])
    In this paper we develop accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient (LLCG), which is beyond the well-studied class of convex optimization with Lipschitz continuous gradient. In particular, we first consider unconstrained convex optimization with LLCG and propose accelerated proximal gradient (APG) methods for solving it. The proposed APG methods are equipped with a verifiable termination criterion and enjoy an operation complexity of ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ and ${\cal O}(\log \varepsilon^{-1})$ for finding an $\varepsilon$-residual solution of an unconstrained convex and strongly convex optimization problem, respectively. We then consider constrained convex optimization with LLCG and propose a first-order proximal augmented Lagrangian method for solving it by applying one of our proposed APG methods to approximately solve a sequence of proximal augmented Lagrangian subproblems. The resulting method is equipped with a verifiable termination criterion and enjoys an operation complexity of ${\cal O}(\varepsilon^{-1}\log \varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ for finding an $\varepsilon$-KKT solution of a constrained convex and strongly convex optimization problem, respectively. All the proposed methods in this paper are parameter-free or almost parameter-free, except that knowledge of the convexity parameter is required. To the best of our knowledge, no prior studies were conducted to investigate accelerated first-order methods with complexity guarantees for convex optimization with LLCG. All the complexity results obtained in this paper are entirely new.  ( 2 min )
    Positive Unlabeled Contrastive Learning. (arXiv:2206.01206v1 [cs.LG])
    Self-supervised pretraining on unlabeled data followed by supervised finetuning on labeled data is a popular paradigm for learning from limited labeled examples. In this paper, we investigate and extend this paradigm to the classical positive unlabeled (PU) setting - the weakly supervised task of learning a binary classifier only using a few labeled positive examples and a set of unlabeled samples. We propose a novel PU learning objective positive unlabeled Noise Contrastive Estimation (puNCE) that leverages the available explicit (from labeled samples) and implicit (from unlabeled samples) supervision to learn useful representations from positive unlabeled input data. The underlying idea is to assign each training sample an individual weight; labeled positives are given unit weight; unlabeled samples are duplicated, one copy is labeled positive and the other as negative with weights $\pi$ and $(1-\pi)$ where $\pi$ denotes the class prior. Extensive experiments across vision and natural language tasks reveal that puNCE consistently improves over existing unsupervised and supervised contrastive baselines under limited supervision.  ( 2 min )
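    The per-sample weighting described above can be sketched as follows. This only illustrates how the weighted training set is assembled (unit weight for labeled positives, and two weighted copies of each unlabeled sample); it does not implement the contrastive objective itself, and the function name and tuple layout are hypothetical.

```python
def build_weighted_samples(labeled_positives, unlabeled, class_prior):
    """Return (sample, label, weight) triples following the weighting scheme
    described in the abstract; class_prior plays the role of pi."""
    if not 0.0 <= class_prior <= 1.0:
        raise ValueError("class_prior must lie in [0, 1]")
    # Labeled positives keep unit weight.
    samples = [(x, 1, 1.0) for x in labeled_positives]
    # Each unlabeled sample is duplicated: a positive copy weighted pi
    # and a negative copy weighted 1 - pi.
    for x in unlabeled:
        samples.append((x, 1, class_prior))
        samples.append((x, 0, 1.0 - class_prior))
    return samples


weighted = build_weighted_samples(["p1", "p2"], ["u1", "u2", "u3"], class_prior=0.3)
total_weight = sum(weight for _, _, weight in weighted)
```

    Because the two copies of each unlabeled sample have weights summing to one, every original sample contributes a total weight of one regardless of the class prior.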
    Is an encoder within reach?. (arXiv:2206.01552v1 [cs.LG])
    The encoder network of an autoencoder is an approximation of the nearest point projection onto the manifold spanned by the decoder. A concern with this approximation is that, while the output of the encoder is always unique, the projection can possibly have infinitely many values. This implies that the latent representations learned by the autoencoder can be misleading. Borrowing from geometric measure theory, we introduce the idea of using the reach of the manifold spanned by the decoder to determine if an optimal encoder exists for a given dataset and decoder. We develop a local generalization of this reach and propose a numerical estimator thereof. We demonstrate that this allows us to determine which observations can be expected to have a unique, and thereby trustworthy, latent representation. As our local reach estimator is differentiable, we investigate its usage as a regularizer and show that this leads to learned manifolds for which projections are more often unique than without regularization.  ( 2 min )
    Detecting the Severity of Major Depressive Disorder from Speech: A Novel HARD-Training Methodology. (arXiv:2206.01542v1 [cs.SD])
    Major Depressive Disorder (MDD) is a common worldwide mental health issue with high associated socioeconomic costs. The prediction and automatic detection of MDD can, therefore, make a huge impact on society. Speech, as a non-invasive, easy to collect signal, is a promising marker to aid the diagnosis and assessment of MDD. In this regard, speech samples were collected as part of the Remote Assessment of Disease and Relapse in Major Depressive Disorder (RADAR-MDD) research programme. RADAR-MDD was an observational cohort study in which speech and other digital biomarkers were collected from a cohort of individuals with a history of MDD in Spain, the United Kingdom and the Netherlands. In this paper, the RADAR-MDD speech corpus was taken as an experimental framework to test the efficacy of a Sequence-to-Sequence model with a local attention mechanism in a two-class depression severity classification paradigm. Additionally, a novel training method, HARD-Training, is proposed. It is a methodology based on the selection of more ambiguous samples for the model training, and inspired by the curriculum learning paradigm. HARD-Training was found to consistently improve - with an average increment of 8.6% - the performance of our classifiers for both of the two speech elicitation tasks used and each collection site of the RADAR-MDD speech corpus. With this novel methodology, our Sequence-to-Sequence model was able to effectively detect MDD severity regardless of language. Finally, recognising the need for greater awareness of potential algorithmic bias, we conduct an additional analysis of our results separately for each gender.  ( 2 min )
    A High-Performance Customer Churn Prediction System based on Self-Attention. (arXiv:2206.01523v1 [cs.LG])
    Customer churn prediction is a challenging domain of research that contributes to customer retention strategy. The predictive performance of existing machine learning models, which are often adopted by churn communities, appears to be at a bottleneck, partly due to the models' poor feature extraction capability. Therefore, a novel algorithm, a hybrid neural network with self-attention enhancement (HNNSAE), is proposed in this paper to improve the efficiency of feature screening and feature extraction, consequently improving the model's predictive performance. This model consists of three main blocks. The first block is the entity embedding layer, which is employed to process the categorical variables transformed into 0-1 code. The second block is the feature extractor, which extracts the significant features through the multi-head self-attention mechanism. In addition, to improve the feature extraction effect, we stack the residual connection neural network on multi-head self-attention modules. The third block is a classifier, which is a three-layer multilayer perceptron. This work conducts experiments on a publicly available dataset related to commercial bank customers. The results demonstrate that HNNSAE significantly outperforms the other Individual Machine Learning (IML), Ensemble Machine Learning (EML), and Deep Learning (DL) methods tested in this paper. Furthermore, we compare the performance of the feature extractor proposed in this paper with that of three other feature extractors and find that the method proposed in this paper significantly outperforms other methods. In addition, four hypotheses about model prediction performance and overfitting risk are tested on the publicly available dataset.  ( 2 min )
    Rethinking and Scaling Up Graph Contrastive Learning: An Extremely Efficient Approach with Group Discrimination. (arXiv:2206.01535v1 [cs.LG])
    Graph contrastive learning (GCL) alleviates the heavy reliance on label information for graph representation learning (GRL) via self-supervised learning schemes. The core idea is to learn by maximising mutual information for similar instances, which requires similarity computation between two node instances. However, this operation can be computationally expensive. For example, the time complexity of two commonly adopted contrastive loss functions (i.e., InfoNCE and JSD estimator) for a node is $O(ND)$ and $O(D)$, respectively, where $N$ is the number of nodes, and $D$ is the embedding dimension. Additionally, GCL normally requires a large number of training epochs to be well-trained on large-scale datasets. Inspired by an observation of a technical defect (i.e., inappropriate usage of the Sigmoid function) commonly found in two representative GCL works, DGI and MVGRL, we revisit GCL and introduce a new learning paradigm for self-supervised GRL, namely, Group Discrimination (GD), and propose a novel GD-based method called Graph Group Discrimination (GGD). Instead of similarity computation, GGD directly discriminates two groups of summarised node instances with a simple binary cross-entropy loss. As such, GGD only requires $O(1)$ for loss computation of a node. In addition, GGD requires far fewer training epochs to obtain competitive performance compared with GCL methods on large-scale datasets. These two advantages make GGD very efficient. Extensive experiments show that GGD outperforms state-of-the-art self-supervised methods on 8 datasets. In particular, GGD can be trained in 0.18 seconds (6.44 seconds including data preprocessing) on ogbn-arxiv, which is orders of magnitude (10,000+) faster than GCL baselines while consuming much less memory. Trained for 9 hours on ogbn-papers100M, a graph with billion-scale edges, GGD outperforms its GCL counterparts in both accuracy and efficiency.  ( 2 min )
    Understanding deep learning via decision boundary. (arXiv:2206.01515v1 [cs.LG])
    This paper discovers that the neural network with lower decision boundary (DB) variability has better generalizability. Two new notions, algorithm DB variability and $(\epsilon, \eta)$-data DB variability, are proposed to measure the decision boundary variability from the algorithm and data perspectives. Extensive experiments show significant negative correlations between the decision boundary variability and the generalizability. From the theoretical view, two lower bounds based on algorithm DB variability are proposed and do not explicitly depend on the sample size. We also prove an upper bound of order $\mathcal{O}\left(\frac{1}{\sqrt{m}}+\epsilon+\eta\log\frac{1}{\eta}\right)$ based on data DB variability. The bound is convenient to estimate without the requirement of labels, and does not explicitly depend on the network size which is usually prohibitively large in deep learning.
    Latent Topology Induction for Understanding Contextualized Representations. (arXiv:2206.01512v1 [cs.CL])
    In this work, we study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models. We show there exists a network of latent states that summarize linguistic properties of contextualized representations. Instead of seeking alignments to existing well-defined annotations, we infer this latent network in a fully unsupervised way using a structured variational autoencoder. The induced states not only serve as anchors that mark the topology (neighbors and connectivity) of the representation manifold but also reveal the internal mechanism of encoding sentences. With the induced network, we: (1) decompose the representation space into a spectrum of latent states which encode fine-grained word meanings with lexical, morphological, syntactic and semantic information; (2) show that state-state transitions encode rich phrase constructions and serve as the backbones of the latent space. Putting the two together, we show that sentences are represented as a traversal over the latent network where state-state transition chains encode syntactic templates and state-word emissions fill in the content. We demonstrate these insights with extensive experiments and visualizations.
    Canonical convolutional neural networks. (arXiv:2206.01509v1 [cs.LG])
    We introduce canonical weight normalization for convolutional neural networks. Inspired by the canonical tensor decomposition, we express the weight tensors in so-called canonical networks as scaled sums of outer vector products. In particular, we train network weights in the decomposed form, where scale weights are optimized separately for each mode. Additionally, similarly to weight normalization, we include a global scaling parameter. We study the initialization of the canonical form by running the power method and by drawing randomly from Gaussian or uniform distributions. Our results indicate that we can replace the power method with cheaper initializations drawn from standard distributions. The canonical re-parametrization leads to competitive normalization performance on the MNIST, CIFAR10, and SVHN data sets. Moreover, the formulation simplifies network compression. Once training has converged, the canonical form allows convenient model-compression by truncating the parameter sums.
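    As a rough illustration of "scaled sums of outer vector products", the sketch below builds a three-mode weight tensor from rank-one terms and compresses it by truncating the sum. The function name, the single scale per term, and the three-mode shape are assumptions made for this example; the abstract does not give the paper's exact parametrization (it mentions scale weights per mode, which this sketch simplifies).

```python
def canonical_tensor(scales, a, b, c, global_scale=1.0, rank=None):
    """Build W[i][j][k] = g * sum_r s_r * a[r][i] * b[r][j] * c[r][k].

    scales: one scale s_r per rank-one term (simplifying assumption).
    a, b, c: lists of factor vectors, one vector per term and mode.
    rank: if given, keep only the first `rank` terms (truncation).
    """
    terms = len(scales) if rank is None else min(rank, len(scales))
    return [[[global_scale * sum(scales[r] * a[r][i] * b[r][j] * c[r][k]
                                 for r in range(terms))
              for k in range(len(c[0]))]
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]


# Two rank-one terms for a 2 x 2 x 1 tensor.
scales = [2.0, 0.5]
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 1.0], [1.0, -1.0]]
c = [[1.0], [2.0]]

full = canonical_tensor(scales, a, b, c)
compressed = canonical_tensor(scales, a, b, c, rank=1)
```

    Truncating to `rank=1` drops the second term entirely, which is the kind of post-training compression the abstract alludes to.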
    Can Hybrid Geometric Scattering Networks Help Solve the Maximal Clique Problem? (arXiv:2206.01506v1 [cs.LG])
    We propose a geometric scattering-based graph neural network (GNN) for approximating solutions of the NP-hard maximal clique (MC) problem. We construct a loss function with two terms, one which encourages the network to find a large set of nodes and the other which acts as a surrogate for the constraint that the nodes form a clique. We then use this loss to train a novel GNN architecture that outputs a vector representing the probability for each node to be part of the MC and apply a rule-based decoder to make our final prediction. The incorporation of the scattering transform alleviates the so-called oversmoothing problem that is often encountered in GNNs and would degrade the performance of our proposed setup. Our empirical results demonstrate that our method outperforms representative GNN baselines in terms of solution accuracy and inference speed as well as conventional solvers like GUROBI with limited time budgets.
    Transferring Studies Across Embodiments: A Case Study in Confusion Detection. (arXiv:2206.01493v1 [cs.HC])
    Human-robot studies are expensive to conduct and difficult to control, and as such researchers sometimes turn to human-avatar interaction in the hope of faster and cheaper data collection that can be transferred to the robot domain. In our work, we are particularly interested in the challenge of detecting and modelling user confusion in interaction, and as part of this research programme, we conducted situated dialogue studies to investigate users' reactions to confusing scenarios presented in both physical and virtual environments. In this paper, we present a combined review of these studies and the results that we observed across these two embodiments. For the physical embodiment, we used a Pepper Robot, while for the virtual modality, we used a 3D avatar. Our study shows that despite attitudinal differences and technical control limitations, there were a number of similarities detected in user behaviour and self-reporting results across embodiment options. This work suggests that, while avatar interaction is no true substitute for robot interaction studies, sufficient care in study design may allow well executed human-avatar studies to supplement more challenging human-robot studies.
    Causality Learning With Wasserstein Generative Adversarial Networks. (arXiv:2206.01496v1 [cs.LG])
    Conventional methods for causal structure learning from data face significant challenges due to the combinatorial search space. Recently, the problem has been formulated into a continuous optimization framework with an acyclicity constraint to learn Directed Acyclic Graphs (DAGs). Such a framework allows the utilization of deep generative models for causal structure learning to better capture the relations between data sample distributions and DAGs. However, so far no study has experimented with the use of Wasserstein distance in the context of causal structure learning. Our model named DAG-WGAN combines the Wasserstein-based adversarial loss with an acyclicity constraint in an auto-encoder architecture. It simultaneously learns causal structures while improving its data generation capability. We compare the performance of DAG-WGAN with other models that do not involve the Wasserstein metric in order to identify its contribution to causal structure learning. Our model performs better with high cardinality data according to our experiments.
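    To make the idea of a continuous acyclicity constraint concrete, here is a minimal sketch of one commonly used smooth measure, the polynomial form $h(W) = \mathrm{tr}\left[(I + W \circ W / d)^d\right] - d$, which is zero exactly when the weighted graph has no directed cycle. Whether DAG-WGAN uses this precise form is an assumption; the abstract only says that an acyclicity constraint is included. The function name is hypothetical.

```python
def acyclicity_penalty(weights):
    """Return h(W) = tr((I + (W*W)/d)^d) - d for a d x d weight matrix.

    `weights` is a list of lists; W*W denotes the element-wise square.
    h(W) == 0 when the graph is acyclic and h(W) > 0 when it has a cycle.
    """
    d = len(weights)
    # M = I + (element-wise square of W) / d
    m = [[(1.0 if i == j else 0.0) + weights[i][j] ** 2 / d for j in range(d)]
         for i in range(d)]
    # Start from the identity and multiply by M exactly d times.
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(d):
        power = [[sum(power[i][k] * m[k][j] for k in range(d))
                  for j in range(d)]
                 for i in range(d)]
    return sum(power[i][i] for i in range(d)) - d


dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]      # edges 0 -> 1 -> 2, no cycle
cycle = [[0.0, 1.0],
         [1.0, 0.0]]         # edges 0 -> 1 and 1 -> 0

h_dag = acyclicity_penalty(dag)
h_cycle = acyclicity_penalty(cycle)
```

    In a training loop, a penalty of this kind would typically be added to the data-fitting loss (here, the adversarial loss) and driven towards zero.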
    Finding Rule-Interpretable Non-Negative Data Representation. (arXiv:2206.01483v1 [cs.LG])
    Non-negative Matrix Factorization (NMF) is an intensively used technique for obtaining a parts-based, lower-dimensional and non-negative representation of non-negative data. It is a popular method in different research fields. Scientists performing research in the fields of biology, medicine and pharmacy often prefer NMF over other dimensionality reduction approaches (such as PCA) because the non-negativity of the approach naturally fits the characteristics of the domain problem and its result is easier to analyze and understand. Despite these advantages, it still can be hard to get an exact characterization and interpretation of the NMF's resulting latent factors due to their numerical nature. On the other hand, rule-based approaches are often considered more interpretable but lack the parts-based interpretation. In this work, we present a version of the NMF approach that merges rule-based descriptions with the advantages of the parts-based representation offered by the NMF approach. Given the numerical input data with non-negative entries and a set of rules with high entity coverage, the approach creates the lower-dimensional non-negative representation of the input data in such a way that its factors are described by the appropriate subset of the input rules. In addition to revealing important attributes for latent factors, it allows analyzing relations between these attributes and provides the exact numerical intervals or categorical values they take. The proposed approach provides numerous advantages in tasks such as focused embedding or performing supervised multi-label NMF.
    Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. (arXiv:2206.01246v1 [cond-mat.dis-nn])
    Generalization is one of the most important problems in deep learning (DL). In the overparameterized regime in neural networks, there exist many low-loss solutions that fit the training data equally well. The key question is which solution is more generalizable. Empirical studies showed a strong correlation between flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial in finding the flat solutions. To understand how SGD drives the learning system to flat solutions, we construct a simple model whose loss landscape has a continuous set of degenerate (or near degenerate) minima. By solving the Fokker-Planck equation of the underlying stochastic learning dynamics, we show that due to its strong anisotropy the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and batch-to-batch variation. We find that the additional landscape-dependent SGD-loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. Furthermore, a stronger SGD noise shortens the convergence time to the flat solutions. However, we identify an upper bound for the SGD noise beyond which the system fails to converge. Our results not only elucidate the role of SGD for generalization but may also have important implications for hyperparameter selection for learning efficiently without divergence.
    Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages. (arXiv:2206.01205v1 [eess.AS])
    Automatic Speech Recognition (ASR) has increasing utility in the modern world. There are many ASR models available for languages with large amounts of training data like English. However, low-resource languages are poorly represented. In response, we create and release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits and train and analyze two competitive ASR models to serve as the baseline for future research using this data.
    Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks. (arXiv:2206.01278v1 [cs.LG])
    A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e. random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP both through the lens of the data distribution and the loss landscape geometry. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP. Combined, these results provide new insight into the role played by the early phase training in IMP.
    Exponential Separations in Symmetric Neural Networks. (arXiv:2206.01266v1 [cs.LG])
    In this work we demonstrate a novel separation between symmetric neural network architectures. Specifically, we consider the Relational Network (Santoro et al., 2017) architecture as a natural generalization of the DeepSets (Zaheer et al., 2017) architecture, and study their representational gap. Under the restriction to analytic activation functions, we construct a symmetric function acting on sets of size $N$ with elements in dimension $D$, which can be efficiently approximated by the former architecture, but provably requires width exponential in $N$ and $D$ for the latter.
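    The two architectures compared above can be sketched in their usual sum-pooling forms: DeepSets computes $\rho\left(\sum_i \phi(x_i)\right)$, while a Relational Network pools over pairs, $\rho\left(\sum_{i,j} \phi(x_i, x_j)\right)$. The toy feature maps below (`phi_single`, `phi_pair`) and the identity read-out are hypothetical stand-ins chosen for illustration, not the networks studied in the paper; `tanh` is used only because it is an analytic activation.

```python
import itertools
import math


def phi_single(x):
    # Hypothetical per-element feature map with an analytic activation.
    return math.tanh(sum(x))


def phi_pair(x, y):
    # Hypothetical pairwise feature map: activation of the inner product.
    return math.tanh(sum(a * b for a, b in zip(x, y)))


def deep_sets(points, rho=lambda s: s):
    # rho applied to a sum over single elements.
    return rho(sum(phi_single(x) for x in points))


def relational_network(points, rho=lambda s: s):
    # rho applied to a sum over all ordered pairs of elements.
    return rho(sum(phi_pair(x, y) for x in points for y in points))


points = [(0.5, -1.0), (2.0, 0.25), (-0.75, 1.5)]
ds_values = [deep_sets(list(p)) for p in itertools.permutations(points)]
rn_values = [relational_network(list(p)) for p in itertools.permutations(points)]
```

    Both outputs are unchanged under any reordering of the input set, up to floating-point rounding. If `phi_pair` ignored its second argument, the pairwise sum would reduce to $N$ times the DeepSets sum, which is the sense in which the pairwise form generalizes the single-element one.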
    Entangled Residual Mappings. (arXiv:2206.01261v1 [cs.LG])
    Residual mappings have been shown to perform representation learning in the first layers and iterative feature refinement in higher layers. This interplay, combined with their stabilizing effect on the gradient norms, enables them to train very deep networks. In this paper, we take a step further and introduce entangled residual mappings to generalize the structure of the residual connections and evaluate their role in iterative learning representations. An entangled residual mapping replaces the identity skip connections with specialized entangled mappings such as orthogonal, sparse, and structural correlation matrices that share key attributes (eigenvalues, structure, and Jacobian norm) with identity mappings. We show that while entangled mappings can preserve the iterative refinement of features across various deep models, they influence the representation learning process in convolutional networks differently than attention-based models and recurrent neural networks. In general, we find that for CNNs and Vision Transformers entangled sparse mapping can help generalization while orthogonal mappings hurt performance. For recurrent networks, orthogonal residual mappings form an inductive bias for time-variant sequences, which degrades accuracy on time-invariant tasks.
    Deep Learning Architecture Based Approach For 2D-Simulation of Microwave Plasma Interaction. (arXiv:2206.01263v1 [physics.comp-ph])
    This paper presents a convolutional neural network (CNN)-based deep learning model, inspired by UNet with a series of encoder and decoder units with skip connections, for the simulation of microwave-plasma interaction. The microwave propagation characteristics in a complex plasma medium pertaining to transmission, absorption and reflection primarily depend on the ratio of electromagnetic (EM) wave frequency and electron plasma frequency, and the plasma density profile. The scattering of a plane EM wave with fixed frequency (1 GHz) and amplitude incident on a plasma medium with different Gaussian density profiles (in the range of $1\times 10^{17}$ to $1\times 10^{22}\,\mathrm{m}^{-3}$) has been considered. The training data associated with microwave-plasma interaction has been generated using 2D-FDTD (Finite Difference Time Domain) based simulations. The trained deep learning model is then used to reproduce the scattered electric field values for the 1 GHz incident microwave on different plasma profiles with an error margin of less than 2%. We propose a complete deep learning (DL) based pipeline to train, validate and evaluate the model. We compare the results of the network, using various metrics like SSIM index, average percent error and mean square error, with the physical data obtained from well-established FDTD based EM solvers. To the best of our knowledge, this is the first effort towards exploring a DL based approach for the simulation of complex microwave plasma interaction. The deep learning technique proposed in this work is significantly faster than the existing computational techniques, and can be used as a new, prospective and alternative computational approach for investigating microwave-plasma interaction in a real time scenario.
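    The ratio of wave frequency to electron plasma frequency mentioned above can be estimated with the textbook cold-plasma formula $f_p = \frac{1}{2\pi}\sqrt{n_e e^2 / (\varepsilon_0 m_e)}$. The sketch below is a back-of-the-envelope check, not part of the paper's pipeline; the function names are hypothetical, and a collisionless, unmagnetized plasma is assumed.

```python
import math

E_CHARGE = 1.602176634e-19     # elementary charge, C
EPS0 = 8.8541878128e-12        # vacuum permittivity, F/m
M_ELECTRON = 9.1093837015e-31  # electron mass, kg


def plasma_frequency_hz(density_m3):
    """Electron plasma frequency f_p (Hz) for electron density n_e (m^-3)."""
    omega_p = math.sqrt(density_m3 * E_CHARGE ** 2 / (EPS0 * M_ELECTRON))
    return omega_p / (2 * math.pi)


def critical_density_m3(wave_frequency_hz):
    """Density at which f_p equals the given wave frequency."""
    omega = 2 * math.pi * wave_frequency_hz
    return EPS0 * M_ELECTRON * omega ** 2 / E_CHARGE ** 2


f_low = plasma_frequency_hz(1e17)    # lower end of the quoted density range
f_high = plasma_frequency_hz(1e22)   # upper end of the quoted density range
n_crit = critical_density_m3(1e9)    # critical density for a 1 GHz wave
```

    Under these simplifying assumptions the plasma frequency already exceeds 1 GHz at the lowest quoted peak density, so the wave meets a cut-off layer wherever the local density rises above the critical value of roughly $10^{16}\,\mathrm{m}^{-3}$.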
  • Open

    Rashomon Capacity: A Metric for Predictive Multiplicity in Probabilistic Classification. (arXiv:2206.01295v1 [cs.LG])
    Predictive multiplicity occurs when classification models with nearly indistinguishable average performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new measure of predictive multiplicity in probabilistic classification called Rashomon Capacity. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure, report, and ultimately resolve predictive multiplicity prior to model deployment.
    Regularization-wise double descent: Why it occurs and how to eliminate it. (arXiv:2206.01378v1 [cs.LG])
    The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a superposition of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double-descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
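    The phrase "scaling the regularization strength of each part" can be illustrated with a tiny ridge regression in which two features (standing in for two parts of the model) get their own penalties. This is a minimal sketch under that simplification; the function name and the two-feature setup are assumptions made for the example, not the paper's experimental setting.

```python
def ridge_two_parts(x1, x2, y, lam1, lam2):
    """Ridge regression with two scalar features and separate penalties.

    Minimises sum_i (y_i - w1*x1_i - w2*x2_i)^2 + lam1*w1^2 + lam2*w2^2
    by solving the 2x2 normal equations (X^T X + diag(lam1, lam2)) w = X^T y
    with Cramer's rule.
    """
    a11 = sum(a * a for a in x1) + lam1
    a22 = sum(b * b for b in x2) + lam2
    a12 = sum(a * b for a, b in zip(x1, x2))
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(b * t for b, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


x1 = [1.0, 0.0, 1.0, 2.0]
x2 = [0.0, 1.0, 1.0, 1.0]
y = [2.0 * a + 3.0 * b for a, b in zip(x1, x2)]  # noiseless targets

w_plain = ridge_two_parts(x1, x2, y, lam1=0.0, lam2=0.0)
w_split = ridge_two_parts(x1, x2, y, lam1=0.0, lam2=1e6)
```

    With no penalty the true weights are recovered; a large penalty on the second part alone shrinks only that weight towards zero, which a single shared strength cannot do.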
    Rate-Optimal Online Convex Optimization in Adaptive Linear Control. (arXiv:2206.01426v1 [cs.LG])
    We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs and full feedback of both the state and cost function. We present the first computationally-efficient algorithm that attains an optimal $\smash{\sqrt{T}}$-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs such as strong convexity. Our approach is based on a careful design of non-convex lower confidence bounds for the online costs, and uses a novel technique for computationally-efficient regret minimization of these bounds that leverages their particular non-convex structure.
    Optimal Activation Functions for the Random Features Regression Model. (arXiv:2206.01332v1 [stat.ML])
    The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.
    Excess risk analysis for epistemic uncertainty with application to variational inference. (arXiv:2206.01606v1 [stat.ML])
    We analyze the epistemic uncertainty (EU) of supervised learning in Bayesian inference by focusing on the excess risk. Existing analysis is limited to the Bayesian setting, which assumes a correct model and exact Bayesian posterior distribution. Thus we cannot apply the existing theory to modern Bayesian algorithms, such as variational inference. To address this, we present a novel EU analysis in the frequentist setting, where data is generated from an unknown distribution. We show a relation between the generalization ability and the widely used EU measurements, such as the variance and entropy of the predictive distribution. Then we show their convergence behaviors theoretically. Finally, we propose a new variational inference method that directly controls the prediction and EU evaluation performances based on the PAC-Bayesian theory. Numerical experiments show that our algorithm significantly improves the EU evaluation over the existing methods.
    The geometry of integration in text classification RNNs. (arXiv:2010.15114v2 [cs.LG] UPDATED)
    Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.
    The price of unfairness in linear bandits with biased feedback. (arXiv:2203.09784v2 [math.ST] UPDATED)
    In this paper, we study the problem of fair sequential decision making with biased linear bandit feedback. At each round, a player selects an action described by a covariate and by a sensitive attribute. The perceived reward is a linear combination of the covariates of the chosen action, but the player only observes a biased evaluation of this reward, depending on the sensitive attribute. To characterize the difficulty of this problem, we design a phased elimination algorithm that corrects the unfair evaluations, and establish upper bounds on its regret. We show that the worst-case regret is smaller than $\mathcal{O}(\kappa_*^{1/3}\log(T)^{1/3}T^{2/3})$, where $\kappa_*$ is an explicit geometrical constant characterizing the difficulty of bias estimation. We prove lower bounds on the worst-case regret for some sets of actions showing that this rate is tight up to a possible sub-logarithmic factor. We also derive gap-dependent upper bounds on the regret, and matching lower bounds for some problem instances. Interestingly, these results reveal a transition between a regime where the problem is as difficult as its unbiased counterpart, and a regime where it can be much harder.
    Equipping Black-Box Policies with Model-Based Advice for Stable Nonlinear Control. (arXiv:2206.01341v1 [cs.LG])
    Machine-learned black-box policies are ubiquitous for nonlinear control problems. Meanwhile, crude model information is often available for these problems from, e.g., linear approximations of nonlinear dynamics. We study the problem of equipping a black-box control policy with model-based advice for nonlinear control on a single trajectory. We first show a general negative result that a naive convex combination of a black-box policy and a linear model-based policy can lead to instability, even if the two policies are both stabilizing. We then propose an adaptive $\lambda$-confident policy, with a coefficient $\lambda$ indicating the confidence in a black-box policy, and prove its stability. With bounded nonlinearity, in addition, we show that the adaptive $\lambda$-confident policy achieves a bounded competitive ratio when a black-box policy is near-optimal. Finally, we propose an online learning approach to implement the adaptive $\lambda$-confident policy and verify its efficacy in case studies about the CartPole problem and a real-world electric vehicle (EV) charging problem with data bias due to COVID-19.
    Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (arXiv:2104.05097v5 [cs.LG] UPDATED)
    Lipschitz constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they remain commonly considered as less accurate, and their properties in learning are still not fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric which reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties that pave the way toward provably accurate and provably robust neural networks.
    Sequential Permutation Testing of Random Forest Variable Importance Measures. (arXiv:2206.01284v1 [stat.ME])
    Hypothesis testing of random forest (RF) variable importance measures (VIMP) remains the subject of ongoing research. Among recent developments, heuristic approaches to parametric testing have been proposed whose distributional assumptions are based on empirical evidence. Other formal tests under regularity conditions were derived analytically. However, these approaches can be computationally expensive or even practically infeasible. This problem also occurs with non-parametric permutation tests, which are, however, distribution-free and can generically be applied to any type of RF and VIMP. Embracing this advantage, it is proposed here to use sequential permutation tests and sequential p-value estimation to reduce the high computational costs associated with conventional permutation tests. The popular and widely used permutation VIMP serves as a practical and relevant application example. The results of simulation studies confirm that the theoretical properties of the sequential tests apply, that is, the type-I error probability is controlled at a nominal level and a high power is maintained with considerably fewer permutations needed in comparison to conventional permutation testing. The numerical stability of the methods is investigated in two additional application studies. In summary, theoretically sound sequential permutation testing of VIMP is possible at greatly reduced computational costs. Recommendations for application are given. A respective implementation is provided through the accompanying R package rfvimptest. The approach can also be easily applied to any kind of prediction model.
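    To show how sequential p-value estimation can cut the number of permutations, the sketch below uses one classical early-stopping rule in the style of Besag and Clifford: permuting stops as soon as `h` permuted statistics reach the observed one. The abstract does not say which sequential procedures the paper or the rfvimptest package implement, so this rule is an assumption, as are the function names; a simple mean difference stands in for a random forest VIMP.

```python
import random


def sequential_permutation_pvalue(x, y, statistic, h=5, max_perms=200, seed=0):
    """Estimate a one-sided permutation p-value with early stopping.

    Returns (p_value, permutations_used). Stops once `h` permuted
    statistics are at least as large as the observed statistic.
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    exceed = 0
    for n_used in range(1, max_perms + 1):
        rng.shuffle(pooled)
        if statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            exceed += 1
            if exceed == h:
                # Early stop: the observed statistic is clearly unremarkable.
                return h / n_used, n_used
    # Permutation budget exhausted: usual (count + 1) / (n + 1) estimate.
    return (exceed + 1) / (max_perms + 1), max_perms


def mean_difference(a, b):
    # Stand-in test statistic (not a random forest importance measure).
    return sum(a) / len(a) - sum(b) / len(b)


# No effect at all: every permutation ties the observed statistic.
p_null, n_null = sequential_permutation_pvalue([1.0] * 6, [1.0] * 6, mean_difference)

# Strong effect: the two groups are completely separated.
high = [float(v) for v in range(10, 20)]
low = [float(v) for v in range(0, 10)]
p_eff, n_eff = sequential_permutation_pvalue(high, low, mean_difference)
```

    In the no-effect case the procedure stops after only `h` permutations, while the full budget is spent only when the observed statistic is rarely matched, which is where the savings over a fixed number of permutations come from.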
    Generalization for multiclass classification with overparameterized linear models. (arXiv:2206.01399v1 [cs.LG])
    Via an overparameterized linear model with Gaussian features, we provide conditions for good generalization for multiclass classification of minimum-norm interpolating solutions in an asymptotic setting where both the number of underlying features and the number of classes scale with the number of training points. The survival/contamination analysis framework for understanding the behavior of overparameterized learning problems is adapted to this setting, revealing that multiclass classification qualitatively behaves like binary classification in that, as long as there are not too many classes (made precise in the paper), it is possible to generalize well even in some settings where the corresponding regression tasks would not generalize. Besides various technical challenges, it turns out that the key difference from the binary classification setting is that there are relatively fewer positive training examples of each class in the multiclass setting as the number of classes increases, making the multiclass problem "harder" than the binary one.
    HEX: Human-in-the-loop Explainability via Deep Reinforcement Learning. (arXiv:2206.01343v1 [cs.LG])
    The use of machine learning (ML) models in decision-making contexts, particularly in high-stakes decision-making, is fraught with issues and peril, since a person, not a machine, must ultimately be held accountable for the consequences of the decisions made using such systems. Machine learning explainability (MLX) promises to provide decision-makers with prediction-specific rationale, assuring them that the model-elicited predictions are made for the right reasons and are thus reliable. Few works explicitly consider this key human-in-the-loop (HITL) component, however. In this work we propose HEX, a human-in-the-loop deep reinforcement learning approach to MLX. HEX incorporates 0-distrust projection to synthesize decider-specific explanation-providing policies from any arbitrary classification model. HEX is also constructed to operate in limited or reduced training data scenarios, such as those employing federated learning. Our formulation explicitly considers the decision boundary of the ML model in question, rather than the underlying training data, which is a shortcoming of many model-agnostic MLX methods. Our proposed methods thus synthesize HITL MLX policies that explicitly capture the decision boundary of the model in question for use in limited data scenarios.
    Hybrid Models for Mixed Variables in Bayesian Optimization. (arXiv:2206.01409v1 [cs.LG])
    We systematically describe the problem of simultaneous surrogate modeling of mixed variables (i.e., continuous, integer and categorical variables) in the Bayesian optimization (BO) context. We provide a unified hybrid model using both Monte-Carlo tree search (MCTS) and Gaussian processes (GP) that encompasses and generalizes multiple state-of-the-art mixed BO surrogates. Based on the architecture, we propose applying a new dynamic model selection criterion among novel candidate families of covariance kernels, including non-stationary kernels and associated families. Different benchmark problems are studied and presented to support the superiority of our model, along with results highlighting the effectiveness of our method compared to most state-of-the-art mixed-variable methods in BO.
    Approximate Network Motif Mining Via Graph Learning. (arXiv:2206.01008v1 [cs.LG] CROSS LISTED)
    Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many graph datasets. However, the high computational complexity of identifying motif sets in arbitrary datasets (motif mining) has limited their use in many real-world datasets. By automatically leveraging statistical properties of datasets, machine learning approaches have shown promise in several tasks with combinatorial complexity and are therefore a promising candidate for network motif mining. In this work we seek to facilitate the development of machine learning approaches aimed at motif mining. We propose a formulation of the motif mining problem as a node labelling task. In addition, we build benchmark datasets and evaluation metrics which test the ability of models to capture different aspects of motif discovery such as motif number, size, topology, and scarcity. Next, we propose MotiFiesta, a first attempt at solving this problem in a fully differentiable manner with promising results on challenging baselines. Finally, we demonstrate through MotiFiesta that this learning setting can be applied simultaneously to general-purpose data mining and interpretable feature extraction for graph classification tasks.
    Sample-Efficient Reinforcement Learning of Partially Observable Markov Games. (arXiv:2206.01315v1 [cs.LG])
    This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
    Reinforcement Learning with Fast Stabilization in Linear Dynamical Systems. (arXiv:2007.12291v2 [cs.LG] UPDATED)
    In this work, we study model-based reinforcement learning (RL) in unknown stabilizable linear dynamical systems. When learning a dynamical system, one needs to stabilize the unknown dynamics in order to avoid system blow-ups. We propose an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment with an improved exploration strategy. We show that the proposed algorithm attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret after $T$ time steps of agent-environment interaction. We also show that the regret of the proposed algorithm has only a polynomial dependence in the problem dimensions, which gives an exponential improvement over the prior methods. Our improved exploration method is simple, yet efficient, and it combines a sophisticated exploration policy in RL with an isotropic exploration strategy to achieve fast stabilization and improved regret. We empirically demonstrate that the proposed algorithm outperforms other popular methods in several adaptive control tasks.
    On the Benefits of Large Learning Rates for Kernel Methods. (arXiv:2202.13733v2 [stat.ML] UPDATED)
    This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. We show that this phenomenon, first observed in the deep learning literature, can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this benefit already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.
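The role of the learning rate in the spectral decomposition can be illustrated with a finite-dimensional sketch. This is the standard textbook computation for gradient descent on a quadratic, not the paper's Hilbert-space statement; the symbols $H$, $w^\star$, $\eta$, $\lambda_i$ and $v_i$ are assumptions introduced here for illustration.

```latex
% Quadratic objective with positive semi-definite Hessian H and minimizer w^\star:
%   f(w) = \tfrac{1}{2} (w - w^\star)^\top H (w - w^\star),
%   H v_i = \lambda_i v_i .
% Gradient descent with learning rate \eta:
\begin{align*}
  w_{t+1} - w^\star &= (I - \eta H)\,(w_t - w^\star), \\
  \langle v_i,\, w_t - w^\star \rangle
    &= (1 - \eta \lambda_i)^t \,\langle v_i,\, w_0 - w^\star \rangle .
\end{align*}
% Stopped at a finite t, the residual along v_i depends on \eta through
% |1 - \eta \lambda_i|: a small \eta leaves low-eigenvalue directions almost
% untouched, while \eta close to 2 / \lambda_{\max} makes the top directions
% contract slowly, so the early-stopped iterate has a different spectral profile.
```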
    Prescriptive maintenance with causal machine learning. (arXiv:2206.01562v1 [econ.GN])
    Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on strong assumptions regarding the effect of maintenance on the machine's condition, assuming the effect is (1) deterministic or governed by a known probability distribution, and (2) machine-independent. This work proposes to relax both assumptions by learning the effect of maintenance conditional on a machine's characteristics from observational data on similar machines using existing methodologies for causal inference. By predicting the maintenance effect, we can estimate the number of overhauls and failures for different levels of maintenance and, consequently, optimize the preventive maintenance frequency to minimize the total estimated cost. We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner. Empirical results show that our novel, causal approach accurately predicts the maintenance effect and results in individualized maintenance schedules that are more accurate and cost-effective than supervised or non-individualized approaches.
    Instance-dependent Label-noise Learning under a Structural Causal Model. (arXiv:2109.02986v3 [stat.ML] UPDATED)
    Label noise degrades the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful to learn the classifier and thus reduce the side effect of label noise. However, it remains elusive how to exploit the causal information to handle the label noise problem. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.
    Understanding deep learning via decision boundary. (arXiv:2206.01515v1 [cs.LG])
    This paper finds that neural networks with lower decision boundary (DB) variability generalize better. Two new notions, algorithm DB variability and $(\epsilon, \eta)$-data DB variability, are proposed to measure the decision boundary variability from the algorithm and data perspectives. Extensive experiments show significant negative correlations between the decision boundary variability and the generalizability. From the theoretical view, two lower bounds based on algorithm DB variability are proposed and do not explicitly depend on the sample size. We also prove an upper bound of order $\mathcal{O}\left(\frac{1}{\sqrt{m}}+\epsilon+\eta\log\frac{1}{\eta}\right)$ based on data DB variability. The bound is convenient to estimate without the requirement of labels, and does not explicitly depend on the network size which is usually prohibitively large in deep learning.
    Beyond Tabula Rasa: Reincarnating Reinforcement Learning. (arXiv:2206.01626v1 [cs.LG])
    Learning tabula rasa, that is without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally-demanding problems. To address these issues, we present reincarnating RL as an alternative workflow, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further.
    Learning with convolution and pooling operations in kernel methods. (arXiv:2111.08308v2 [stat.ML] UPDATED)
    Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for their success is that these architectures encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge. In this paper, we consider the stylized setting of covariates (image pixels) uniformly distributed on the hypercube, and characterize exactly the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We use this characterization to compute sharp asymptotics of the generalization error for any given function in high-dimension. In particular, we quantify the gain in sample complexity brought by enforcing locality with the convolution operation and approximate translation invariance with average pooling. Notably, these results provide a precise description of how convolution and pooling operations trade off approximation with generalization power in one layer convolutional kernels.
    Rethinking Class-Prior Estimation for Positive-Unlabeled Learning. (arXiv:2002.03673v2 [cs.LG] UPDATED)
    Given only positive (P) and unlabeled (U) data, PU learning can train a binary classifier without any negative data. It has two building blocks: PU class-prior estimation (CPE) and PU classification; the latter has been well studied while the former has received less attention. Hitherto, the distributional-assumption-free CPE methods rely on a critical assumption that the support of the positive data distribution cannot be contained in the support of the negative data distribution. If this is violated, those CPE methods will systematically overestimate the class prior; worse still, we cannot verify the assumption based on the data. In this paper, we rethink CPE for PU learning: can we remove the assumption to make CPE always valid? We show an affirmative answer by proposing Regrouping CPE (ReCPE) that builds an auxiliary probability distribution such that the support of the positive data distribution is never contained in the support of the negative data distribution. ReCPE can work with any CPE method by treating it as the base method. Theoretically, ReCPE does not affect its base if the assumption already holds for the original probability distribution; otherwise, it reduces the positive bias of its base. Empirically, ReCPE improves all state-of-the-art CPE methods on various datasets, implying that the assumption has indeed been violated here.
    Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games. (arXiv:2206.01588v1 [cs.LG])
    We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result (Liu et al., 2022), we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.
    An alternative approach to train neural networks using monotone variational inequality. (arXiv:2202.08876v2 [stat.ML] UPDATED)
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing guarantees on testing performance, due to the non-convex nature of the optimization problem. The current paper investigates an alternative approach to neural network training by reducing it to another problem with convex structure -- solving a monotone variational inequality (MVI) -- inspired by recent work of Juditsky & Nemirovsky (2019). The solution to MVI can be found by computationally efficient procedures, and importantly, this leads to performance guarantees in the form of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery and prediction accuracy under the theoretical setting of training a single-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called stochastic variational inequality (SVI), and demonstrate its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to widely-used stochastic gradient descent methods on both synthetic and real network data prediction tasks regarding various performance metrics, especially in the improved efficiency in the early stage of training.
    BaCaDI: Bayesian Causal Discovery with Unknown Interventions. (arXiv:2206.01665v1 [cs.LG])
    Learning causal structures from observation and experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, a key challenge is that often the targets of the interventions are uncertain or unknown. Thus, standard causal discovery methods can no longer be used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering the causal structure that underlies data generated under various unknown experimental/interventional conditions. BaCaDI is fully differentiable and operates in the continuous space of latent probabilistic representations of both causal structures and interventions. This enables us to approximate complex posteriors via gradient-based variational inference and to reason about the epistemic uncertainty in the predicted structure. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets. Finally, we demonstrate that, thanks to its rigorous Bayesian approach, our method provides well-calibrated uncertainty estimates.
    Robust Multi-Objective Bayesian Optimization Under Input Noise. (arXiv:2202.07549v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) is a sample-efficient approach for tuning design parameters to optimize expensive-to-evaluate, black-box performance metrics. In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected. Although BO methods have been proposed for optimizing a single objective under input noise, no existing method addresses the practical scenario where there are multiple objectives that are sensitive to input perturbations. In this work, we propose the first multi-objective BO method that is robust to input noise. We formalize our goal as optimizing the multivariate value-at-risk (MVaR), a risk measure of the uncertain objectives. Since directly optimizing MVaR is computationally infeasible in many settings, we propose a scalable, theoretically-grounded approach for optimizing MVaR using random scalarizations. Empirically, we find that our approach significantly outperforms alternative methods and efficiently identifies optimal robust designs that will satisfy specifications across multiple metrics with high probability.
    MCD: Marginal Contrastive Discrimination for conditional density estimation. (arXiv:2206.01592v1 [stat.ML])
    We consider the problem of conditional density estimation, which is a major topic of interest in the fields of statistical and machine learning. Our method, called Marginal Contrastive Discrimination (MCD), reformulates the conditional density function into two factors, the marginal density function of the target variable and a ratio of density functions which can be estimated through binary classification. Like noise-contrastive methods, MCD can leverage state-of-the-art supervised learning techniques to perform conditional density estimation, including neural networks. Our benchmark reveals that our method significantly outperforms existing methods in practice on most density models and regression datasets.
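The two-factor reformulation described above can be written out as a short worked identity. This is a sketch of the standard density-ratio trick under the assumption of balanced classes; the symbols $r$ and $q$ and the exact contrast (joint samples versus samples from the product of marginals) are illustrative assumptions, not taken from the abstract.

```latex
% Factorization of the conditional density into a marginal and a ratio:
\begin{align*}
  p(y \mid x) &= p(y)\, r(x, y),
  & r(x, y) &= \frac{p(x, y)}{p(x)\, p(y)} .
\end{align*}
% Binary classification view (assumed balanced classes): label pairs drawn
% from the joint p(x, y) as 1 and pairs drawn from p(x) p(y) as 0. The Bayes
% classifier is
\begin{align*}
  q(x, y) &= \frac{p(x, y)}{p(x, y) + p(x)\, p(y)},
  & r(x, y) &= \frac{q(x, y)}{1 - q(x, y)} ,
\end{align*}
% so an estimate of q from any probabilistic classifier, combined with an
% estimate of the marginal p(y), yields an estimate of p(y | x).
```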
    Efficient Mean Estimation with Pure Differential Privacy via a Sum-of-Squares Exponential Mechanism. (arXiv:2111.12981v2 [cs.DS] UPDATED)
    We give the first polynomial-time algorithm to estimate the mean of a $d$-variate probability distribution with bounded covariance from $\tilde{O}(d)$ independent samples subject to pure differential privacy. Prior algorithms for this problem either incur exponential running time, require $\Omega(d^{1.5})$ samples, or satisfy only the weaker concentrated or approximate differential privacy conditions. In particular, all prior polynomial-time algorithms require $d^{1+\Omega(1)}$ samples to guarantee small privacy loss with "cryptographically" high probability, $1-2^{-d^{\Omega(1)}}$, while our algorithm retains $\tilde{O}(d)$ sample complexity even in this stringent setting. Our main technique is a new approach to use the powerful Sum of Squares method (SoS) to design differentially private algorithms. "SoS proofs to algorithms" is a key theme in numerous recent works in high-dimensional algorithmic statistics -- estimators which apparently require exponential running time but whose analysis can be captured by low-degree Sum of Squares proofs can be automatically turned into polynomial-time algorithms with the same provable guarantees. We demonstrate a similar "proofs to private algorithms" phenomenon: instances of the workhorse exponential mechanism which apparently require exponential time but which can be analyzed with low-degree SoS proofs can be automatically turned into polynomial-time differentially private algorithms. We prove a meta-theorem capturing this phenomenon, which we expect to be of broad use in private algorithm design. Our techniques also draw new connections between differentially private and robust statistics in high dimensions. In particular, viewed through our proofs-to-private-algorithms lens, several well-studied SoS proofs from recent works in algorithmic robust statistics directly yield key components of our differentially private mean estimation algorithm.
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v2 [cs.LG] UPDATED)
    Although Thompson sampling is widely used in stationary environments, it does not effectively account for nonstationarities. To address this limitation, we propose predictive sampling, a policy that balances exploration and exploitation in nonstationary bandit environments. It is equivalent to Thompson sampling when specialized to stationary environments, but much more effective across a range of nonstationary environments because it deprioritizes investment in acquiring information that will quickly lose relevance. To offer insight into the efficacy of predictive sampling, we establish a regret bound. This bound highlights dependence on the rate at which new information arrives to alter the environment. In addition, we conduct experiments on bandit environments with varying rates of information arrival and observe that predictive sampling outperforms Thompson sampling.
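For readers unfamiliar with the baseline, the following is a minimal sketch of standard Thompson sampling for a stationary Bernoulli bandit with Beta(1, 1) priors. It is not the paper's predictive sampling policy; the function name `thompson_sampling`, the arm means, and the seed are all hypothetical choices made for illustration.

```python
import random


def thompson_sampling(true_means, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return per-arm pull counts.

    true_means: hypothetical success probabilities of each arm (stationary).
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one sample per arm from its Beta posterior (uniform prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

Because the posterior here never forgets old observations, this baseline keeps trusting stale data when arm means drift, which is the limitation the abstract attributes to Thompson sampling in nonstationary environments.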
    Truly Mesh-free Physics-Informed Neural Networks. (arXiv:2206.01545v1 [cs.LG])
    Physics-informed Neural Networks (PINNs) have recently emerged as a principled way to include prior physical knowledge in the form of partial differential equations (PDEs) into neural networks. Although generally viewed as being mesh-free, current approaches still rely on collocation points obtained within a bounded region, even in settings with spatially sparse signals. Furthermore, if the boundaries are not known, the selection of such a region may be arbitrary, resulting in a large proportion of collocation points being selected in areas of low relevance. To resolve this, we present a mesh-free and adaptive approach termed particle-density PINN (pdPINN), which is inspired by the microscopic viewpoint of fluid dynamics. Instead of sampling from a bounded region, we propose to sample directly from the distribution over the (fluid) particle positions, eliminating the need to introduce boundaries while adaptively focusing on the most relevant regions. This is achieved by reformulating the modeled fluid density as an unnormalized probability distribution from which we sample with dynamic Monte Carlo methods. We further generalize pdPINNs to different settings that allow interpreting a positive scalar quantity as a particle density, such as the evolution of the temperature in the heat equation. The utility of our approach is demonstrated on experiments for modeling (non-steady) compressible fluids in up to three dimensions and a two-dimensional diffusion problem, illustrating the high flexibility and sample efficiency compared to existing refinement methods for PINNs.
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. (arXiv:1902.04522v5 [cs.AI] UPDATED)
    The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available at https://ai.facebook.com/tools/elf-opengo/.
    Adaptive Learning for Discovery. (arXiv:2205.14829v2 [stat.ML] UPDATED)
    In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal to maximize the sum of responses. This problem has wide applications to real-world discovery problems, for example drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma. The algorithm needs to choose points that yield information to improve model estimates but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
    KCRL: Krasovskii-Constrained Reinforcement Learning with Guaranteed Stability in Nonlinear Dynamical Systems. (arXiv:2206.01704v1 [cs.LG])
    Learning a dynamical system requires stabilizing the unknown dynamics to avoid state blow-ups. However, current reinforcement learning (RL) methods lack stabilization guarantees, which limits their applicability for the control of safety-critical systems. We propose a model-based RL framework with formal stability guarantees, Krasovskii Constrained RL (KCRL), that adopts Krasovskii's family of Lyapunov functions as a stability constraint. The proposed method learns the system dynamics up to a confidence interval using feature representation, e.g. Random Fourier Features. It then solves a constrained policy optimization problem with a stability constraint based on Krasovskii's method using a primal-dual approach to recover a stabilizing policy. We show that KCRL is guaranteed to learn a stabilizing policy in a finite number of interactions with the underlying unknown system. We also derive the sample complexity upper bound for stabilization of unknown nonlinear dynamical systems via the KCRL framework.
    Offline Reinforcement Learning with Causal Structured World Models. (arXiv:2206.01474v1 [cs.LG])
    Model-based methods have recently shown promise for offline reinforcement learning (RL), aiming to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn fully connected nets as world-models that map the states and actions to the next-step states. However, it is sensible that a world-model should adhere to the underlying causal effect such that it will support learning an effective policy generalizing well in unseen states. In this paper, we first provide theoretical results that causal world-models can outperform plain world-models for offline RL by incorporating the causal structure into the generalization error bound. We then propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structure (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly. Consequently, it performs better than the plain model-based offline RL algorithms and other causal model-based RL algorithms.
    Optimal Weak to Strong Learning. (arXiv:2206.01563v1 [cs.LG])
    The classic AdaBoost algorithm converts a weak learner, that is, an algorithm that produces a hypothesis slightly better than chance, into a strong learner that achieves arbitrarily high accuracy when given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak to strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.
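As background on the weak-to-strong conversion, here is a minimal sketch of classic AdaBoost with one-dimensional decision stumps as the weak learner. It illustrates the baseline only, not the paper's new algorithm; the helper names (`stump_predict`, `adaboost`, `predict`), the toy dataset, and the threshold grid (which assumes integer-spaced inputs) are all hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak hypothesis: predict `polarity` left of the threshold, else its negation."""
    return polarity if x < threshold else -polarity


def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    # Candidate thresholds; this grid assumes integer-spaced inputs.
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(xs)]
    ensemble = []
    for _ in range(rounds):
        # Weak learner: pick the stump with the smallest weighted error.
        best = None
        for t in thresholds:
            for s in (1, -1):
                err = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if stump_predict(x, t, s) != y
                )
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # Up-weight misclassified points, down-weight correct ones, renormalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, t, s))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Strong hypothesis: sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single_errors = sum(predict(adaboost(xs, ys, 1), x) != y for x, y in zip(xs, ys))
boosted_errors = sum(predict(adaboost(xs, ys, 3), x) != y for x, y in zip(xs, ys))
print(single_errors, boosted_errors)
```

On this toy dataset no single stump separates the labels, but a weighted vote of three stumps does, which is the weak-to-strong effect in miniature.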
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v1 [stat.ML])
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a 'threshold of heavy-tailedness' such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
    Indirect Active Learning. (arXiv:2206.01454v1 [math.ST])
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
    Causal Transformer for Estimating Counterfactual Outcomes. (arXiv:2204.07258v2 [cs.LG] UPDATED)
    Estimating counterfactual outcomes over time from observational data is relevant for many applications (e.g., personalized medicine). Yet, state-of-the-art methods build upon simple long short-term memory (LSTM) networks, thus rendering inferences for complex, long-range dependencies challenging. In this paper, we develop a novel Causal Transformer for estimating counterfactual outcomes over time. Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders. For this, we combine three transformer subnetworks with separate inputs for time-varying covariates, previous treatments, and previous outcomes into a joint network with in-between cross-attentions. We further develop a custom, end-to-end training procedure for our Causal Transformer. Specifically, we propose a novel counterfactual domain confusion loss to address confounding bias: it aims to learn adversarial balanced representations, so that they are predictive of the next outcome but non-predictive of the current treatment assignment. We evaluate our Causal Transformer based on synthetic and real-world datasets, where it achieves superior performance over current baselines. To the best of our knowledge, this is the first work proposing transformer-based architecture for estimating counterfactual outcomes from longitudinal data.
    Hypothesis testing for matched pairs with missing data by maximum mean discrepancy: An application to continuous glucose monitoring. (arXiv:2206.01590v1 [stat.ME])
    A frequent problem in statistical science is how to properly handle missing data in matched paired observations. There is a large body of literature coping with the univariate case. Yet, the ongoing technological progress in measuring biological systems raises the need for addressing more complex data, e.g., graphs, strings and probability distributions, among others. In order to fill this gap, this paper proposes new estimators of the maximum mean discrepancy (MMD) to handle complex matched pairs with missing data. These estimators can detect differences in data distributions under different missingness mechanisms. The validity of this approach is proven and further studied in an extensive simulation study, and results of statistical consistency are provided. Data from continuous glucose monitoring in a longitudinal population-based diabetes study are used to illustrate the application of this approach. By employing the new distributional representations together with cluster analysis, new clinical criteria on how glucose changes vary at the distributional level over five years can be explored.
    Accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient. (arXiv:2206.01209v1 [math.OC])
    In this paper we develop accelerated first-order methods for convex optimization with locally Lipschitz continuous gradient (LLCG), which is beyond the well-studied class of convex optimization with Lipschitz continuous gradient. In particular, we first consider unconstrained convex optimization with LLCG and propose accelerated proximal gradient (APG) methods for solving it. The proposed APG methods are equipped with a verifiable termination criterion and enjoy an operation complexity of ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ and ${\cal O}(\log \varepsilon^{-1})$ for finding an $\varepsilon$-residual solution of an unconstrained convex and strongly convex optimization problem, respectively. We then consider constrained convex optimization with LLCG and propose a first-order proximal augmented Lagrangian method for solving it by applying one of our proposed APG methods to approximately solve a sequence of proximal augmented Lagrangian subproblems. The resulting method is equipped with a verifiable termination criterion and enjoys an operation complexity of ${\cal O}(\varepsilon^{-1}\log \varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-1/2}\log \varepsilon^{-1})$ for finding an $\varepsilon$-KKT solution of a constrained convex and strongly convex optimization problem, respectively. All the proposed methods in this paper are parameter-free or almost parameter-free except that the knowledge on convexity parameter is required. To the best of our knowledge, no prior studies were conducted to investigate accelerated first-order methods with complexity guarantees for convex optimization with LLCG. All the complexity results obtained in this paper are entirely new.

  • Open

    "The big new idea for making self-driving cars that can go anywhere: The mainstream approach to driverless cars is slow and difficult. These startups think going all-in on AI will get there faster"
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Planning with Diffusion for Flexible Behavior Synthesis", Janner
    submitted by /u/gwern [link] [comments]
    "3RL: Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline", Caccia et al 2022 {Amazon} (were complicated lifelong learning mechanisms ever necessary?)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Boosting Search Engines with Interactive Agents", Ciaramita et al 2022 {G} (MuZero & Decision-Transformer T5 for sequences of queries)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Some help in streaming large dataset for RL
    Hi, everyone! I'm struggling to find information regarding a particular training setup. I would like to use reinforcement learning on a large historic dataset, i.e. offline RL. The data currently is between 1 and 5 GB per day and it goes back ~20 years. I am not able to store the full dataset locally as it will soon more than double in size. I can access the data via an API which can stream the entire history at 20 million messages per second. This means that retrieving the data on the fly and also replaying everything isn't an issue. How do I go about integrating the API call into my project? What are the terms used to describe what I'm trying to do, so I can find the relevant documentation? For now, either TensorFlow or PyTorch frameworks are fine for me. But if other frameworks are preferable to achieve what I want please feel free to suggest. Thanks a lot in advance for any response. submitted by /u/gigio_s [link] [comments]  ( 2 min )
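    The setup described above is usually called streaming (or out-of-core) training; in PyTorch the matching concept is an iterable-style dataset (`torch.utils.data.IterableDataset`), and in TensorFlow a generator-backed `tf.data.Dataset`. A minimal framework-free sketch of the idea follows. It assumes a hypothetical `fetch_messages(day)` function standing in for the real API, which is not described in the post, and uses a bounded shuffle buffer so memory use stays constant.

```python
import random
from typing import Iterable, Iterator, List


def fetch_messages(day: int) -> Iterator[dict]:
    """Hypothetical stand-in for the streaming API: yields one day's messages."""
    for step in range(5):
        yield {"day": day, "step": step, "reward": float(step)}


def stream_history(days: Iterable[int]) -> Iterator[dict]:
    """Replay the history day by day without ever holding it all in memory."""
    for day in days:
        yield from fetch_messages(day)


def shuffle_buffer(stream: Iterable[dict], buffer_size: int, seed: int = 0) -> Iterator[dict]:
    """Approximate shuffling: keep at most `buffer_size` items and emit a random one."""
    rng = random.Random(seed)
    buffer: List[dict] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            idx = rng.randrange(len(buffer))
            # Swap the chosen item to the end so pop() is O(1).
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer


def batched(stream: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a stream into lists of `batch_size` items; the last batch may be smaller."""
    batch: List[dict] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

    Chaining the three generators, `batched(shuffle_buffer(stream_history(days), buffer_size), batch_size)`, gives a training loop something to iterate over. Wrapping the same generator in an iterable-style dataset class would be the natural next step in PyTorch.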
    Hedging derivatives with Deep RL
    submitted by /u/SatoshiNotMe [link] [comments]  ( 1 min )
    Human based reward mechanisms
    I was wondering where I could find the latest research into human based reward systems. I’m more specifically curious to find whether this has any application to subjective domains such as music and art where a well defined reward is hard to establish and where direct human input seems necessary. submitted by /u/Lopside1 [link] [comments]  ( 2 min )
  • Open

    Comparing data fabrics, data meshes and knowledge graphs
    Vendors, consultants, and their clients have been talking in data fabric terms for close to a decade now, if not longer. If “big data” was the problem to solve, then a data fabric suggested a ready solution. John Mashey, then chief scientist at Silicon Graphics, used the term “big data” to describe the wave of… The post Comparing data fabrics, data meshes and knowledge graphs appeared first on Data Science Central.  ( 5 min )
  • Open

    LingHacks IV!!!
    Signups for LingHacks IV are now open! What: LingHacks is the world’s first high school computational linguistics hackathon (24-hour invention competition), where you come together in teams to build a software or hardware project that solves a scientific or social problem. No experience is needed, and we'll provide exciting swag, cool prizes, workshops, and mentorship for you to gain skills in computer science, artificial intelligence, machine learning, and the exciting field of natural language processing. We’re proud to announce that this hackathon is completely virtual! Attendees: sign up here! When & Where: LingHacks IV will take place from June 18th, 2022 at 9 am CST to June 21st, 2022 VIRTUALLY! Thanks to our sponsors, the event is COMPLETELY FREE! Learn more about the hackathon on our website here! submitted by /u/Broad_Way_554 [link] [comments]  ( 1 min )
    Hack3: The Leading Online Hackathon for High Schoolers!
    ​ https://preview.redd.it/o5dg9n5mhu391.png?width=1270&format=png&auto=webp&s=0e0cddf11ae6e6955da5cf44403e877dec33eb5e Attention to curious high schoolers! Hack3 is hosting an online hackathon for high schoolers for 24 hours on June 25-26. In 2020, we connected nearly 300 students of all skill levels, to learn to build innovative projects that positively impacted the world. Over 100 attended our free classes led by industry professionals to learn new skills . Over twenty mentors were in our help desk to help participants when they needed help. Last year, our judges, mentors, and workshop instructors were affiliated with the likes of Stanford, Harvard, Amazon, NetApp, and Wikipedia. In 2021, we connected over 350 students of all skill levels, to learn to build innovative projects that positively impacted the world. Over 150 attended our free classes led by industry professionals to learn new skills. Over 30 mentors were in our help desk to help participants when they needed help. Last year, our judges, mentors, and workshop instructors were affiliated with the likes of Amazon, NetApp, Balsamiq, Nexus Bytes, Replit, Postman, and Wolfram Language. This year, with the lessons learned from 2021, we aim to host a competition consisting of over 500 participants, while targeting the underprivileged communities around the world. To help achieve our goal of providing a learning opportunity for everyone, we will be sponsoring internet access to those who need it to truly level the playing field for all. Are you down? Register on DevPost here. submitted by /u/thegreatestgemini [link] [comments]  ( 1 min )
    A new study from Deepmind has found that Transformers can achieve few-shot learning without being explicitly trained for it. The research shows that FSL emerges only when the training data is distributed in particular ways that are also observed in natural domains like language.
    The capacity of large transformer-based language models to do few-shot learning is intriguing. These models can generalize from a few samples of a new topic that they haven’t been trained on before. Previous research in the field of meta-learning has shown how neural networks can execute few-shot learning from a few examples without the requirement for weight updates – this is also known as in-context learning because the output is conditioned on the context. To do this, the Deepmind researchers created a training program that specifically encourages in-context learning, a technique known as meta-training. The capacity for in-context learning in transformer language models, on the other hand, is emergent. Few-shot learning isn’t directly addressed in the model’s transformer architecture or learning objective. The idea was inspired by the observation that many natural data sources, including natural language, differ from typical supervised datasets in a few significant traits. Natural data, for example, is ‘bursty’ in terms of time. That is, a given entity (word, person, item, etc.) tends to appear in clusters rather than being distributed uniformly across time. These results give insight into why FSL is seen in large language models, and how we might achieve emergent FSL in other domains! Continue reading | Check out the paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Magic Battle Between Two Wizards (V2) - AI Experimental Story w/ GPT-3 [4K 60 FPS]
    submitted by /u/MLInsights [link] [comments]  ( 1 min )
    Scientific term papers: use GPT-3 as inspiration, to reformulate texts, and to make connections between scientific thoughts.
    I still have to write some scientific term papers. For this I would like to use artificial intelligence: as inspiration, to reformulate texts, and to make connections between scientific thoughts. Some services charge 60-100 $ per month. I would like to implement the whole thing for free. Okay, I can invest a little. Does anyone know a good workflow to make these three points happen? Who can inspire me in my approach? submitted by /u/Apfelbluetenstecher [link] [comments]  ( 2 min )
    How to make an ai generate a random sentence
    I’ve seen things where ai can generate things like pictures, and scripts, after being shown other pictures and scripts. I want to learn how this works, but I’d like to make something simple. I’d like to make an ai that can generate sentences, based off of the sentences it has been given. How would I go about doing this? submitted by /u/confusionPrice [link] [comments]  ( 2 min )
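    One of the simplest ways to generate sentences from example sentences is a word-level Markov chain: record which words follow which, then walk those records at random. The sketch below is an illustration under that assumption, not how large image or script generators work, and the function names are made up for this example.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    starts = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        starts.append(words[0])  # remember how sentences begin
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain, starts


def generate(chain, starts, max_words=12, seed=None):
    """Start from a known first word and repeatedly pick a random observed successor."""
    rng = random.Random(seed)
    word = rng.choice(starts)
    output = [word]
    # Stop at the length limit or when the current word has no recorded successor.
    while len(output) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)
```

    Given a handful of training sentences, `generate(*build_chain(sentences))` returns a new sentence in which every adjacent word pair was seen somewhere in the input. Longer contexts (pairs or triples of words as keys) tend to give more fluent output at the cost of more copying.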
    A Drop of Water - .. And it Sparkles in The Sunlight! [4K 60 FPS] AI Latent-Space Experiment
    submitted by /u/MLInsights [link] [comments]
    5+ Best Computer Vision Courses to know 2022 | Learn Computer Vision
    submitted by /u/Lakshmireddys [link] [comments]
    Snowflake: Each One Is Unique! [4K 60 FPS] AI ART
    submitted by /u/MLInsights [link] [comments]
  • Open

    [D] Semantic Hashing: are there any learning resources about it?
    Is there any book or tutorial paper about this technique that also shows details on how to implement it? I've been diving into the original paper (https://www.cs.utoronto.ca/~rsalakhu/papers/semantic_final.pdf) but it is not only too dense to understand, but also doesn't show any concern about implementation. Ps: tried youtube too, but there's so little about this technique. submitted by /u/SimpatoYamasaki [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 139
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks: Week 1 through Week 138. Most upvoted papers two weeks ago: /u/master3243: https://imagen.research.google/ Besides that, there are no rules, have fun. submitted by /u/ML_WAYR_bot [link] [comments]  ( 1 min )
    [D] Data Science with Anaconda on 50% off today!
    submitted by /u/alimhabidi [link] [comments]
    Recent grad looking for career advice [discussion]
    Just recently graduated university and am looking to get into the AI field. I’ve had a few AI courses and have fairly basic coding skills. Finding it almost impossible to get any type of paid job related to the field. Wondering if anyone has advice on what steps I should take to increase my chances of getting a job. Thinking about doing some certifications, boot camps or finishing up my masters. submitted by /u/NoMixs [link] [comments]  ( 1 min )
    Meta Gradient Descent [D]
    I am using a modified gradient descent algorithm that I invented, but it is very simple and I assume someone has already come up with it. I call it meta gradient descent because it adjusts the learning rate as it goes. It is the same as normal gradient descent except for one condition at the end of the loop: if the algorithm passes a local minimum/maximum it decreases the learning rate, else it increases the learning rate: if (slope < 0 && prevSlope > 0) || (slope > 0 && prevSlope < 0) { /* algorithm passed the function min/max */ learningRate /= 2 } else { learningRate *= 1.05 } This helps when the algo has to minimize/maximize functions with massively different scales where a single constant learning rate would be detrimental. I also suspect that it is faster at homing in on the min/max in general, but I have not done any studies on it. Which is why I am here asking... Is anyone already familiar with this algorithm? Does it have a name? What are the pros/cons of using it instead of regular gradient descent? I am always using this algorithm instead of regular gradient descent and I have no problems at all, but my programs are very simple machine learning wise so I suspect I am missing something. I love u <3 submitted by /u/iFARTONMEN [link] [comments]  ( 2 min )
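    The rule in the post (halve the learning rate when the slope changes sign, otherwise grow it by 5%) resembles sign-based step-size adaptation such as Rprop. A minimal one-dimensional sketch follows; the function name, default values and the quadratic used to try it out are assumptions for illustration, not the poster's code.

```python
def adaptive_gradient_descent(grad, x0, lr=0.1, steps=200, shrink=0.5, grow=1.05):
    """1-D gradient descent whose learning rate reacts to slope sign changes.

    grad:   function returning the slope at x
    shrink: factor applied to lr after the slope flips sign (a minimum was passed)
    grow:   factor applied to lr otherwise
    """
    x = x0
    prev_slope = 0.0
    for _ in range(steps):
        slope = grad(x)
        x -= lr * slope  # ordinary gradient step
        if slope * prev_slope < 0:
            # Opposite signs: the previous step jumped over the minimum.
            lr *= shrink
        else:
            lr *= grow
        prev_slope = slope
    return x, lr
```

    On a simple quadratic such as f(x) = (x - 3)^2, with gradient 2(x - 3), this settles near x = 3. One known trade-off of this family of rules is that a growing learning rate can overshoot on noisy or stochastic gradients, where a sign flip no longer reliably means a minimum was passed.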
    [D] Two quick questions about CNNs
    My goal is to detect an object in an image. The image is 960x720 and has 3 color channels. The object can be as small as 20 pixels. Does it make any sense to provide a learning model with additional channels that can be derived from the original image itself? For example let's say I provide the 960x720x3 image, but add an additional channel that's just the negative, or grayscale of the same image? I presume not, because the grayscale image would just be a linear combination of the 3 color channels, right? Or perhaps there's some significant computational advantage to that? Does it make sense to cut the image up into a bunch of (50%) overlapping squares and then feed these smaller images to my CNN? It's unlikely that the object will ever be larger than one of these squares, so that could make sense, right? It should be computationally much more efficient, because the CNN compute would be much lower, even despite the overhead of processing more (but smaller) squares. Thank you submitted by /u/tmuxed [link] [comments]  ( 1 min )
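    The tiling idea in the second question can be checked with a little arithmetic. The sketch below enumerates 50%-overlapping square tiles for a 960x720 image; the tile size of 240 pixels is an assumption chosen for illustration, not something stated in the post.

```python
def tile_origins(length, tile, stride):
    """Start offsets of tiles along one axis, with a final tile flush to the edge."""
    if tile >= length:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] != length - tile:
        # Cover the remainder when the stride does not divide evenly.
        origins.append(length - tile)
    return origins


def overlapping_tiles(width, height, tile=240, overlap=0.5):
    """Return (left, top, right, bottom) boxes covering the image with overlap."""
    stride = max(1, int(tile * (1 - overlap)))
    xs = tile_origins(width, tile, stride)
    ys = tile_origins(height, tile, stride)
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in ys
        for x in xs
    ]
```

    With these assumptions, `overlapping_tiles(960, 720)` yields 7 x 5 = 35 tiles, which together contain 35 x 240 x 240 = 2,016,000 pixels against 691,200 in the full image. For a convolutional network whose cost scales roughly with the number of input pixels, that is close to three times more work, not less, so overlapping tiles mainly help with memory limits or small-object resolution. On the first question, a grayscale or negative channel is indeed a linear function of the RGB channels, so a first convolutional layer can in principle learn it from the original three channels.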
    [P] Using OpenAI's CLIP repository as a support, I was able to create a software to detect anything in an image at its original resolution!
    submitted by /u/blevlabs [link] [comments]  ( 1 min )
    [D] What do you do when you are stuck on an ML problem?
    If you work on an ML problem that you know is solvable, but you are not able to solve it - what do you do? submitted by /u/keremidk0 [link] [comments]  ( 1 min )
    [D] Categorical sequence prediction for maximum positive impact
    Dataset 1: Contact records (call, email, face-to-face meeting) with timestamps between doctors and medical representatives, based on brands and country and some other details. Contains around 60000 records. Dataset 2: Overall impact for each individual medical representative (continuous variable, also has negative values). Contains around 160 records, one unique record per medical rep. From these two datasets, I want to predict the next category in a contact sequence with the time gap between them (i.e., if the 1st contact method is call or email, what should be the optimal second method and after how long should it be done?) for the medical representative to have maximum positive impact. One rep can contact multiple doctors n number of times with n different methods. How should I frame this problem? What type of modeling can help me achieve higher than 90% accuracy? Thanks for the help in advance. submitted by /u/Aneervan [link] [comments]  ( 1 min )
    [R] Objective measurement of political bias using machine learning
    Sociologists frequently attempt to compare the degree of bias among left and right-wing voters. In a typical study setting, participants are asked a number of factual knowledge questions and the responses are compared to the correct answers. Then, the side whose partisans make the largest errors is declared to be the most biased. Unfortunately, this approach is extremely vulnerable to biases of the researchers themselves. Consciously or subconsciously, the researchers are likely to select questions whose answers favor their own side. For example, on the topic of Obama’s economic policies, a left-leaning researcher might prefer questions, such as “Did the unemployment rate decline under the Obama administration?” (correct answer: “yes”). In contrast, a right-leaning researcher working on the same topic, is more likely to ask “Did the adult employment rate rise under the Obama administration?” (correct answer: “no”). In both cases the study participants whose biases align with the biases of the researchers would be much more likely to answer the questions correctly. To address this problem, we built an SVD-based algorithm that measures partisan bias in question selection and corrects testing results based on the measurements. The algorithm’s accuracy improves with increasing data size, so if you have a few minutes to spare, please help our project by taking any of these tests: US: General politics US: Economic policies Environmental policies (Note: Participants are not expected to know the correct answers to all of the questions, but to use their intuition to give their best guess.) submitted by /u/omniverse71 [link] [comments]  ( 2 min )
    [R] It’s wild to see an AI literally eyeballing raytracing based on 100 photos to create a 3d scene you can step inside ☀️ Low key getting addicted to NeRF-ing imagery datasets🤩
    submitted by /u/imaginfinity [link] [comments]  ( 2 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [D] Connecting musical and visual latent spaces in a "harmonic way"
    Hello everyone. I am currently working on an interdisciplinary project regarding the analysis and visualization of music. My initial idea was to visualize music in a "cool way" (psychedelic animations, etc.) using ML. The visuals should somehow represent the structure of the music, e.g. melody, beat, etc. My basic idea is to have an encoder, which brings the music into a latent representation, and a generative decoder that produces images out of the latent code at a given time. Now, there isn't much paired musical-visual training data available, so my thinking was to somehow connect a pre-trained music encoder with a visual decoder. Also, to connect two modalities like this in an unsupervised way seems like a really interesting research area. My problem is that I don't really know how to achieve this. I would somehow need to connect the two latent spaces with each other in a "natural" and "harmonic" way. I was thinking about stuff like finding the least required "complexity" to go from one distribution to another and I have indeed found some papers regarding that. There is also stuff like CycleGAN, which trains this mapping in an end-to-end way. I just wanted to ask about this here, because honestly it seems like I could go in various directions with this and it is a bit overwhelming. I'd appreciate any kind of help, ideas, recommendations, discussions and any kind of interest. Thanks! Btw, I have already successfully trained a music encoder in a self-supervised fashion with a relatively large amount of data (~2000 hours) submitted by /u/Tomsen1410 [link] [comments]  ( 2 min )
    [R] GatedTabTransformer: State-of-the-art for tabular classification
    Check out: https://arxiv.org/abs/2201.00199 Code: https://github.com/radi-cho/GatedTabTransformer Abstract: Some of the most common machine learning pipelines involve manipulation of tabular data. The current state-of-the-art solution for tabular modeling is the TabTransformer by Amazon from 2020. It incorporates a Transformer block to track relationships between categorical features and makes use of a standard multilayer perceptron to output its final logits. We propose modifications outperforming it on binary classification tasks for three benchmark datasets with more than 1% AUROC gains. We process categorical embeddings with an attention mechanism and then concatenate them with continuous values to be fed through multiple layers of gated MLP - a neural network originally introduced for language tasks. submitted by /u/radi-cho [link] [comments]  ( 1 min )
    [N][D][R]Alleged plagiarism of a paper in EMNLP2020 by the best paper in NAACL2021
    Hi, everyone. I found that a paper in EMNLP2020 is plagiarized by the best paper in NAACL2021. EMNLP2020: Visually Grounded Compound PCFGs http://aclanthology.lst.uni-saarland.de/2020.emnlp-main.354.pdf https://preview.redd.it/0qr7v4mwnp391.png?width=865&format=png&auto=webp&s=5480ef6eb684670c63c000ff6fa9fb64386fb1cc NAACL2021: Video-aided Unsupervised Grammar Induction https://aclanthology.org/2021.naacl-main.119.pdf https://preview.redd.it/528ipfdxnp391.png?width=865&format=png&auto=webp&s=1c1052381966c2d4b7674dfee6d26ad7333a73a0 Almost the same model with different input. The same formulas and contents. Similar experiments and claims. The same core component in implementation: in the public code of naacl2021, vpcfg is copied from emnlp2020. Generally speaking, the paper in naacl2021 shares the same method and task with the paper in emnlp2020; the only difference is the input, i.e., text+image (emnlp2020) and video (naacl2021). Amazing. submitted by /u/NiM-HLT [link] [comments]  ( 2 min )

  • Open

    Inverse tetrahedral numbers
    The previous post looked at the tetrahedral numbers: 1, 4, 10, 20, 35, … We could invert the process of creating tetrahedral numbers and ask for what n is a given number the nth tetrahedral number. So the inverse of 1 is 1, the inverse of 4 is 2, the inverse of 10 is 3, etc. […] Inverse tetrahedral numbers first appeared on John D. Cook.  ( 2 min )
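    The inversion described here can be sketched using the closed form T(n) = n(n+1)(n+2)/6. Since 6T(n) = (n+1)^3 - (n+1), the cube root of 6t lies just below n+1, so checking a few integers around that estimate is enough. The function names below are chosen for this illustration and are not taken from the post.

```python
def tetrahedral(n):
    """The nth tetrahedral number, n(n+1)(n+2)/6."""
    return n * (n + 1) * (n + 2) // 6


def inverse_tetrahedral(t):
    """Return n with tetrahedral(n) == t, or None if t is not tetrahedral."""
    if t < 1:
        return None
    estimate = round((6 * t) ** (1 / 3))
    # The estimate can be off by one, so test its neighbours exactly.
    for n in range(max(estimate - 2, 1), estimate + 2):
        if tetrahedral(n) == t:
            return n
    return None
```

    The exact integer check at the end means floating-point error in the cube root only affects where the search starts, not the answer, for moderately sized inputs.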
    General tetrahedral numbers
    Start with a list of ones: 1, 1, 1, 1, 1, … Taking the partial sums of this sequence gives consecutive numbers. That is, the nth number of the new series is the sum of the first n terms of the previous series. 1, 2, 3, 4, 5, … If we take partial sums again, […] General tetrahedral numbers first appeared on John D. Cook.  ( 1 min )
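    The repeated partial sums can be reproduced in a few lines with `itertools.accumulate` from the Python standard library. This is a small sketch of the construction described in the excerpt; the helper name is made up for the example.

```python
from itertools import accumulate


def iterated_partial_sums(times, length):
    """Start from `length` ones and take partial sums `times` times."""
    sequence = [1] * length
    for _ in range(times):
        sequence = list(accumulate(sequence))
    return sequence
```

    One round of partial sums gives the consecutive integers, a second round gives the triangular numbers 1, 3, 6, 10, 15, and a third gives the tetrahedral numbers 1, 4, 10, 20, 35.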
  • Open

    Nvidia AI Supercomputer Creates 100,000 Brain Images | Nvidia Omniverse Simulates Nuclear Fusion Reactor | AI Speeds Up Stroke Diagnosis & Treatment | Robot Hand To Automate Apple Picking
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    My neural network from scratch is finally doing something :)
    submitted by /u/-i-hate-this-place- [link] [comments]  ( 1 min )
  • Open

    Nvidia AI Supercomputer Creates 100,000 Brain Images | Nvidia Omniverse Simulates Nuclear Fusion Reactor | AI Speeds Up Stroke Diagnosis & Treatment | Robot Hand To Automate Apple Picking
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    Prism: Refract or Disperse a Beam of Light - [4K 60 FPS] Neural-Art Visualization
    submitted by /u/MLInsights [link] [comments]
    Amazon AI Researchers Propose A New Model, Called RescoreBERT, That Trains A BERT Rescoring Model With Discriminative Objective Functions And Improves ASR Rescoring
    👉 While BERT trained with MLM distillation can improve WER by 3%-6% relative to LSTM, RescoreBERT, trained with a discriminative objective, can improve it by 7%-13% on the same test sets. The RescoreBERT model’s key component is a technique called rescoring. Thanks to rescoring, a second-pass language model, even one trained from scratch on a small quantity of data, can prioritize and accurately rerank hypotheses containing rare words. Amazon’s prior work has been integrated to lower the computational expense of computing PLL scores. This is accomplished by feeding the output of the BERT model through a neural network trained to mimic the PLL scores produced by a larger “teacher” model. Because the distilled model is trained to match the teacher’s predictions of masked inputs, this process is known as MLM (masked language model) distillation. The distilled model’s output is interpolated with the original score to obtain a final score. This method reduces latency by distilling PLL scores from a big BERT model into a much smaller BERT model. Continue reading | Check out the paper https://preview.redd.it/scrvvdqc8m391.png?width=1548&format=png&auto=webp&s=0c5e49ebbdd4b49738870480376c21f5a3085ca4 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
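    The interpolation step mentioned above can be illustrated with a toy n-best reranker. This sketch assumes simple linear interpolation with a hand-picked weight and made-up scores; the summary does not state the actual formula, weight or score scale used by RescoreBERT.

```python
def rescore(hypotheses, weight=0.5):
    """Rerank ASR hypotheses by first-pass score plus a weighted second-pass score.

    hypotheses: list of (text, first_pass_score, second_pass_score) tuples,
                where higher scores are better (for example log-probabilities).
    weight:     interpolation weight for the second-pass score (an assumption).
    """
    ranked = sorted(
        hypotheses,
        key=lambda h: h[1] + weight * h[2],
        reverse=True,
    )
    return [text for text, _, _ in ranked]
```

    With a weight of zero the first-pass ranking is returned unchanged; raising the weight lets the second-pass model overturn a first-pass preference when it scores the alternative much higher.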
    What does a Tesla car see? Tesla Autopilot Explained in 10 Minutes
    submitted by /u/OnlyProggingForFun [link] [comments]
    What are cool things I can do with AI?
    What could be funny when I mix it with AI? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    6 Best Artificial Intelligence courses for Healthcare You should learn 2022
    submitted by /u/Lakshmireddys [link] [comments]
    Seastar: Creature of The Sea (V1: W) - [4K 60 FPS] AI Generative Art Experiment
    submitted by /u/MLInsights [link] [comments]
    Iterative launches machine learning engineering management (MLEM) tool - to bridge the gap between ML engineers and DevOps teams
    submitted by /u/cmstrump [link] [comments]
    Salesforce AI Research Propose ‘ALPRO’: A New Video-And-Language Representation Learning (Pre-Training) Framework
    Salesforce AI Research has proposed a new video-and-language representation learning framework called ALPRO. This framework can be used for pre-training models to achieve state-of-the-art performance on tasks such as video-text retrieval and question answering. ALPRO follows the “pre-training-then-fine-tuning” paradigm used by earlier VLP techniques but overcomes their drawbacks. The approach runs on sparsely sampled video frames and achieves more efficient cross-modal alignment without explicit object detectors. The ultimate objective is to improve performance on downstream tasks such as video-text retrieval and video question answering (video QA). The improved pre-training technique proposed in ALPRO yields better video-language representations, which in turn improve downstream performance. Continue reading | Check out the paper and github submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI in Australia?
    Which companies are doing leading AI in Australia? I'm a tech pm aiming to get involved with more AI/ML wherever I can, and have only found very few companies here that have anything like significant capabilities. Any pointers and tips sincerely appreciated. submitted by /u/Rufawana [link] [comments]  ( 1 min )
    Magic Battle Between Two Wizards - by GPT-3 Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
  • Open

    [R] Joint Abductive and Inductive Neural Logical Reasoning
    Paper: https://arxiv.org/abs/2205.14591 Abstract: " Neural logical reasoning (NLR) is a fundamental task in knowledge discovery and artificial intelligence. NLR aims at answering multi-hop queries with logical operations on structured knowledge bases based on distributed representations of queries and answers. While previous neural logical reasoners can give specific entity-level answers, i.e., perform inductive reasoning from the perspective of logic theory, they are not able to provide descriptive concept-level answers, i.e., perform abductive reasoning, where each concept is a summary of a set of entities. In particular, the abductive reasoning task attempts to infer the explanations of each query with descriptive concepts, which make answers comprehensible to users and is of great us…  ( 1 min )
    [D] Where do we currently stand in lottery ticket hypothesis research?
    What is the most recent research around the lottery ticket hypothesis? Which are the best papers with new techniques for finding winning tickets, and are there open-source tools that "work"? Does anyone know of easily digestible resources for getting started with LTH? submitted by /u/sid_276 [link] [comments]  ( 1 min )
    [D] Deploying SOTA models into my own projects
    What is the most common approach for using networks developed by other researchers? Until now I have been using huggingface's pipeline to deploy pre-trained models. But in some other cases, the only available implementation is the github repo published by the authors, where the code simply verifies that the results in the paper are reproducible, for example this https://github.com/Turoad/CLRNet. The main aspect of this case is that there's no inference pipeline, but rather train/test functions with the benchmark dataset as input. How do you deploy their model? Do you copy the architecture layers into your own torch/tensorflow code and train it following their parameters, or do you tweak their repository code? submitted by /u/LanverYT [link] [comments]  ( 1 min )
    [P] Information Retrieval Explainability Summer Project
    hey y'all! we're starting a summer project in collaboration between our community and some academic partners to create an open source library for explaining typical semantic search approaches (like vector search). There is also some chance that we might end up publishing some papers depending on what kind of results we get and how far we can push it. you all are welcome to join. most of the work will be done async but there will be weekly meetings to discuss questions and plans. there are some pre-reqs but we are welcoming anyone who is ambitious and interested enough to contribute meaningfully. more details: https://community.ai.science/explainable-information-retrieval-xir info session: https://www.eventbrite.ca/e/explainable-information-retrieval-xir-tickets-350540755837 some starter material: https://ai.science/l/236a6202-3495-4a8e-bbad-aedeee4bd21d let me know if you have any questions, or suggestions for a project like this submitted by /u/tdls_to [link] [comments]  ( 1 min )
    [D] Imbalance: Metric to Loss functions
    In the case of class imbalance, it looks like the main suggestion is to start with a clear problem-specific metric, to make sure one is solving the correct problem. However, this means the optimizer is not affected at all and will remain affected by class imbalance. The cost function is not adjusted. Metrics can at best affect the decision threshold, a single-parameter lever. Is this sufficient? What would be better? submitted by /u/darn321 [link] [comments]  ( 2 min )
    [D] latent space: task specific?
    Usually, latent space is just feature space compression. This can be done with PCA, autoencoders etc. This approach is task/target independent. One can also learn task dependent latent space representations. In a vanilla fully connected NN, the last layer before the target prediction layer is probably one example. Is this a correct understanding? Are there different terminologies to differentiate task specific vs label free latent space? submitted by /u/darn321 [link] [comments]  ( 1 min )
    [D] Parameter optimisation as a language problem?
    Hello! I am thinking of an idea for research on the topic of parameter optimisation viewed as a language problem. Here is what I mean by that - There are already multiple big pre-trained language models such as CodeBERT which can generate good contextual embeddings for source code. So if they're used as a baseline and built upon, we can create a supervised learning pipeline that predicts code parameters which satisfy desired outcomes. For example if we have the function def f(x): return 2 + 2 * x - x*x we can ask the model to maximise it and to find that the desired x is 1. At the beginning we expect to be able to solve such simple optimisation problems, but with time we may derive methods which are able to solve for more parameters and complicated functions and probably even have such a m…  ( 3 min )
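    The toy objective in the post above can be checked with a classical numeric search, which also shows the kind of ground-truth label a supervised pipeline like the one proposed would need. This is only a sketch: `golden_section_maximize` and the search interval are hypothetical choices for illustration, not part of the post's proposal, and no language model is involved.

```python
def f(x):
    # The example function from the post.
    return 2 + 2 * x - x * x


def golden_section_maximize(func, lo, hi, tol=1e-8):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (5 ** 0.5 - 1) / 2  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if func(c) > func(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


# The interval [-10, 10] is an assumption made for this illustration.
x_star = golden_section_maximize(f, -10.0, 10.0)
print(f"argmax is approximately {x_star:.4f}, f(argmax) is approximately {f(x_star):.4f}")
```

    The search agrees with the analytic answer: f'(x) = 2 - 2x vanishes at x = 1, where f(1) = 3.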
    [Discussion] Influence of the number of classes on the performance of triplet loss
    Does the number of classes need to be huge (1000+) in order to get good performance with triplet loss? I am experimenting with a dataset which has 15 classes and 200 examples for each class. I kept a few classes for the test set and trained on the rest by constructing triplets. The loss drops steadily on the train set but not so much on the test set, so the model overfits. I am wondering whether the reason for the bad performance is the low number of classes, which enables the model to "remember" the classes in the train set. Also, does the batch size play a big role here? Should the batch size be low so the model doesn't see all the train classes in a batch? submitted by /u/user89320 [link] [comments]  ( 1 min )
    [D] Universities with research in AI/ML for Music?
    Does anybody know of any universities doing research in AI for music? I know Queen Mary University of London appears to have some programs, but I haven't seen any other universities with similar initiatives, and it seems like every paper I see in that area is either from DeepMind or Magenta. submitted by /u/Redplatypus14 [link] [comments]  ( 1 min )
    [N] Stanford's Machine Learning; End of an era...
    After 10 years and nearly 5 million enrollments, Stanford will be closing new enrollments for the Machine Learning course on Coursera from June 14, 2022. It will be replaced by a more in-depth Machine Learning Specialization by Stanford Online and Deeplearning.ai and will be available in June. The most iconic MOOC to ever exist? submitted by /u/pro_user_for_good [link] [comments]  ( 2 min )
  • Open

    task oriented dialogue systems
    What different models are being used for task-oriented dialogue systems? How does end-to-end optimization differ from modularized optimization? submitted by /u/Western-Age3148 [link] [comments]
    Question about Policy functions
    Consider that I have trained a reinforcement learning model and now want to make a prediction using this model, which uses Continuous Control to predict two numbers, say, X and Y. Does the policy function return the values of X and Y that yield the highest possible reward from the Q function given the particular observation, over all possible values of X and Y? Or do different implementations use different policy functions? I am mostly interested in CQL, IQL and CRR, for example. submitted by /u/uom_questions [link] [comments]  ( 1 min )

  • Open

    [R] Deep Learning Opacity in Scientific Discovery
    This paper argues that the uninterpretability of deep neural networks need not diminish AI's capacity to lead scientists to significant and justifiable breakthroughs. https://arxiv.org/abs/2206.00520 submitted by /u/learning_by_looking [link] [comments]
    [D] Titan V vs. RTX 3090 for deep learning (2022)
    Hi guys, I was wondering which GPU is better for DL and which features you need to look for when buying GPUs for DL purposes. I guess Titan V might be better than RTX 3090 because it is far more expensive (why would NVIDIA sell a worse GPU at a higher price), but I may be wrong (?). submitted by /u/apssg96 [link] [comments]  ( 1 min )
    [D] We need a distributed platform to train and use huge free machine learning models.
    Free machine learning models like gpt-j-6b and gpt-neox are very small compared with gopher and gato. The only way we are going to get a comparable free model is to train and run it in a distributed way, and for that we need a distributed platform that makes it easier to train and run these models. submitted by /u/ConsistentSense4760 [link] [comments]  ( 1 min )
    [D] How to combine Linear Regression & NLP result into one score?
    Hi guys, sorry if this is a daft question, I'm still new to this. Say I wanted to analyse an email for legitimacy - one component would be using the metadata (e.g. the 'from' address, sender name, and several other header fields) as features for linear regression (scam or no scam). A second, important part is parsing the body text, checking for spelling errors and pushy scammer sentiment. How best would I go about combining these analyses with the overall end goal of classification of the email as 'scam' or 'no scam'? Cheers! submitted by /u/eddiewastaken [link] [comments]  ( 1 min )
    [D] Research papers/suggestions for x-ray defect analysis
    I have a LOT (several million) of unlabeled x-ray images of aluminium parts with small defects in them, such as air bubbles, cracks, or voids. There are thousands of images which are basically identical except for the defects (apart from slight variations in position, rotation, noise etc), and thousands of different view directions/parts. All image dimensions are identical. The problem is that I do not have a single image with labels and would hence have to label them manually, which I'd like to avoid as much as possible. I want to highlight/segment all the positions where defects occur: a segmentation into defect or no defect, a kind of probability map on the image. What kind of method would be best to do this without labels? I was thinking that the biggest entropy/change in a certain part/view direction would be the defect, since it will be in a random spot while everything else is basically identical. So I was thinking of using an autoencoder for a specific view and using the points of highest reconstruction error as a starting point for labeling. But this won't work universally. Maybe a contrastive learning method? submitted by /u/bluuerp [link] [comments]  ( 1 min )
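    The intuition in the post above (defects sit in random spots while everything else in a given view is nearly identical) can be illustrated with something much simpler than an autoencoder: a per-pixel median over images of the same view, with the absolute deviation from that median used as a rough defect map. This is a swapped-in baseline, not the autoencoder the post proposes. It assumes the images are already aligned, which the post says holds only approximately, and the tiny 2x3 "images" and the name `deviation_maps` are made up for illustration.

```python
from statistics import median


def deviation_maps(images):
    """Return, for each image, the per-pixel absolute deviation from the
    per-pixel median of all images of the same view."""
    height, width = len(images[0]), len(images[0][0])
    reference = [
        [median(img[r][c] for img in images) for c in range(width)]
        for r in range(height)
    ]
    return [
        [[abs(img[r][c] - reference[r][c]) for c in range(width)]
         for r in range(height)]
        for img in images
    ]


# Three hypothetical grayscale images of the same part and view.
images = [
    [[10, 10, 10], [10, 10, 10]],  # clean part
    [[10, 10, 10], [10, 3, 10]],   # dark spot standing in for a void
    [[10, 11, 10], [10, 10, 10]],  # mild sensor noise
]
maps = deviation_maps(images)
print(maps[1])  # the largest value marks the candidate defect pixel
```

    Pixels with a large deviation would be candidates for manual labeling. Noise produces small deviations, so a threshold would still have to be chosen per view.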
    [P] This is the worst AI ever. (GPT-4chan model, trained on 3.5 years worth of /pol/ posts)
    https://youtu.be/efPrtcLdcdM GPT-4chan was trained on over 3 years of posts from 4chan's "politically incorrect" (/pol/) board. Website (try the model here): https://gpt-4chan.com Model: https://huggingface.co/ykilcher/gpt-4chan Code: https://github.com/yk/gpt-4chan-public Dataset: https://zenodo.org/record/3606810#.YpjGgexByDU ​ OUTLINE: 0:00 - Intro 0:30 - Disclaimers 1:20 - Elon, Twitter, and the Seychelles 4:10 - How I trained a language model on 4chan posts 6:30 - How good is this model? 8:55 - Building a 4chan bot 11:00 - Something strange is happening 13:20 - How the bot got unmasked 15:15 - Here we go again 18:00 - Final thoughts submitted by /u/ykilcher [link] [comments]  ( 3 min )
    [D] Can we talk about how Jeff Dean casually spent a graduate student's annual take-home salary for 0.03% improvement on Cifar-10
    I just read https://www.reddit.com/r/MachineLearning/comments/uyratt/d_i_dont_really_trust_papers_out_of_top_labs/ and I am disappointed with the lack of pushback from researchers and students across the field against the encroachment of industry into research, and I would like to double down on the "lack of trust" point raised by u/MrAcurite. u/MrAcurite points out two things: One, the big number they cite as the success metric is 99.43 on CIFAR-10, against a SotA of 99.40, so woop-de-fucking-doo in the grand scheme of things. The sum total is 17,810 core-hours. Let's assume that for someone who doesn't work at Google, you'd have to use on-demand pricing of $3.22/hr. This means that these trained models cost $57,348. One of the authors replied in a comment: I would also contend that …  ( 4 min )
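    The cost figure quoted in the post above follows directly from the two numbers it gives. A small check of the arithmetic, using only values stated in the post (the variable names are illustrative):

```python
core_hours = 17_810       # total compute quoted in the post
price_per_hour = 3.22     # on-demand price in USD per hour, as assumed in the post
total_cost = core_hours * price_per_hour

sota_accuracy = 99.40     # prior state of the art on CIFAR-10, per the post
reported_accuracy = 99.43
gain = reported_accuracy - sota_accuracy

print(f"total cost: ${total_cost:,.0f}")
print(f"accuracy gain: {gain:.2f} percentage points")
```

    The gain is 0.03 percentage points of accuracy, which is the "0.03%" in the title.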
    [D] class imbalance: over/under sampling and class reweight
    If a dataset is unbalanced, what's the way to proceed? The canonical answer seems to be over/under sampling and class reweighting (is there anything more?), but have these things really worked in practice for you? What's the actual experience and practical suggestion? When should one be used over the other? submitted by /u/darn321 [link] [comments]  ( 3 min )
    [D] Good communities/newsletters/mailing lists/twitter accs for ML in healthcare/medical applications?
    Hi. I'm looking for communities/mailing lists/newsletters or even good Twitter accounts dedicated to ML in medical applications and healthcare. Any recommendations? In most other applications there seems to be a ton of good options for communities etc, but I honestly could not find one for these topics. submitted by /u/feryet [link] [comments]  ( 1 min )
    [P] Hands on diffusion models
    A minimal example of the forward and reverse flow of diffusion models with equations from the paper and visualizations alongside the code: https://github.com/InFoCusp/diffusion_models I coded it up since I wanted to familiarize myself with the end to end flow. It uses a simple 2d dataset that can train within minutes. Hope others on this subreddit find it useful. submitted by /u/optimistdit [link] [comments]  ( 1 min )
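    For readers who want the forward (noising) half of the flow at a glance, here is a minimal standard-library sketch. It assumes the standard DDPM formulation, in which x_t is drawn from N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), and a linear beta schedule from 1e-4 to 0.02 over 1000 steps. The linked repository may use different settings, and every function name below is hypothetical rather than taken from it.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: the running product of (1 - beta_s) for s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 without simulating the intermediate steps."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [scale * value + sigma * rng.gauss(0.0, 1.0) for value in x0]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

point = [1.5, -0.5]  # one sample from a hypothetical 2D toy dataset
slightly_noisy = q_sample(point, 0, alpha_bars, rng)
almost_pure_noise = q_sample(point, 999, alpha_bars, rng)
print(slightly_noisy, almost_pure_noise)
```

    Because alpha_bar_t shrinks towards zero as t grows, late samples carry almost no signal from x_0. That is why the reverse process can start from pure Gaussian noise.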
    [D] Should non-replicable, costly, exa-scale machine learning models coming out of industry be seen as products rather than research?
    View Poll submitted by /u/fromnighttilldawn [link] [comments]
  • Open

    Text to Image Question: Different images?
    Hi all, I've been experimenting with several text to image programs. If the prompt/seed is the same, and there isn't a "deviantart" or "trending on Artstation" modifier, shouldn't the image be the same? For example, I did this prompt twice on NightCafe and got two different images: "Mountain Lake with birds in the style of John Howe by Greg Rutkowski, Ted Nasmith, Daarken, Caspar David Friedrich, Louis Comfort Tiffany, John Stephens, Ivan Shishkin, and Albert Bierstadt" - weight: 1 "strange landscape detailed 8k resolution concept art digital illustration beautiful" - weight: 1 "noisy, dirty, unclear, Watermark, blur, blurry, bokeh, unbalanced undeveloped high contrast, soft edges" - weight: -1 submitted by /u/longtailwriting [link] [comments]  ( 1 min )
    Introducing LIHQ - High Quality Artificial Speaker (Open source in google colab)
    submitted by /u/johnGettings [link] [comments]
    AI versus corporate logos
    submitted by /u/magenta_placenta [link] [comments]
    What skills are important if I want to be a ML software engineer?
    Currently I'm a second year undergrad cs major. I want to pursue a job in applied artificial intelligence at a company like OpenAI or Tesla because I realized AI/ML actually excites me, unlike full stack app development, which has made up most of my side projects so far. So far I've trained and deployed a simple object detection neural network with Tensorflow. What other side projects/skills can I and people like me work on to be a more attractive hire when we go on the job market? submitted by /u/ValerianMoonRunner [link] [comments]  ( 1 min )
    Meet ‘Codeball’ – A Deep Learning-based Automated Code Reviewer That Will Help Maintainers Review Github Pull Requests
    Data in all forms, whether photos, music, or code, is being created on a large scale. Many GitHub code submissions require reviewers to read over the code and recommend modifications. Reviewing code requires a significant amount of time and effort on the developer’s part. Codeball fills this void. Codeball is a code review AI that approves Pull Requests that a human would have approved. The AI detects and accepts safe contributions, allowing reviewers to focus their efforts on the difficult ones. Shorter wait times save a significant amount of money during the review process. Metadata from over a million Pull Requests and thousands of different repositories was used to train Codeball. Codeball extracts features for each PR using its proprietary derivation technique, constructing the bigger context in which the PR was filed. Examples include how frequently and by whom the affected files were modified, the semantics of the diffs, and, of course, whether the Pull Request was accepted and merged without objections or further comments. As a prediction model, Codeball uses a deep learning model that has been trained on over 1M public and private contributions from different organizations. The model is a multi-layer perceptron classifier: its input layer takes hundreds of inputs, it has two hidden layers, and a single output estimates the chance of a Pull Request being approved. Each Pull Request has hundreds of indicators (the input layer), broadly classified into three types: basic, derived, and categorical. Continue reading | Check out the Github action and FAQs submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
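    The architecture described above (hundreds of inputs, two hidden layers, one output probability) can be sketched as a plain forward pass. Everything specific below is an assumption made for illustration: the layer sizes, the ReLU and sigmoid activations, the random weights, and the function names. None of it comes from Codeball itself, whose features and weights are proprietary.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: weights has one row per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def approval_probability(features, layers):
    """Forward pass: hidden layers with ReLU, then a single sigmoid output."""
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(hidden, weights, biases)[0])


rng = random.Random(0)
sizes = [200, 64, 32, 1]  # hypothetical: 200 indicators, two hidden layers, one output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
features = [rng.random() for _ in range(sizes[0])]  # stand-in for PR indicators
probability = approval_probability(features, layers)
print(f"estimated approval probability: {probability:.3f}")
```

    With untrained weights the output is meaningless. The sketch only shows how hundreds of per-PR indicators reduce to one number between 0 and 1.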
    Could an AI pull off the following nearly impossible thing?
    submitted by /u/BeginningInfluence55 [link] [comments]
    18 Mindblowing AI Art Images
    submitted by /u/kbf_ [link] [comments]
    Mythical Dragon - by GPT-3 Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
    Qualitative humanities research is crucial to AI · fast.ai
    submitted by /u/estasfuera [link] [comments]
    Researcher Says an Image Generating AI Invented Its Own Language
    submitted by /u/estasfuera [link] [comments]
    recruiting beta testers
    Based on artificial intelligence, BLOONY is a chatbot that can communicate and interact with you. BLOONY is NOT a chatbot that responds with predefined answers. After running beta testing programs, we have received a lot of feedback from BUBBLES. The feedback helped us to identify service improvement points and create a better BLOONY! By running the 3rd beta testing program, we hope to find issues that were not detected and receive your opinion again. We recommend BLOONY to those... - who want to study English by chatting with a friend - who have concerns that are difficult to tell their family and friends - who want to talk to someone early in the morning or late at night - who are looking for something new. Anyone who has a Facebook account and a smartphone can apply! Sign up now (Link in bio -> Click ADD TO FRIENDS LIST). Details - Number of Testers Needed: Limited to the first 000 applicants *Recruiting testers on a first-come first-served basis (FCFS) - Who You Are: Anyone who is at least 14 years old and has a Facebook account and Facebook Messenger mobile app - Required Activity: Use of the beta service and submission of feedback - How to Apply: Submission of a Google form. Notes - BLOONY can deliver information that is different from the facts. - Once you are invited as a beta tester, our team will reach out to you individually. submitted by /u/Necessary-Narwhal-57 [link] [comments]  ( 1 min )
    Microsoft and AWS Collaborate To Develop ‘PyWhy’: A New Github Home For ‘DoWhy’ (A Causal Machine Learning Library From Microsoft)
    As computing systems become more actively involved in societally essential areas such as healthcare, education, and government, it is crucial to accurately forecast and comprehend the causal effects of their interventions. Traditional machine learning algorithms based on pattern recognition and correlational analyses are insufficient for decision-making without an A/B test. To fill this gap, Microsoft researchers created a library that carries out causal inference analysis from start to finish, to help data scientists better understand and apply causal inference. They developed DoWhy in 2018. Since then, the library has been doing precisely that, cultivating a community committed to using causal inference principles in data science. “DoWhy” is a Python package that attempts to encourage causal thinking and analysis, much as machine learning libraries have done for prediction. DoWhy provides a four-step interface for causal inference that focuses on explicitly modeling causal assumptions and validating them as far as feasible. Traditional machine learning approaches aim to predict an outcome. Consider a public utility business that wants to reduce its customers’ water use through a marketing and incentives campaign. The success of a rewards program is difficult to assess since any drop in water consumption by participating consumers is confounded by their decision to engage in the program. Continue reading | Research Articles from Microsoft and Amazon submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    With so many new Text to Image "AI" emerging lately, is it not crazy to speculate about Text to Video?
    I imagine it would be a feat much more challenging than that of Text to Image; however, with a massive enough dataset and enough training, it surely would be possible one day in the future. The most I have seen at this point has been morphing VQGAN images and those DALL-E 2 slide show style animations. Has anyone seen anything interesting in this area pop up? I ask out of general curiosity, so excuse me if I have missed some glaring neural network. submitted by /u/BeginningRealistic49 [link] [comments]  ( 2 min )
    Chillies.
    submitted by /u/cookingandcraft [link] [comments]
    7 Best Natural Language Processing Courses (2022) | Best NLP Courses
    submitted by /u/Lakshmireddys [link] [comments]
    Origami Dragon - by GPT-3 Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
  • Open

    What is the best way for my agent to extract a query from the hidden states of its recurrent policy?
    I have an RL agent that has a recurrent policy. I would like the hidden states of the LSTM at timestep t to give the query for timestep t+1. I was thinking that through the step function, the agent could send the action and the hidden states of the LSTM to the environment, which could then at the next timestep send back the observation as well as the hidden states of the previous timestep. Would this make sense or is there a better way to do this? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    "You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments", Paster et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do transformers or very deep models "plan" ahead?
    I was watching this amazing lecture by Oriol Vinyals. On one slide, there is a question asking whether very deep models plan. Transformer models or models employed in applications like Dialogue Generation do not have a planning component but behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please point to a few such works? submitted by /u/yoctotoyotta [link] [comments]  ( 4 min )
    Question about SARSA and Q-Learning
    Hey everyone! I just finished coding my homework assignment and tried to run SARSA and Q-Learning on the Cliff problem (my university implemented this itself, so it is not the OpenAI gym environment). I was kind of surprised to see the resulting Q(s,a) functions from the SARSA algorithm. While Q-Learning quickly converged to the solution that makes sense to me intuitively, SARSA leads to some results that are, at least to me currently, weird. Sometimes, especially when using a small epsilon, the resulting values in the Q-table “seem” wrong to me. Q-Learning always comes up with a Q function showing the minimum number of steps it takes to reach the goal, for each action and state (for actions that lead to falling down the cliff, this number is the same as from the initial state -100). Is this just an artifact of the number of episodes, so that running infinitely often would lead to the same results? Or is this due to the max operator in Q-Learning, leading to more “stable” results? submitted by /u/Garbage-Shoddy [link] [comments]  ( 1 min )
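    The difference the question above asks about shows up in a single update step. Q-Learning bootstraps from the best next action (the max operator), while SARSA bootstraps from the action the behaviour policy actually takes next, which under epsilon-greedy is sometimes an exploratory step off the cliff. The sketch below uses made-up state names, Q-values, and step size; it is not the poster's environment.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=1.0):
    """Off-policy update: bootstrap from the greedy action in next_state."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=1.0):
    """On-policy update: bootstrap from the action actually taken next."""
    old = q.get((state, action), 0.0)
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = old + alpha * (target - old)


actions = ["right", "down"]
# Hypothetical values at a cell on the cliff edge: "right" is safe, "down" falls.
initial = {("edge", "right"): -1.0, ("edge", "down"): -100.0}

q_ql = dict(initial)
q_learning_update(q_ql, "start", "up", -1.0, "edge", actions)

q_sarsa = dict(initial)
# Suppose epsilon-greedy exploration picked "down" as the next action.
sarsa_update(q_sarsa, "start", "up", -1.0, "edge", "down")

print(q_ql[("start", "up")], q_sarsa[("start", "up")])
```

    Both updates start from the same table, yet SARSA's estimate for walking along the edge absorbs the cost of the exploratory fall. That is the textbook reason its Q-table reflects the epsilon-greedy policy rather than the shortest path, and it need not match Q-Learning's table no matter how many episodes are run with a fixed epsilon.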
    Requesting Help Using GNN With PyTorch Geometric from NetworkX. Sample data provided to recreate error.
    Hello all, I'm trying to get a system going for a GNN where an agent will move along a network of nodes and edges. In each state the agent has traveled to a new node, and the total distance traveled goes up. My problem is in getting the network loaded in from a networkx graph. Here's some code that reproduces the error: from torch_geometric.utils import from_networkx import networkx as nx nodes = [ (0, {'y': 37.3348363, 'x': -121.888113}), (1, {'y': 37.3353111, 'x': -121.887118}), (2, {'y': 37.3358288, 'x': -121.8860567}), ] edges = [ (0, 1, {'osmid': 358475012, 'oneway': False, 'highway': 'residential', 'length': 72.482, 'geometry': '', 'speed_kph': 25.0, 'bearing': 149.1}), (0, 2, {'osmid': [416909272, 680787590], 'oneway': False…  ( 2 min )
    "SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", Ahn et al 2022 {G} (language models powering robots)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    Train machine learning models using Amazon Keyspaces as a data source
    Many applications meant for industrial equipment maintenance, trade monitoring, fleet management, and route optimization are built using open-source Cassandra APIs and drivers to process data at high speeds and low latency. Managing Cassandra tables yourself can be time consuming and expensive. Amazon Keyspaces (for Apache Cassandra) lets you set up, secure, and scale Cassandra tables […]  ( 10 min )
    Improve organizational diversity, equity, and inclusion initiatives with Amazon Polly
    Organizational diversity, equity and inclusion (DEI) initiatives are at the forefront of companies across the globe. By constructing inclusive spaces with individuals from diverse backgrounds and experiences, businesses can better represent our mutual societal needs and deliver on objectives. In the article How Diversity Can Drive Innovation, Harvard Business Review states that companies that focus […]  ( 6 min )
    Use Serverless Inference to reduce testing costs in your MLOps pipelines
    Amazon SageMaker Serverless Inference is an inference option that enables you to easily deploy machine learning (ML) models for inference without having to configure or manage the underlying infrastructure. SageMaker Serverless Inference is ideal for applications with intermittent or unpredictable traffic. In this post, you’ll see how to use SageMaker Serverless Inference to reduce cost when […]  ( 5 min )
    Accelerate and improve recommender system training and predictions using Amazon SageMaker Feature Store
    Many companies must tackle the difficult use case of building a highly optimized recommender system. The challenge comes from processing large volumes of data to train and tune the model daily with new data and then make predictions based on user behavior during an active engagement. In this post, we show you how to use […]  ( 15 min )
    Translate, redact and analyze streaming data using SQL functions with Amazon Kinesis Data Analytics, Amazon Translate, and Amazon Comprehend
    You may have applications that generate streaming data that is full of records containing customer case notes, product reviews, and social media messages, in many languages. Your task is to identify the products that people are talking about, determine if they’re expressing positive or negative sentiment, translate their comments into a common language, and create […]  ( 15 min )
  • Open

    Visualizing C operator precedence
    Here’s an idea for visualizing C operator precedence. You snake your way through the diagram starting from left to right. Operators at the same precedence level are on the same horizontal level. Following the arrows for changing directions, you move from left-to-right through the operators that associate left-to-right and you move right-to-left through the operators […] Visualizing C operator precedence first appeared on John D. Cook.  ( 1 min )
  • Open

    AI versus corporate logos
    I recently started playing with DALL-E 2, which will attempt to generate an image to go with whatever text prompt you give it. Like its predecessor DALL-E, it uses CLIP, which OpenAI trained on a huge collection of internet images and nearby text. I've experimented with a few  ( 2 min )
    Bonus: More AI-generated logos
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Shopee — Price Match Guarantee
    Shopee is the leading e-commerce platform in Southeast Asia and Taiwan. Customers appreciate its easy, secure, and fast online shopping…  ( 7 min )
  • Open

    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]
  • Open

    How Infinitely Wide Neural Networks Benefit from Multi-task Learning -- an Exact Macroscopic Characterization. (arXiv:2112.15577v3 [cs.LG] UPDATED)
    In practice, multi-task learning (through learning features shared among tasks) is an essential property of deep neural networks (NNs). While infinite-width limits of NNs can provide a good intuition for their generalization behavior, the well-known infinite-width limits of NNs in the literature (e.g., neural tangent kernels) assume specific settings in which wide ReLU-NNs behave like shallow Gaussian Processes with a fixed kernel. Consequently, in such settings, these NNs lose their ability to benefit from multi-task learning in the infinite-width limit. In contrast, we prove that optimizing wide ReLU neural networks with at least one hidden layer using L2-regularization on the parameters enforces multi-task learning due to representation-learning - also in the limiting regime where the network width tends to infinity. We present an exact quantitative characterization of this infinite width limit in an appropriate function space that neatly describes multi-task learning.  ( 2 min )
    Masked Bayesian Neural Networks : Computation and Optimality. (arXiv:2206.00853v1 [stat.ML])
    As data size and computing power increase, the architectures of deep neural networks (DNNs) have grown larger and more complex, and thus there is a growing need to simplify such DNNs. In this paper, we propose a novel sparse Bayesian neural network (BNN) which searches for a good DNN with an appropriate complexity. We employ masking variables at each node which can turn off some nodes according to the posterior distribution to yield a nodewise sparse DNN. We devise a prior distribution such that the posterior distribution has theoretical optimalities (i.e. minimax optimality and adaptiveness), and develop an efficient MCMC algorithm. By analyzing several benchmark datasets, we illustrate that the proposed BNN performs well compared to other existing methods in the sense that it discovers well condensed DNN architectures with similar prediction accuracy and uncertainty quantification compared to large DNNs.
    Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization. (arXiv:2206.00743v1 [math.OC])
    Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balance the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
    A Fair Comparison of Two Popular Flat Minima Optimizers: Stochastic Weight Averaging vs. Sharpness-Aware Minimization. (arXiv:2202.00661v3 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low loss neighborhoods, have been shown to improve upon stochastic and adaptive gradient-based optimizers for training neural networks. Two methods have received significant attention due to their impressive generalization performance and scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them. Previous work mainly evaluated SWA and SAM on different architectures and datasets. We fill this gap here by comparing the loss surfaces of the models trained with each method and through a broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover a number of surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
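    As a rough illustration of the weight-averaging idea behind SWA, the sketch below keeps an equal-weight running average of weight vectors collected along a training trajectory. It is a minimal sketch, not the benchmark code from the paper: the function names `swa_update` and `swa_average` and the plain-list weight representation are assumptions made for illustration, and SAM is not covered.

```python
def swa_update(avg_weights, new_weights, num_averaged):
    """Fold one more weight vector into an equal-weight running average.

    num_averaged is how many vectors avg_weights already summarizes.
    """
    return [a + (w - a) / (num_averaged + 1)
            for a, w in zip(avg_weights, new_weights)]


def swa_average(checkpoints):
    """Average a list of weight vectors (hypothetical checkpoints) one at a time."""
    avg = list(checkpoints[0])
    for n, weights in enumerate(checkpoints[1:], start=1):
        avg = swa_update(avg, weights, n)
    return avg


if __name__ == "__main__":
    # Three toy "checkpoints" of a two-parameter model.
    print(swa_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

    The incremental form avoids storing every checkpoint, which is the usual reason to prefer it over summing and dividing at the end.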
    Retrospective Approximation for Smooth Stochastic Optimization. (arXiv:2103.04392v2 [math.OC] UPDATED)
    Stochastic Gradient (SG) is the de facto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective $f$ and a stochastic first-order oracle. SG's attractiveness is due in part to its simplicity of executing a single step along the negative subsampled gradient direction to update the incumbent iterate. In this paper, we question SG's choice of executing a single step as opposed to multiple steps between subsample updates. Our investigation leads naturally to generalizing SG into Retrospective Approximation (RA) where, during each iteration, a "deterministic solver" executes possibly multiple steps on a subsampled deterministic problem and stops when further solving is deemed unnecessary from the standpoint of statistical efficiency. RA thus rigorizes what is appealing for implementation -- during each iteration, "plug in" a solver, e.g., L-BFGS line search or Newton-CG, as is, and solve only to the extent necessary. We develop a complete theory using relative error of the observed gradients as the principal object, demonstrating that almost sure and $L_1$ consistency of RA are preserved under especially weak conditions when sample sizes are increased at appropriate rates. We also characterize the iteration and oracle complexity (for linear and sub-linear solvers) of RA, and identify a practical termination criterion leading to optimal complexity rates. To subsume non-convex $f$, we present a certain "random central limit theorem" that incorporates the effect of curvature across all first-order critical points, demonstrating that the asymptotic behavior is described by a certain mixture of normals. The message from our numerical experiments is that the ability of RA to incorporate existing second-order deterministic solvers in a strategic manner might be important from the standpoint of dispensing with hyper-parameter tuning.
    ZOOpt: Toolbox for Derivative-Free Optimization. (arXiv:1801.00329v3 [cs.LG] UPDATED)
    Recent advances in derivative-free optimization allow efficient approximation of the globally optimal solutions of sophisticated functions, such as non-differentiable and non-continuous functions and functions with many local optima. This article describes the ZOOpt (Zeroth Order Optimization) toolbox, which provides efficient derivative-free solvers and is designed to be easy to use. ZOOpt provides single-machine parallel optimization on top of a Python core, and multi-machine distributed optimization for time-consuming tasks through integration with the Ray framework, a well-known platform for building distributed applications. ZOOpt particularly focuses on optimization problems in machine learning, addressing high-dimensional and noisy problems such as hyper-parameter tuning and direct policy search. The toolbox is maintained with the aim of being a ready-to-use tool for real-world machine learning tasks.
    Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport. (arXiv:2006.12938v2 [cs.LG] UPDATED)
    The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a latent representation invariant between source and target domains, we exploit the diversity of source distributions by tuning their weights to the target task at hand. Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real-life datasets.
    Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm. (arXiv:2206.00833v1 [cs.LG])
    Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) to achieve provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of uniform approximation power of the actor neural network to achieve global optimality in policy optimization due to distributional shift.
    Bayesian Inference of Stochastic Dynamical Networks. (arXiv:2206.00858v1 [stat.ML])
    Network inference has been extensively studied in several fields, such as systems biology and social sciences. Learning network topology and internal dynamics is essential to understand mechanisms of complex systems. In particular, sparse topologies and stable dynamics are fundamental features of many real-world continuous-time networks. Given that usually only a subset of nodes can be observed, in this paper we consider linear continuous-time systems to describe networks, since they can model unmeasured nodes via transfer functions. Additionally, measurements tend to be noisy, with low and varying sampling frequencies. For this reason, we consider continuous-time (CT) models, since discrete-time approximations often require fine-grained measurements and uniform sampling steps. The developed method applies dynamical structure functions (DSFs) derived from linear stochastic differential equations (SDEs) to describe networks of measured nodes. Further, a numerical sampling method, preconditioned Crank-Nicolson (pCN), is used to refine coarse-grained trajectories to improve inference accuracy. Simulations conducted on random and ring networks and on a synthetic biological network illustrate that our method achieves state-of-the-art performance compared with group sparse Bayesian learning (GSBL), BINGO, kernel-based methods, dynGENIE3, GENIE3 and ARNI. In particular, these are challenging networks, suggesting that the developed method can be applied in a wide range of contexts.
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v2 [stat.ML] UPDATED)
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
    The Phenomenon of Policy Churn. (arXiv:2206.00730v1 [cs.LG])
    We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
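    One simple way to make "the greedy action changes in a large fraction of states" concrete is to compare the argmax action per state before and after an update. The sketch below does this for tabular action values; the names `greedy_actions` and `policy_churn` and the list-of-lists Q-table are assumptions for illustration, not the measurement code used in the paper.

```python
def greedy_actions(q_values):
    """Return the index of the highest-valued action for each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_values]


def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action differs between two value tables."""
    before = greedy_actions(q_before)
    after = greedy_actions(q_after)
    changed = sum(b != a for b, a in zip(before, after))
    return changed / len(before)


if __name__ == "__main__":
    # Toy action values for four states and two actions, before and after an update.
    q_before = [[1.0, 0.5], [0.2, 0.3], [0.9, 0.1], [0.4, 0.6]]
    q_after = [[0.9, 1.0], [0.2, 0.35], [0.8, 0.2], [0.7, 0.6]]
    print(policy_churn(q_before, q_after))
```

    Note that small value changes can flip the argmax when action values are close, which is why churn can be large even for modest updates.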
    Metrizing Fairness. (arXiv:2205.15049v2 [cs.LG] UPDATED)
    We study supervised learning problems for predicting properties of individuals who belong to one of two demographic groups, and we seek predictors that are fair according to statistical parity. This means that the distributions of the predictions within the two groups should be close with respect to the Kolmogorov distance, and fairness is achieved by penalizing the dissimilarity of these two distributions in the objective function of the learning problem. In this paper, we showcase conceptual and computational benefits of measuring unfairness with integral probability metrics (IPMs) other than the Kolmogorov distance. Conceptually, we show that the generator of any IPM can be interpreted as a family of utility functions and that unfairness with respect to this IPM arises if individuals in the two demographic groups have diverging expected utilities. We also prove that the unfairness-regularized prediction loss admits unbiased gradient estimators if unfairness is measured by the squared $\mathcal L^2$-distance or by a squared maximum mean discrepancy. In this case, the fair learning problem is susceptible to efficient stochastic gradient descent (SGD) algorithms. Numerical experiments on real data show that these SGD algorithms outperform state-of-the-art methods for fair learning in that they achieve superior accuracy-unfairness trade-offs -- sometimes orders of magnitude faster. Finally, we identify conditions under which statistical parity can improve prediction accuracy.
    Revisiting the General Identifiability Problem. (arXiv:2206.01081v1 [cs.LG])
    We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add positivity assumption of observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus and consequently the proposed algorithm in [Lee et al., 2019] are not sound. Moreover, adding the assumption will cause the completeness proof in [Lee et al., 2019] to fail. Under positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] through decomposing the general identifiability problem into a series of classical identifiability sub-problems.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v3 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Invertible Neural Networks for Graph Prediction. (arXiv:2206.01163v1 [stat.ML])
    In this work, we address conditional generation using deep invertible neural networks. This is a type of problem where one aims to infer the most probable inputs $X$ given outcomes $Y$. We call our method \textit{invertible graph neural network} (iGNN) due to the primary focus on generating node features on graph data. A notable feature of our proposed methods is that during network training, we revise the typically-used loss objective in normalizing flow and consider Wasserstein-2 regularization to facilitate the training process. Algorithmic-wise, we adopt an end-to-end training approach since our objective is to address prediction and generation in the forward and backward processes at once through a single model. Theoretically, we characterize the conditions for identifiability of a true mapping, the existence and invertibility of the mapping, and the expressiveness of iGNN in learning the mapping. Experimentally, we verify the performance of iGNN on both simulated and real-data datasets. We demonstrate through extensive numerical experiments that iGNN shows clear improvement over competing conditional generation benchmarks on high-dimensional and/or non-convex data.
    Indeterminacy in Latent Variable Models: Characterization and Strong Identifiability. (arXiv:2206.00801v1 [stat.ML])
    Most modern latent variable and probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Recent applications of such models have indicated the need for \textit{strongly} identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, most notably by the iVAE (arXiv:1907.04809 [stat.ML]), which excludes many -- but not all -- indeterminacies. We construct a full theoretical framework for analyzing the indeterminacies of latent variable models, and characterize them precisely in terms of properties of the generator functions and the latent variable prior distributions. To illustrate, we apply the framework to better understand the structure of recent identifiability results. We then investigate how we might specify strongly identifiable latent variable models, and construct two such classes of models. One is a straightforward modification of iVAE; the other uses ideas from optimal transport and leads to novel models and connections to recent work.
    Weakly Supervised Representation Learning with Sparse Perturbations. (arXiv:2206.01101v1 [cs.LG])
    The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.
    Faster Rates of Convergence to Stationary Points in Differentially Private Optimization. (arXiv:2206.00846v1 [cs.LG])
    We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,\delta)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $\alpha$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq \alpha$. We provide a new efficient algorithm that finds an $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{2/3}\big)$-stationary point in the finite-sum setting, where $n$ is the number of samples. This improves on the previous best rate of $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$. We also give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk. Our construction finds a $\tilde{O}\big(\frac{1}{n^{1/3}} + \big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$-stationary point of the population risk in time linear in $n$. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is $\tilde \Theta\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\big)$. Finally, we show that our methods can be used to provide dimension-independent rates of $O\big(\frac{1}{\sqrt{n}}+\min\big(\big[\frac{\sqrt{rank}}{n\varepsilon}\big]^{2/3},\frac{1}{(n\varepsilon)^{2/5}}\big)\big)$ on population stationarity for Generalized Linear Models (GLM), where $rank$ is the rank of the design matrix, which improves upon the previous best known rate.
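    The definition of an $\alpha$-stationary point, $\|\nabla F(\widehat{w})\|\leq \alpha$, can be checked directly once a gradient is available. The sketch below is only that check, with no privacy mechanism; `grad_norm`, `is_alpha_stationary` and the quadratic test function $F(w) = \frac{1}{2}\|w\|^2$ are assumptions chosen for illustration.

```python
import math


def grad_norm(grad):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(g * g for g in grad))


def is_alpha_stationary(grad_fn, w, alpha):
    """True if the gradient of F at w has norm at most alpha."""
    return grad_norm(grad_fn(w)) <= alpha


def quadratic_grad(w):
    """Gradient of the illustrative function F(w) = 0.5 * ||w||^2, which is w itself."""
    return list(w)


if __name__ == "__main__":
    print(is_alpha_stationary(quadratic_grad, [0.03, 0.04], alpha=0.1))
    print(is_alpha_stationary(quadratic_grad, [3.0, 4.0], alpha=0.1))
```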
    The effective noise of Stochastic Gradient Descent. (arXiv:2112.10852v3 [cond-mat.dis-nn] UPDATED)
    Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v2 [cs.LG] UPDATED)
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Counterfactual reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.
    Sequential Bayesian Neural Subnetwork Ensembles. (arXiv:2206.00794v1 [stat.ML])
    Deep neural network ensembles that appeal to model diversity have been used successfully to improve predictive performance and model robustness in several applications. Separately, it has recently been shown that sparse subnetworks of dense models can match the performance of their dense counterparts and increase their robustness while effectively decreasing the model complexity. However, most ensembling techniques require multiple parallel and costly evaluations and have been proposed primarily with deterministic models, whereas sparsity induction has been mostly done through ad-hoc pruning. We propose sequential ensembling of dynamic Bayesian neural subnetworks that systematically reduce model complexity through sparsity-inducing priors and generate diverse ensembles in a single forward pass of the model. The ensembling strategy consists of an exploration phase that finds high-performing regions of the parameter space and multiple exploitation phases that effectively exploit the compactness of the sparse model to quickly converge to different minima in the energy landscape corresponding to high-performing subnetworks yielding diverse ensembles. We empirically demonstrate that our proposed approach surpasses the baselines of the dense frequentist and Bayesian ensemble models in prediction accuracy, uncertainty estimation, and out-of-distribution (OoD) robustness on CIFAR10, CIFAR100 datasets, and their out-of-distribution variants: CIFAR10-C, CIFAR100-C induced by corruptions. Furthermore, we found that our approach produced the most diverse ensembles compared to the approaches with a single forward pass and even compared to the approaches with multiple forward passes in some cases.
    Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers. (arXiv:2009.01367v3 [cs.LG] UPDATED)
    While neural network binary classifiers are often evaluated on metrics such as Accuracy and $F_1$-Score, they are commonly trained with a cross-entropy objective. How can this training-evaluation gap be addressed? While specific techniques have been adopted to optimize certain confusion matrix based metrics, it is challenging or impossible in some cases to generalize the techniques to other metrics. Adversarial learning approaches have also been proposed to optimize networks via confusion matrix based metrics, but they tend to be much slower than common training methods. In this work, we propose a unifying approach to training neural network binary classifiers that combines a differentiable approximation of the Heaviside function with a probabilistic view of the typical confusion matrix values using soft sets. Our theoretical analysis shows the benefit of using our method to optimize for a given evaluation metric, such as $F_1$-Score, with soft sets, and our extensive experiments show the effectiveness of our approach in several domains.
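    To give a feel for how a smooth stand-in for the Heaviside step yields "soft" confusion-matrix counts, the sketch below computes a soft $F_1$ score from predicted probabilities. It is a minimal sketch under stated assumptions: a scaled sigmoid is swapped in as the smooth step (the paper's own approximation may differ), and `soft_step`, `soft_f1`, `threshold` and `sharpness` are hypothetical names.

```python
import math


def soft_step(x, threshold=0.5, sharpness=10.0):
    """Sigmoid used here as a smooth stand-in for the step function at `threshold`."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))


def soft_f1(probs, labels, threshold=0.5, sharpness=10.0):
    """Soft F1 from fractional true-positive, false-positive and false-negative counts.

    probs are predicted probabilities in [0, 1]; labels are 0 or 1.
    """
    tp = fp = fn = 0.0
    for p, y in zip(probs, labels):
        s = soft_step(p, threshold, sharpness)  # soft membership in "predicted positive"
        tp += s * y
        fp += s * (1 - y)
        fn += (1 - s) * y
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(soft_f1([0.99, 0.01], [1, 0]))  # confident and correct
    print(soft_f1([0.01, 0.99], [1, 0]))  # confident and wrong
```

    Because every count is a smooth function of the probabilities, one minus this quantity could in principle serve as a training loss in an autodiff framework; larger `sharpness` values track the hard metric more closely at the cost of flatter gradients away from the threshold.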
    On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting. (arXiv:2206.00761v1 [cs.LG])
    The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes to first make explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability and sample efficiency.
    Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. (arXiv:2205.07320v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis (LTH) has attracted attention because it can explain why over-parameterized models often show high generalization ability. It is known that when we use iterative magnitude pruning (IMP), which is an algorithm to find sparse networks with high generalization ability that can be trained from the initial weights independently, called winning tickets, the initial large learning rate does not work well in deep neural networks such as ResNet. However, since the initial large learning rate generally helps the optimizer to converge to flatter minima, we hypothesize that the winning tickets have relatively sharp minima, which is considered a disadvantage in terms of generalization ability. In this paper, we confirm this hypothesis and show that the PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer the PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets. Finally, we revisit existing algorithms for finding winning tickets from a PAC-Bayesian perspective and provide new insights into these methods.
    Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features. (arXiv:1911.05350v2 [stat.ML] UPDATED)
    Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive efficient large-scale learning algorithms. In this study, we consider solving a binary classification problem using random features and stochastic gradient descent. In recent research, an exponential convergence rate of the expected classification error under the strong low-noise condition has been shown. We extend these analyses to a random features setting, analyzing the error induced by the approximation of random features in terms of the distance between the generated hypothesis including population risk minimizers and empirical risk minimizers when using general Lipschitz loss functions, to show that an exponential convergence of the expected classification error is achieved even if random features approximation is applied. Additionally, we demonstrate that the convergence rate does not depend on the number of features and there is a significant computational benefit in using random features in classification problems because of the strong low-noise condition.
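    The combination described above, a random feature map followed by stochastic gradient steps on a Lipschitz loss, can be sketched in a few lines. The example below uses random Fourier features for a Gaussian kernel and per-sample gradient steps on the logistic loss. These specific choices, and the names `make_random_features`, `sgd_logistic`, `bandwidth`, `epochs` and `lr`, are assumptions for illustration rather than the setting analyzed in the paper.

```python
import math
import random


def make_random_features(dim, num_features, bandwidth=1.0, seed=0):
    """Build a random Fourier feature map approximating a Gaussian kernel.

    The inner product of two feature vectors approximates
    exp(-||x - y||^2 / (2 * bandwidth^2)).
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return features


def sgd_logistic(features, data, epochs=20, lr=1.0):
    """Per-sample gradient steps on the logistic loss; labels are -1 or +1.

    Samples are visited in a fixed order here to keep the sketch deterministic.
    """
    theta = None
    for _ in range(epochs):
        for x, y in data:
            z = features(x)
            if theta is None:
                theta = [0.0] * len(z)
            margin = y * sum(t * zi for t, zi in zip(theta, z))
            # Negative gradient of log(1 + exp(-margin)) with respect to theta.
            step = lr * y / (1.0 + math.exp(margin))
            theta = [t + step * zi for t, zi in zip(theta, z)]
    return theta


def predict(theta, features, x):
    """Sign of the linear score in feature space."""
    score = sum(t * zi for t, zi in zip(theta, features(x)))
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    feats = make_random_features(dim=2, num_features=5000)
    data = [([0.0, 0.0], 1), ([3.0, 3.0], -1)]
    theta = sgd_logistic(feats, data)
    print(predict(theta, feats, [0.0, 0.0]), predict(theta, feats, [3.0, 3.0]))
```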
    Deep neural networks can stably solve high-dimensional, noisy, non-linear inverse problems. (arXiv:2206.00934v1 [math.NA])
    We study the problem of reconstructing solutions of inverse problems with neural networks when only noisy data is available. We assume the problem can be modeled with an infinite-dimensional forward operator that is not continuously invertible. Then, we restrict this forward operator to finite-dimensional spaces so that the inverse is Lipschitz continuous. For the inverse operator, we demonstrate that there exists a neural network which is a robust-to-noise approximation of the function. In addition, we show that these neural networks can be learned from appropriately perturbed training data. We demonstrate the admissibility of this approach to a wide range of inverse problems of practical interest. Numerical examples are given that support the theoretical findings.
    Score-Based Generative Models Detect Manifolds. (arXiv:2206.01018v1 [stat.ML])
    Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.
    Evaluating Modules in Graph Contrastive Learning. (arXiv:2106.08171v2 [cs.LG] UPDATED)
    The recent emergence of contrastive learning approaches has facilitated their application to graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works perform only model-level evaluation and do not explore the combination space of modules for more comprehensive and systematic studies. For effective module-level evaluation, we propose a framework that decomposes GCL models into four modules: (1) a sampler to generate anchor, positive and negative data samples (nodes or graphs); (2) an encoder and a readout function to get sample embeddings; (3) a discriminator to score each sample pair (anchor-positive and anchor-negative); and (4) an estimator to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout could achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension. OpenGCL is available at https://github.com/thunlp/OpenGCL.
    On the Global Convergence Rates of Softmax Policy Gradient Methods. (arXiv:2005.06392v3 [cs.LG] UPDATED)
    We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
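    The first result above concerns softmax policy gradient with the true gradient. A minimal sketch of that update on a one-state problem (a bandit with known rewards) is given below; the reward vector, step size, and function names are illustrative assumptions, not taken from the paper, and the code only lets one observe the shrinking suboptimality gap rather than verifying any rate.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over a 1-D parameter vector."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_pg_bandit(rewards, steps=2000, lr=0.4):
    """Exact-gradient softmax policy gradient on a one-state problem.

    `rewards`, `steps` and `lr` are hypothetical illustration values.
    Returns the final policy and the suboptimality gap at every step.
    """
    r = np.asarray(rewards, dtype=float)
    theta = np.zeros_like(r)              # uniform initial policy
    gaps = []
    for _ in range(steps):
        pi = softmax(theta)
        value = pi @ r                    # expected reward under pi
        gaps.append(r.max() - value)      # gap to the best action
        # Gradient of pi . r with respect to theta_a is pi_a * (r_a - pi . r).
        theta = theta + lr * pi * (r - value)
    return softmax(theta), np.array(gaps)

policy, gaps = softmax_pg_bandit([1.0, 0.8, 0.2])
print(policy, gaps[0], gaps[-1])
```

    Plotting `gaps` against the step index on a log-log scale is one way to eyeball the decay on a toy instance like this.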
    Feature Space Particle Inference for Neural Network Ensembles. (arXiv:2206.00944v1 [cs.LG])
    Ensembles of deep neural networks demonstrate improved performance over single models. For enhancing the diversity of ensemble members while keeping their performance, particle-based inference methods offer a promising approach from a Bayesian perspective. However, the best way to apply these methods to neural networks is still unclear: seeking samples from the weight-space posterior suffers from inefficiency due to the over-parameterization issues, while seeking samples directly from the function-space posterior often results in serious underfitting. In this study, we propose optimizing particles in the feature space where the activation of a specific intermediate layer lies to address the above-mentioned difficulties. Our method encourages each member to capture distinct features, which is expected to improve ensemble prediction robustness. Extensive evaluation on real-world datasets shows that our model significantly outperforms the gold-standard Deep Ensembles on various metrics, including accuracy, calibration, and robustness. Code is available at https://github.com/DensoITLab/featurePI .
    Boosting Independent Component Analysis. (arXiv:2112.06920v3 [stat.ML] UPDATED)
    Independent component analysis is intended to recover the mutually independent components from their linear mixtures. This technique has been widely used in many fields, such as data analysis, signal processing, and machine learning. To alleviate the dependency on prior knowledge concerning unknown sources, many nonparametric methods have been proposed. In this paper, we present a novel boosting-based algorithm for independent component analysis. Our algorithm consists of maximizing likelihood estimation via boosting and seeking unmixing matrix by the fixed-point method. A variety of experiments validate its performance compared with many of the presently known algorithms.
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. (arXiv:2206.00939v1 [stat.ML])
    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
    Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions. (arXiv:2206.01029v1 [math.OC])
    We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.
    Bayesian Inference for the Multinomial Probit Model under Gaussian Prior Distribution. (arXiv:2206.00720v1 [stat.ME])
    Multinomial probit (mnp) models are fundamental and widely applied regression models for categorical data. Fasano and Durante (2022) proved that the class of unified skew-normal distributions is conjugate to several mnp sampling models. This enables the development of Monte Carlo samplers and accurate variational methods to perform Bayesian inference. In this paper, we adapt the above-mentioned results for a popular special case: the discrete-choice mnp model under zero mean and independent Gaussian priors. This yields simplified expressions for the parameters of the posterior distribution and an alternative derivation for the variational algorithm that gives a novel understanding of the fundamental results in Fasano and Durante (2022) as well as computational advantages in our special settings.
    Estimating the Arc Length of the Optimal ROC Curve and Lower Bounding the Maximal AUC. (arXiv:2110.09651v2 [math.ST] UPDATED)
    We show that when the data likelihood ratio is used as the score function, the arc length of the corresponding ROC curve gives rise to a novel $f$-divergence which measures differences between the positive and negative data distributions. This $f$-divergence can be expressed using a variational objective and estimated only using samples from the positive and negative \emph{data} distributions. We show the empirical version of this variational objective is also a consistent estimator for the arctangent likelihood ratio with a non-parametric convergence rate $O_p(n^{-\beta/4})$ ($\beta \in (0,1]$ depends on the smoothness). Moreover, we show the surface area below the optimal ROC curve can be expressed as a similar variational objective depending on the arctangent likelihood ratio. These new insights lead to a novel two-step procedure for finding a good score function by lower bounding the maximal AUC. Experiments on CIFAR-10 datasets show the proposed two-step procedure achieves good AUC performance in imbalanced binary classification tasks while being less computationally demanding.
    Compositional Coding Capsule Network with K-Means Routing for Text Classification. (arXiv:1810.09177v5 [cs.LG] UPDATED)
    Text classification is a challenging problem which aims to identify the category of texts. During training, word embeddings account for a large share of the parameters. When computing resources are limited, this indirectly restricts the design of the subsequent network. In order to reduce the number of parameters, the compositional coding mechanism has been proposed recently. Based on this, this paper further explores compositional coding and proposes a compositional weighted coding method. We also apply a capsule network to model the relationship between word embeddings, and propose a new routing algorithm, based on k-means clustering theory, to fully mine that relationship. Combining our compositional weighted coding method with the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.
    An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. (arXiv:2206.01182v1 [stat.ML])
    Subsampling methods aim to select a subsample as a surrogate for the observed sample. Such methods have been used pervasively in large-scale data analytics, active learning, and privacy-preserving analysis in recent decades. Instead of model-based methods, in this paper, we study model-free subsampling methods, which aim to identify a subsample that is not confined by model assumptions. Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks. Most of these methods suffer from either a large computational burden or a theoretical weakness. In particular, the theoretical weakness is that the empirical distribution of the selected subsample may not necessarily converge to the population distribution. Such computational and theoretical limitations hinder the broad applicability of model-free subsampling methods in practice. We propose a novel model-free subsampling method by utilizing optimal transport techniques. Moreover, we develop an efficient subsampling algorithm that is adaptive to the unknown probability density function. Theoretically, we show the selected subsample can be used for efficient density estimation by deriving the convergence rate for the proposed subsample kernel density estimator. We also provide the optimal bandwidth for the proposed estimator. Numerical studies on synthetic and real-world datasets demonstrate the performance of the proposed method is superior.
    Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning. (arXiv:2206.01162v1 [cs.LG])
    In this work, we propose a novel ${\bf K}$ernelized ${\bf S}$tein Discrepancy-based Posterior Sampling for ${\bf RL}$ algorithm (named $\texttt{KSRL}$) which extends model-based RL based upon posterior sampling (PSRL) in several ways: we (i) relax the need for any smoothness or Gaussian assumptions, allowing for complex mixture models; (ii) ensure it is applicable to large-scale training by incorporating a compression step such that the posterior consists of a \emph{Bayesian coreset} of only statistically significant past state-action pairs; and (iii) develop a novel regret analysis of PSRL based upon integral probability metrics, which, under a smoothness condition on the constructed posterior, can be evaluated in closed form as the kernelized Stein discrepancy (KSD). Consequently, we are able to improve the $\mathcal{O}(H^{3/2}d\sqrt{T})$ regret of PSRL to $\mathcal{O}(H^{3/2}\sqrt{T})$, where $d$ is the input dimension, $H$ is the episode length, and $T$ is the total number of episodes experienced, alleviating a linear dependence on $d$. Moreover, we theoretically establish a trade-off between regret rate and posterior representational complexity by introducing a compression budget parameter $\epsilon$ based on KSD, and establish a lower bound on the required complexity for consistency of the model. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, with substantive improvements in computation time: it can achieve up to a $50\%$ reduction in wall-clock time in some continuous control environments.
    Sparse Mixed Linear Regression with Guarantees: Taming an Intractable Problem with Invex Relaxation. (arXiv:2206.01167v1 [cs.LG])
    In this paper, we study the problem of sparse mixed linear regression on an unlabeled dataset that is generated from linear measurements from two different regression parameter vectors. Since the data is unlabeled, our task is not only to figure out a good approximation of the regression parameter vectors but also to label the dataset correctly. In its original form, this problem is NP-hard. The most popular algorithms to solve this problem (such as Expectation-Maximization) have a tendency to get stuck at local minima. We provide a novel invex relaxation for this intractable problem which leads to a solution with provable theoretical guarantees. This relaxation enables exact recovery of data labels. Furthermore, we recover a close approximation of the regression parameter vectors which matches the true parameter vectors in support and sign. Our formulation uses a carefully constructed primal-dual witness framework for the invex problem. Furthermore, we show that the sample complexity of our method is only logarithmic in terms of the dimension of the regression parameter vectors.
    Efficient $\Phi$-Regret Minimization in Extensive-Form Games via Online Mirror Descent. (arXiv:2205.15294v2 [cs.LG] UPDATED)
    A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$\Phi$-Hedge} algorithm, a generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$\Phi$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To the best of our knowledge, this is the first such rate and matches the information-theoretic lower bound.
    Primal-dual extrapolation methods for monotone inclusions under local Lipschitz continuity with applications to variational inequality, conic constrained saddle point, and convex conic optimization problems. (arXiv:2206.00973v1 [math.OC])
    In this paper we consider a class of structured monotone inclusion (MI) problems that consist of finding a zero in the sum of two monotone operators, in which one is maximal monotone while another is locally Lipschitz continuous. In particular, we first propose a primal-dual extrapolation (PDE) method for solving a structured strongly MI problem by modifying the classical forward-backward splitting method by using a point and operator extrapolation technique, in which the parameters are adaptively updated by a backtracking line search scheme. The proposed PDE method is almost parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\log \epsilon^{-1})$, measured by the amount of fundamental operations consisting only of evaluations of one operator and resolvent of another operator, for finding an $\epsilon$-residual solution of the structured strongly MI problem. We then propose another PDE method for solving a structured non-strongly MI problem by applying the above PDE method to approximately solve a sequence of structured strongly MI problems. The resulting PDE method is parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\epsilon^{-1}\log \epsilon^{-1})$ for finding an $\epsilon$-residual solution of the structured non-strongly MI problem. As a consequence, we apply the latter PDE method to convex conic optimization, conic constrained saddle point, and variational inequality problems, and obtain complexity results for finding an $\epsilon$-KKT or $\epsilon$-residual solution of them under local Lipschitz continuity. To the best of our knowledge, no prior studies were conducted to investigate methods with complexity guarantees for solving the aforementioned problems under local Lipschitz continuity. All the complexity results obtained in this paper are entirely new.
    DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. (arXiv:2206.00927v1 [cs.LG])
    Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from their slow sampling as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying change-of-variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.
    A Log-Linear Time Sequential Optimal Calibration Algorithm for Quantized Isotonic L2 Regression. (arXiv:2206.00744v1 [cs.LG])
    We study the sequential calibration of estimations in a quantized isotonic L2 regression setting. We start by showing that the optimal calibrated quantized estimations can be acquired from the traditional isotonic L2 regression solution. We modify the traditional PAVA algorithm to create calibrators for both batch and sequential optimization of the quantized isotonic regression problem. Our algorithm can update the optimal quantized monotone mapping for the samples observed so far in linear space and logarithmic time per new unordered sample.
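    For orientation, the traditional pool-adjacent-violators algorithm (PAVA) that the abstract starts from can be sketched as follows. This is the textbook batch version for isotonic L2 regression on inputs already sorted by their covariate; it is not the authors' quantized or sequential variant, and the function name and optional weights are assumptions made for illustration.

```python
def pava(y, w=None):
    """Non-decreasing least-squares fit via pool-adjacent-violators.

    `y` holds responses ordered by their covariate; `w` holds optional
    positive weights. Returns the fitted values, one per input.
    """
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)   # expand each block back to its points
    return fit

print(pava([1, 3, 2, 4]))  # → [1.0, 2.5, 2.5, 4.0]
```

    Each point is pushed once and each merge removes a block, so this batch pass does a linear amount of work in the number of samples.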
    Improving Diffusion Models for Inverse Problems using Manifold Constraints. (arXiv:2206.00941v1 [cs.LG])
    Recently, diffusion models have been used to solve various inverse problems in an unsupervised manner with appropriate modifications to the sampling process. However, the current solvers, which recursively apply a reverse diffusion step followed by a measurement consistency step, often produce sub-optimal results. By studying the generative sampling path, here we show that current solvers throw the sample path off the data manifold, and hence the error accumulates. To address this, we propose an additional correction term inspired by the manifold constraint, which can be used synergistically with the previous solvers to make the iterations close to the manifold. The proposed manifold constraint is straightforward to implement within a few lines of code, yet boosts the performance by a surprisingly large margin. With extensive experiments, we show that our method is superior to the previous methods both theoretically and empirically, producing promising results in many applications such as image inpainting, colorization, and sparse-view computed tomography.
    Discovery of interpretable structural model errors by combining Bayesian sparse regression and data assimilation: A chaotic Kuramoto-Sivashinsky test case. (arXiv:2110.00546v2 [physics.comp-ph] UPDATED)
    Models of many engineering and natural systems are imperfect. The discrepancy between the mathematical representations of a true physical system and its imperfect model is called the model error. These model errors can lead to substantial differences between the numerical solutions of the model and the state of the system, particularly in those involving nonlinear, multi-scale phenomena. Thus, there is increasing interest in reducing model errors, particularly by leveraging the rapidly growing observational data to understand their physics and sources. Here, we introduce a framework named MEDIDA: Model Error Discovery with Interpretability and Data Assimilation. MEDIDA only requires a working numerical solver of the model and a small number of noise-free or noisy sporadic observations of the system. In MEDIDA, first the model error is estimated from differences between the observed states and model-predicted states (the latter are obtained from a number of one-time-step numerical integrations from the previous observed states). If observations are noisy, a data assimilation (DA) technique such as ensemble Kalman filter (EnKF) is employed to provide the analysis state of the system, which is then used to estimate the model error. Finally, an equation-discovery technique, here the relevance vector machine (RVM), a sparsity-promoting Bayesian method, is used to identify an interpretable, parsimonious, and closed-form representation of the model error. Using the chaotic Kuramoto-Sivashinsky (KS) system as the test case, we demonstrate the excellent performance of MEDIDA in discovering different types of structural/parametric model errors, representing different types of missing physics, using noise-free and noisy observations.
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v2 [cs.LG] UPDATED)
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    Coordinated Double Machine Learning. (arXiv:2206.00885v1 [stat.ML])
    Double machine learning is a statistical method for leveraging complex black-box models to construct approximately unbiased treatment effect estimates given observational data with high-dimensional covariates, under the assumption of a partially linear model. The idea is to first fit on a subset of the samples two non-linear predictive models, one for the continuous outcome of interest and one for the observed treatment, and then to estimate a linear coefficient for the treatment using the remaining samples through a simple orthogonalized regression. While this methodology is flexible and can accommodate arbitrary predictive models, typically trained independently of one another, this paper argues that a carefully coordinated learning algorithm for deep neural networks may reduce the estimation bias. The improved empirical performance of the proposed method is demonstrated through numerical experiments on both simulated and real data.
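    The baseline two-stage procedure described above (fit two predictive models on one part of the data, then run an orthogonalized regression on the rest) can be sketched as follows. This is a sketch of standard double machine learning with cross-fitting, not the coordinated method the paper proposes; the polynomial fits stand in for black-box models, and the simulated data, `fit_poly`, and `dml_partially_linear` are hypothetical names and choices made for illustration.

```python
import numpy as np

def fit_poly(x, target, degree=5):
    """Least-squares polynomial, a simple stand-in for a black-box model."""
    coef = np.polyfit(x, target, degree)
    return lambda z: np.polyval(coef, z)

def dml_partially_linear(x, d, y, n_folds=2, seed=0):
    """Cross-fitted estimate of theta in y = theta * d + g(x) + noise."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    res_d, res_y = np.empty(len(x)), np.empty(len(x))
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        m_hat = fit_poly(x[train_idx], d[train_idx])  # treatment given covariate
        l_hat = fit_poly(x[train_idx], y[train_idx])  # outcome given covariate
        # Residualize on held-out samples only.
        res_d[test_idx] = d[test_idx] - m_hat(x[test_idx])
        res_y[test_idx] = y[test_idx] - l_hat(x[test_idx])
    # Orthogonalized regression of outcome residuals on treatment residuals.
    return float(res_d @ res_y / (res_d @ res_d))

# Simulated partially linear data (assumed for illustration only).
rng = np.random.default_rng(1)
n, theta_true = 4000, 2.0
x = rng.uniform(-2.0, 2.0, n)
d = np.sin(x) + rng.normal(size=n)
y = theta_true * d + x**2 + rng.normal(size=n)
theta_hat = dml_partially_linear(x, d, y)
print(theta_hat)
```

    Note that the two nuisance models here are trained independently of one another, which is exactly the practice the abstract contrasts with its coordinated alternative.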
    Offline Reinforcement Learning with Differential Privacy. (arXiv:2206.00810v1 [cs.LG])
    The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.
    Learning Disentangled Representations for Counterfactual Regression via Mutual Information Minimization. (arXiv:2206.01022v1 [cs.LG])
    Learning individual-level treatment effect is a fundamental problem in causal inference and has received increasing attention in many areas, especially in the user growth area which concerns many internet companies. Recently, disentangled representation learning methods that decompose covariates into three latent factors, including instrumental, confounding and adjustment factors, have witnessed great success in treatment effect estimation. However, it remains an open problem how to learn the underlying disentangled factors precisely. Specifically, previous methods fail to obtain independent disentangled factors, which is a necessary condition for identifying treatment effect. In this paper, we propose Disentangled Representations for Counterfactual Regression via Mutual Information Minimization (MIM-DRCFR), which uses a multi-task learning framework to share information when learning the latent factors and incorporates MI minimization learning criteria to ensure the independence of these factors. Extensive experiments including public benchmarks and real-world industrial user growth datasets demonstrate that our method performs much better than state-of-the-art methods.
    A Confirmation of a Conjecture on the Feldman's Two-armed Bandit Problem. (arXiv:2206.00821v1 [math.ST])
    The myopic strategy is one of the most important strategies when studying bandit problems. In this paper, we consider the two-armed bandit problem proposed by Feldman. With general distributions and utility functions, we obtain a necessary and sufficient condition for the optimality of the myopic strategy. As an application, we resolve Nouiehed and Ross's conjecture for Bernoulli two-armed bandit problems that the myopic strategy stochastically maximizes the number of wins.
    Defense Against Gradient Leakage Attacks via Learning to Obscure Data. (arXiv:2206.00769v1 [cs.LG])
    Federated learning is considered an effective privacy-preserving learning mechanism that separates the client's data and model training process. However, federated learning is still at risk of privacy leakage because attackers can deliberately conduct gradient leakage attacks to reconstruct the client data. Recently, popular strategies such as gradient perturbation methods and input encryption methods have been proposed to defend against gradient leakage attacks. Nevertheless, these defenses can either greatly sacrifice the model performance, or be evaded by more advanced attacks. In this paper, we propose a new defense method to protect the privacy of clients' data by learning to obscure data. Our defense method can generate synthetic samples that are totally distinct from the original samples, but they can also maximally preserve their predictive features and guarantee the model performance. Furthermore, our defense strategy makes it extremely difficult for the gradient leakage attack and its variants to reconstruct the client data. Through extensive experiments, we show that our proposed defense method obtains better privacy protection while preserving high accuracy compared with state-of-the-art methods.
    Collaborative Learning of Distributions under Heterogeneity and Communication Constraints. (arXiv:2206.00707v1 [stat.ML])
    In modern machine learning, users often have to collaborate to learn distributions that generate the data. Communication can be a significant bottleneck. Prior work has studied homogeneous users -- i.e., whose data follow the same discrete distribution -- and has provided optimal communication-efficient methods. However, these methods rely heavily on homogeneity, and are less applicable in the common case when users' discrete distributions are heterogeneous. Here we consider a natural and tractable model of heterogeneity, where users' discrete distributions only vary sparsely, on a small number of entries. We propose a novel two-stage method named SHIFT: First, the users collaborate by communicating with the server to learn a central distribution, relying on methods from robust statistics. Then, the learned central distribution is fine-tuned to estimate the individual distributions of users. We show that SHIFT is minimax optimal in our model of heterogeneity and under communication constraints. Further, we provide experimental results using both synthetic data and $n$-gram frequency estimation in the text domain, which corroborate its efficiency.
    Split-kl and PAC-Bayes-split-kl Inequalities. (arXiv:2206.00706v1 [stat.ML])
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality combines the combinatorial power of the kl inequality with the ability to exploit low variance. While for Bernoulli random variables the kl inequality is tighter than the Empirical Bernstein, for random variables taking values inside a bounded interval and having low variance the Empirical Bernstein inequality is tighter than the kl. The proposed split-kl inequality yields the best of both worlds. We discuss an application of the split-kl inequality to bounding excess losses. We also derive a PAC-Bayes-split-kl inequality and use a synthetic example and several UCI datasets to compare it with the PAC-Bayes-kl, PAC-Bayes Empirical Bernstein, PAC-Bayes Unexpected Bernstein, and PAC-Bayes Empirical Bennett inequalities.
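    For readers unfamiliar with the kl inequality that split-kl builds on, the following is a minimal Python sketch of how a kl-based upper bound on a Bernoulli mean is commonly computed, by numerically inverting the binary KL divergence. It does not implement the split-kl inequality itself. The function names and the budget ln((n+1)/delta)/n (one common form of the kl inequality) are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def kl_bernoulli(p_hat: float, p: float) -> float:
    """Binary KL divergence kl(p_hat || p) between two Bernoulli means."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if p_hat > 0.0:
        total += p_hat * math.log(p_hat / p)
    if p_hat < 1.0:
        total += (1.0 - p_hat) * math.log((1.0 - p_hat) / (1.0 - p))
    return total


def kl_upper_inverse(p_hat: float, budget: float, tol: float = 1e-10) -> float:
    """Largest p >= p_hat with kl(p_hat || p) <= budget, found by bisection."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(p_hat, mid) > budget:
            hi = mid  # mid violates the budget, so the answer lies below it
        else:
            lo = mid  # mid satisfies the budget, so the answer lies at or above it
    return hi


def kl_upper_bound(p_hat: float, n: int, delta: float) -> float:
    """Upper confidence bound on the true mean from n samples (assumed budget)."""
    budget = math.log((n + 1) / delta) / n
    return kl_upper_inverse(p_hat, budget)


if __name__ == "__main__":
    print(kl_upper_bound(p_hat=0.05, n=1000, delta=0.05))
```

    Bisection works here because kl(p_hat || p) is increasing in p on [p_hat, 1], so the set of p values within the budget is an interval starting at p_hat.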
  • Open

    Robust Feature-Level Adversaries are Interpretability Tools. (arXiv:2110.03605v4 [cs.LG] UPDATED)
    The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying the representations in models. Second, we show that these adversaries are versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results indicate that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations.
    WebGPT: Browser-assisted question-answering with human feedback. (arXiv:2112.09332v3 [cs.CL] UPDATED)
    We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
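    The rejection sampling step mentioned above can be illustrated with a best-of-n selection loop: draw several candidate answers and keep the one the reward model scores highest. This is a minimal sketch under stated assumptions, not the authors' implementation; `generate` and `reward` are hypothetical callables standing in for the fine-tuned policy and the learned reward model.

```python
from typing import Callable


def best_of_n(
    generate: Callable[[str], str],       # hypothetical: samples one answer for a question
    reward: Callable[[str, str], float],  # hypothetical: scores a (question, answer) pair
    question: str,
    n: int = 4,
) -> str:
    """Sample n candidate answers and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))


if __name__ == "__main__":
    # Toy stand-ins: a fixed pool of answers and a reward that prefers longer text.
    pool = iter(["short", "a much longer answer", "mid answer"])
    chosen = best_of_n(lambda q: next(pool), lambda q, a: float(len(a)), "why?", n=3)
    print(chosen)
```

    Increasing n trades extra sampling cost for a better chance of drawing a highly scored answer; the policy itself is left unchanged.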
    Metrizing Fairness. (arXiv:2205.15049v2 [cs.LG] UPDATED)
    We study supervised learning problems for predicting properties of individuals who belong to one of two demographic groups, and we seek predictors that are fair according to statistical parity. This means that the distributions of the predictions within the two groups should be close with respect to the Kolmogorov distance, and fairness is achieved by penalizing the dissimilarity of these two distributions in the objective function of the learning problem. In this paper, we showcase conceptual and computational benefits of measuring unfairness with integral probability metrics (IPMs) other than the Kolmogorov distance. Conceptually, we show that the generator of any IPM can be interpreted as a family of utility functions and that unfairness with respect to this IPM arises if individuals in the two demographic groups have diverging expected utilities. We also prove that the unfairness-regularized prediction loss admits unbiased gradient estimators if unfairness is measured by the squared $\mathcal L^2$-distance or by a squared maximum mean discrepancy. In this case, the fair learning problem is susceptible to efficient stochastic gradient descent (SGD) algorithms. Numerical experiments on real data show that these SGD algorithms outperform state-of-the-art methods for fair learning in that they achieve superior accuracy-unfairness trade-offs -- sometimes orders of magnitude faster. Finally, we identify conditions under which statistical parity can improve prediction accuracy.
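    To make the idea of an unbiased estimator of a squared maximum mean discrepancy concrete, here is a small pure-Python sketch of the standard U-statistic estimator applied to the scalar predictions of two groups. The Gaussian kernel, the bandwidth, and the function names are illustrative assumptions; the paper's own estimator and regularized objective may differ.

```python
import math
from typing import Sequence


def gaussian_kernel(a: float, b: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel on scalars; the bandwidth is an assumed hyperparameter."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x: Sequence[float], y: Sequence[float], bandwidth: float = 1.0) -> float:
    """Unbiased U-statistic estimate of the squared MMD between samples x and y."""
    m, n = len(x), len(y)
    if m < 2 or n < 2:
        raise ValueError("each group needs at least two samples")
    # Within-group sums exclude the diagonal (i == j), which is what removes the bias.
    sum_xx = sum(gaussian_kernel(x[i], x[j], bandwidth)
                 for i in range(m) for j in range(m) if i != j)
    sum_yy = sum(gaussian_kernel(y[i], y[j], bandwidth)
                 for i in range(n) for j in range(n) if i != j)
    sum_xy = sum(gaussian_kernel(xi, yj, bandwidth) for xi in x for yj in y)
    return sum_xx / (m * (m - 1)) + sum_yy / (n * (n - 1)) - 2.0 * sum_xy / (m * n)


if __name__ == "__main__":
    group_a = [0.0, 0.1, 0.2]  # predictions for one demographic group (toy values)
    group_b = [5.0, 5.1, 5.2]  # predictions for the other group (toy values)
    print(mmd2_unbiased(group_a, group_b))
```

    Because the estimate is an average over sample pairs, the same formula evaluated on a mini-batch gives an unbiased estimate of the full-data quantity, which is the property that makes stochastic gradient methods applicable.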
    A Hybrid Spatial-temporal Deep Learning Architecture for Lane Detection. (arXiv:2110.04079v4 [cs.CV] UPDATED)
    Accurate and reliable lane detection is vital for the safe performance of lane-keeping assistance and lane departure warning systems. However, under certain challenging circumstances, it is difficult to detect lanes accurately from a single image, as is mostly done in the current literature. Since lane markings are continuous lines, lanes that are difficult to detect accurately in the current image can potentially be better deduced if information from previous frames is incorporated. This study proposes a novel hybrid spatial-temporal (ST) sequence-to-one deep learning architecture. This architecture makes full use of the ST information in multiple continuous image frames to detect the lane markings in the very last frame. Specifically, the hybrid model integrates the following aspects: (a) the single image feature extraction module equipped with the spatial convolutional neural network; (b) the ST feature integration module constructed by ST recurrent neural network; (c) the encoder-decoder structure, which makes this image segmentation problem work in an end-to-end supervised learning format. Extensive experiments reveal that the proposed model architecture can effectively handle challenging driving scenes and outperforms available state-of-the-art methods.
    Pre-Trained Language Models for Interactive Decision-Making. (arXiv:2202.01771v3 [cs.LG] UPDATED)
    Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. We then examine how our framework may be used in environments without pre-collected expert data. To do this, we integrate an active data gathering procedure into pre-trained LMs. The agent iteratively learns by interacting with the environment, relabeling the language goal of past 'failed' experiences, and updating the policy in a self-supervised loop. The active data gathering procedure also enables effective combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans.
    When does return-conditioned supervised learning work for offline reinforcement learning?. (arXiv:2206.01079v1 [cs.LG])
    Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous study of the capabilities and limitations of RCSL, something which is crucially missing in previous work. We find that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms. We provide specific examples of MDPs and datasets that illustrate the necessity of these assumptions and the limits of RCSL. Finally, we present empirical evidence that these limitations will also cause issues in practice by providing illustrative experiments in simple point-mass environments and on datasets from the D4RL benchmark.
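    As a rough illustration of the RCSL recipe described above (learn the action distribution conditioned on state and trajectory return, then act by conditioning on a high return), here is a toy tabular sketch. The data layout and function names are hypothetical; practical RCSL algorithms use function approximation rather than counts.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Optional, Tuple

Step = Tuple[Hashable, Hashable]   # (state, action)
Trajectory = Tuple[List[Step], float]  # (steps, total return of the trajectory)


def fit_rcsl(trajectories: List[Trajectory]) -> Dict[Tuple[Hashable, float], Counter]:
    """Count how often each action was taken in each (state, return) context."""
    counts: Dict[Tuple[Hashable, float], Counter] = defaultdict(Counter)
    for steps, total_return in trajectories:
        for state, action in steps:
            counts[(state, total_return)][action] += 1
    return counts


def act(counts: Dict[Tuple[Hashable, float], Counter],
        state: Hashable, target_return: float) -> Optional[Hashable]:
    """Most frequent action for this state among trajectories with the target return."""
    context = counts.get((state, target_return))
    if not context:
        return None  # the dataset never paired this state with this return
    return context.most_common(1)[0][0]


if __name__ == "__main__":
    data = [
        ([("s0", "left")], 0.0),
        ([("s0", "right")], 1.0),
        ([("s0", "right")], 1.0),
    ]
    model = fit_rcsl(data)
    print(act(model, "s0", target_return=1.0))
```

    The `None` branch hints at one limitation the abstract alludes to: conditioning on a return that the dataset never achieved from a given state gives the policy nothing to imitate.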
    Sentiment Analysis and Effect of COVID-19 Pandemic using College SubReddit Data. (arXiv:2112.04351v2 [cs.CL] UPDATED)
    Background: The COVID-19 pandemic has affected our society and human well-being in various ways. In this study, we investigate how the pandemic has influenced people's emotions and psychological states compared to a pre-pandemic period using real-world data from social media. Method: We collected Reddit social media data from 2019 (pre-pandemic) and 2020 (pandemic) from the subreddit communities associated with eight universities. We applied the pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) to learn text embeddings from the Reddit messages, and leveraged the relational information among posted messages to train a graph attention network (GAT) for sentiment classification. Finally, we applied model stacking to combine the prediction probabilities from RoBERTa and GAT to yield the final sentiment classification. With the model-predicted sentiment labels on the collected data, we used a generalized linear mixed-effects model to estimate the effects of the pandemic and of in-person teaching during the pandemic on sentiment. Results: The results suggest that the odds of negative sentiments in 2020 (pandemic) were 25.7% higher than the odds in 2019 (pre-pandemic) with a $p$-value $<0.001$; and the odds of negative sentiments associated with in-person learning were 48.3% higher than with remote learning in 2020 with a $p$-value of 0.029. Conclusions: Our study results are consistent with the findings in the literature on the negative impacts of the pandemic on people's emotions and psychological states. Our study contributes to the growing real-world evidence on the various negative impacts of the pandemic on our society; it also provides a good example of using both ML techniques and statistical modeling and inference to make better use of real-world data.
    AQuaMoHo: Localized Low-Cost Outdoor Air Quality Sensing over a Thermo-Hygrometer. (arXiv:2204.11484v2 [cs.CY] UPDATED)
    Efficient air quality sensing is one of the essential services provided in any modern smart city. Air quality monitoring mostly relies on sparsely deployed Air Quality Monitoring Stations (AQMSs) that are difficult to install and maintain, so spatial variation heavily impacts monitoring for locations far from this pre-deployed public infrastructure. To mitigate this, in this paper we propose a framework named AQuaMoHo that can annotate data obtained from a low-cost thermo-hygrometer (as the sole physical sensing device) with AQI labels, with the help of additional publicly crawled spatio-temporal information for that locality. At its core, AQuaMoHo exploits the temporal patterns from a set of readily available spatial features using an LSTM-based model and further enhances the overall quality of the annotation using temporal attention. From a thorough study of two different cities, we observe that AQuaMoHo can significantly help annotate the air quality data on a personal scale.
    Evaluating Modules in Graph Contrastive Learning. (arXiv:2106.08171v2 [cs.LG] UPDATED)
    The recent emergence of contrastive learning approaches facilitates their application to graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works only performed model-level evaluation, and did not explore the combination space of modules for more comprehensive and systematic studies. For effective module-level evaluation, we propose a framework that decomposes GCL models into four modules: (1) a sampler to generate anchor, positive and negative data samples (nodes or graphs); (2) an encoder and a readout function to get sample embeddings; (3) a discriminator to score each sample pair (anchor-positive and anchor-negative); and (4) an estimator to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we manage to quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout could achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension. OpenGCL is available at https://github.com/thunlp/OpenGCL.
    Efficient $\Phi$-Regret Minimization in Extensive-Form Games via Online Mirror Descent. (arXiv:2205.15294v2 [cs.LG] UPDATED)
    A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the $\Phi$-Hedge algorithm -- a generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the $\Phi$-Hedge algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To the best of our knowledge, this is the first such rate and matches the information-theoretic lower bound.
    Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning. (arXiv:2206.01162v1 [cs.LG])
    In this work, we propose a novel Kernelized Stein Discrepancy-based Posterior Sampling for RL algorithm (named KSRL) which extends model-based RL based upon posterior sampling (PSRL) in several ways: we (i) relax the need for any smoothness or Gaussian assumptions, allowing for complex mixture models; (ii) ensure it is applicable to large-scale training by incorporating a compression step such that the posterior consists of a Bayesian coreset of only statistically significant past state-action pairs; and (iii) develop a novel regret analysis of PSRL based upon integral probability metrics, which, under a smoothness condition on the constructed posterior, can be evaluated in closed form as the kernelized Stein discrepancy (KSD). Consequently, we are able to improve the $\mathcal{O}(H^{3/2}d\sqrt{T})$ regret of PSRL to $\mathcal{O}(H^{3/2}\sqrt{T})$, where $d$ is the input dimension, $H$ is the episode length, and $T$ is the total number of episodes experienced, alleviating a linear dependence on $d$. Moreover, we theoretically establish a trade-off between regret rate and posterior representational complexity by introducing a compression budget parameter $\epsilon$ based on KSD, and establish a lower bound on the required complexity for consistency of the model. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, with substantive improvements in computation time: it can achieve up to a $50\%$ reduction in wall-clock time in some continuous control environments.
    DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. (arXiv:2206.01062v1 [cs.CV])
    Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack layout variability since they are sourced only from scientific article repositories such as PubMed and arXiv. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
    Graph Signal Restoration Using Nested Deep Algorithm Unrolling. (arXiv:2106.15910v3 [eess.SP] UPDATED)
    Graph signal processing is a ubiquitous task in many applications such as sensor, social, transportation and brain networks, point cloud processing, and graph neural networks. Often, graph signals are corrupted in the sensing process, thus requiring restoration. In this paper, we propose two graph signal restoration methods based on deep algorithm unrolling (DAU). First, we present a graph signal denoiser by unrolling iterations of the alternating direction method of multipliers (ADMM). We then suggest a general restoration method for linear degradation by unrolling iterations of Plug-and-Play ADMM (PnP-ADMM). In the second approach, the unrolled ADMM-based denoiser is incorporated as a submodule, leading to a nested DAU structure. The parameters in the proposed denoising/restoration methods are trainable in an end-to-end manner. Our approach is interpretable and keeps the number of parameters small since we only tune graph-independent regularization parameters. We overcome two main challenges in existing graph signal restoration methods: 1) the limited performance of convex optimization algorithms due to fixed parameters, which are often determined manually; and 2) the large number of parameters of graph neural networks, which makes training difficult. Several experiments for graph signal denoising and interpolation are performed on synthetic and real-world data. The proposed methods show performance improvements over several existing techniques in terms of root mean squared error in both tasks.
    Locating and Editing Factual Associations in GPT. (arXiv:2202.05262v3 [cs.CL] UPDATED)
    We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/
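    The idea of editing a feed-forward weight matrix with a rank-one update can be sketched in a few lines. The version below is a deliberately simplified illustration, not ROME itself: it applies the minimum-norm rank-one change that makes a chosen key vector map to a chosen value vector, and it omits the statistics over previously stored keys that the actual method relies on. All names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def rank_one_edit(weights: Matrix, key: Vector, value: Vector) -> Matrix:
    """Return W' = W + (value - W key) key^T / (key^T key), so that W' key == value."""
    norm_sq = sum(k * k for k in key)
    if norm_sq == 0.0:
        raise ValueError("key must be non-zero")
    residual = [v - wk for v, wk in zip(value, matvec(weights, key))]
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(weights, residual)
    ]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    edited = rank_one_edit(W, key=[1.0, 2.0], value=[3.0, -1.0])
    print(matvec(edited, [1.0, 2.0]))   # the edited key now maps to the new value
    print(matvec(edited, [2.0, -1.0]))  # a vector orthogonal to the key is unaffected
```

    In this simplified form, inputs orthogonal to the edited key produce exactly the same output as before, which gives a crude picture of how a rank-one change can stay specific to one association.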
    Masked Bayesian Neural Networks : Computation and Optimality. (arXiv:2206.00853v1 [stat.ML])
    As data size and computing power increase, the architectures of deep neural networks (DNNs) have become larger and more complex, and thus there is a growing need to simplify such DNNs. In this paper, we propose a novel sparse Bayesian neural network (BNN) which searches for a good DNN with an appropriate complexity. We employ masking variables at each node which can turn off some nodes according to the posterior distribution to yield a nodewise sparse DNN. We devise a prior distribution such that the posterior distribution has theoretical optimalities (i.e. minimax optimality and adaptiveness), and develop an efficient MCMC algorithm. By analyzing several benchmark datasets, we illustrate that the proposed BNN performs well compared to other existing methods in the sense that it discovers well condensed DNN architectures with similar prediction accuracy and uncertainty quantification compared to large DNNs.
    A Barrier Certificate-based Simplex Architecture with Application to Microgrids. (arXiv:2202.09710v2 [eess.SY] UPDATED)
    We present Barrier Certificate-based Simplex (BC-Simplex), a new, provably correct design for runtime assurance of continuous dynamical systems. BC-Simplex is centered around the Simplex Control Architecture, which consists of a high-performance advanced controller which is not guaranteed to maintain safety of the plant, a verified-safe baseline controller, and a decision module that switches control of the plant between the two controllers to ensure safety without sacrificing performance. In BC-Simplex, Barrier certificates are used to prove that the baseline controller ensures safety. Furthermore, BC-Simplex features a new automated method for deriving, from the barrier certificate, the conditions for switching between the controllers. Our method is based on the Taylor expansion of the barrier certificate and yields computationally inexpensive switching conditions. We consider a significant application of BC-Simplex to a microgrid featuring an advanced controller in the form of a neural network trained using reinforcement learning. The microgrid is modeled in RTDS, an industry-standard high-fidelity, real-time power systems simulator. Our results demonstrate that BC-Simplex can automatically derive switching conditions for complex systems, the switching conditions are not overly conservative, and BC-Simplex ensures safety even in the presence of adversarial attacks on the neural controller.
    Batch Normalization Is Blind to the First and Second Derivatives of the Loss. (arXiv:2205.15146v2 [cs.LG] UPDATED)
    In this paper, we analyze the effects of the BN operation on the back-propagation of the first and second derivatives of the loss. Considering the Taylor series expansion of the loss function, we prove that the BN operation will block the influence of the first-order term and most of the influence of the second-order term of the loss. We also find that this problem is caused by the standardization phase of the BN operation. Experimental results have verified our theoretical conclusions, and we have found that the BN operation significantly affects feature representations in specific tasks, where losses of different samples share similar analytic formulas.
    ZOOpt: Toolbox for Derivative-Free Optimization. (arXiv:1801.00329v3 [cs.LG] UPDATED)
    Recent advances in derivative-free optimization allow efficient approximation of the global-optimal solutions of sophisticated functions, such as functions with many local optima, non-differentiable and non-continuous functions. This article describes the ZOOpt (Zeroth Order Optimization) toolbox, which provides efficient derivative-free solvers and is designed to be easy to use. ZOOpt provides single-machine parallel optimization on the basis of a Python core and multi-machine distributed optimization for time-consuming tasks by integrating with the Ray framework -- a well-known platform for building distributed applications. ZOOpt particularly focuses on optimization problems in machine learning, addressing high-dimensional and noisy problems such as hyper-parameter tuning and direct policy search. The toolbox is maintained as a ready-to-use tool for real-world machine learning tasks.
    Intrinsically-Motivated Reinforcement Learning: A Brief Introduction. (arXiv:2203.02298v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is one of the three basic paradigms of machine learning. It has demonstrated impressive performance in many complex tasks like Go and StarCraft, and is increasingly involved in smart manufacturing and autonomous driving. However, RL consistently suffers from the exploration-exploitation dilemma. In this paper, we investigated the problem of improving exploration in RL and introduced intrinsically-motivated RL. In sharp contrast to the classic exploration strategies, intrinsically-motivated RL utilizes the intrinsic learning motivation to provide sustainable exploration incentives. We carefully classified the existing intrinsic reward methods and analyzed their practical drawbacks. Moreover, we proposed a new intrinsic reward method via Rényi state entropy maximization, which overcomes the drawbacks of the preceding methods and provides powerful exploration incentives. Finally, extensive simulation demonstrated that the proposed module achieves superior performance with higher efficiency and robustness.
    Learning Disentangled Representations for Counterfactual Regression via Mutual Information Minimization. (arXiv:2206.01022v1 [cs.LG])
    Learning individual-level treatment effect is a fundamental problem in causal inference and has received increasing attention in many areas, especially in the user growth area which concerns many internet companies. Recently, disentangled representation learning methods that decompose covariates into three latent factors, including instrumental, confounding and adjustment factors, have witnessed great success in treatment effect estimation. However, it remains an open problem how to learn the underlying disentangled factors precisely. Specifically, previous methods fail to obtain independent disentangled factors, which is a necessary condition for identifying treatment effect. In this paper, we propose Disentangled Representations for Counterfactual Regression via Mutual Information Minimization (MIM-DRCFR), which uses a multi-task learning framework to share information when learning the latent factors and incorporates MI minimization learning criteria to ensure the independence of these factors. Extensive experiments including public benchmarks and real-world industrial user growth datasets demonstrate that our method performs much better than state-of-the-art methods.
    Super-resolving 2D stress tensor field conserving equilibrium constraints using physics informed U-Net. (arXiv:2206.01122v1 [cs.LG])
    In a finite element analysis, using a large number of grids is important to obtain accurate results, but is a resource-consuming task. Aiming at real-time simulation and optimization, it is desirable to obtain fine-grid analysis results with limited resources. This paper proposes a super-resolution method that predicts a stress tensor field at high resolution from low-resolution contour plots by utilizing a U-Net-based neural network called PI-UNet. In addition, the proposed model minimizes the residual of the equilibrium constraints so that it outputs a physically reasonable solution. The proposed network is trained with FEM results of simple shapes, and is validated with a complicated realistic shape to evaluate generalization capability. Although ESRGAN is a standard model for image super-resolution, the proposed U-Net-based model outperforms the ESRGAN model in the stress tensor prediction task.
    Meta-SysId: A Meta-Learning Approach for Simultaneous Identification and Prediction. (arXiv:2206.00694v1 [cs.LG])
    In this paper, we propose Meta-SysId, a meta-learning approach to model sets of systems that have behavior governed by common but unknown laws and that differentiate themselves by their context. Inspired by classical modeling-and-identification approaches, Meta-SysId learns to represent the common law through shared parameters and relies on online optimization to compute system-specific context. Compared to optimization-based meta-learning methods, the separation between class parameters and context variables reduces the computational burden while allowing batch computations and a simple training scheme. We test Meta-SysId on polynomial regression, time-series prediction, model-based control, and real-world traffic prediction domains, empirically finding it outperforms or is competitive with meta-learning baselines.
    Composing a surrogate observation operator for sequential data assimilation. (arXiv:2201.12514v3 [cs.LG] UPDATED)
    In data assimilation, state estimation is not straightforward when the observation operator is unknown. This study proposes a method for composing a surrogate operator when the true operator is unknown. A neural network is used to improve the surrogate model iteratively to decrease the difference between the observations and the results of the surrogate model. A twin experiment suggests that the proposed method outperforms approaches that tentatively use a specific operator throughout the data assimilation process.
    Dynamic MRI using Learned Transform-based Deep Tensor Low-Rank Network (DTLR-Net). (arXiv:2206.00850v1 [eess.IV])
    While the low-rank matrix prior has been exploited in dynamic MR image reconstruction with satisfying performance, low-rank tensor models have recently emerged as powerful alternative representations for three-dimensional dynamic MR datasets. In this paper, we introduce a model-based deep learning network that learns the tensor low-rank prior of cardiac dynamic MR images. Instead of representing the dynamic dataset as a low-rank tensor directly, we propose a learned transformation operator to exploit the tensor low-rank property in a transform domain. In particular, by generalizing the t-SVD tensor decomposition into a unitary transformed t-SVD, we define a transformed tensor nuclear norm (TTNN) to enforce the tensor low-rankness. The dynamic MRI reconstruction problem is thus formulated as a TTNN-regularized optimization problem. An iterative algorithm based on ADMM used to minimize the cost is unrolled into a deep network, where the transform is learned using convolutional neural networks (CNNs) to promote the reconstruction quality in the feature domain. Experimental results on cardiac cine MRI reconstruction demonstrate that the proposed framework is able to provide improved recovery results compared with the state-of-the-art algorithms.
    Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions. (arXiv:2206.01029v1 [math.OC])
    We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.
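    As a rough illustration of the setting in the abstract above, the following sketch runs mini-batch SGD with heavy-ball momentum on a small least squares problem. The function name, problem sizes, learning rate, and momentum value are illustrative assumptions; the paper's ICR-based parameter choices and Volterra analysis are not reproduced here.

```python
import numpy as np

def sgd_momentum_least_squares(A, b, lr, momentum, batch_size, n_steps, seed=0):
    """Mini-batch heavy-ball SGD on f(x) = (1/2n) * ||A x - b||^2.

    Hypothetical helper for illustration; hyperparameters are not the
    paper's recommended choices.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(n_steps):
        # Sample a mini-batch of rows without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        # Stochastic gradient of the least squares loss on the batch.
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        # Heavy-ball update: keep a decaying sum of past gradient steps.
        velocity = momentum * velocity - lr * grad
        x = x + velocity
    return x

# Small synthetic problem with noiseless targets (an assumption).
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
b = A @ x_true
x_hat = sgd_momentum_least_squares(
    A, b, lr=0.05, momentum=0.5, batch_size=32, n_steps=2000
)
error = np.linalg.norm(x_hat - x_true)
```

    Varying `batch_size` in this toy setup is one way to observe how the benefit of momentum depends on batch size, though the toy problem is far from the high-dimensional regime the paper analyzes.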
    Practical Adversarial Multivalid Conformal Prediction. (arXiv:2206.01067v1 [cs.LG])
    We give a simple, generic conformal prediction method for sequential prediction that achieves target empirical coverage guarantees against adversarially chosen data. It is computationally lightweight -- comparable to split conformal prediction -- but does not require having a held-out validation set, and so all data can be used for training models from which to derive a conformal score. It gives stronger than marginal coverage guarantees in two ways. First, it gives threshold calibrated prediction sets that have correct empirical coverage even conditional on the threshold used to form the prediction set from the conformal score. Second, the user can specify an arbitrary collection of subsets of the feature space -- possibly intersecting -- and the coverage guarantees also hold conditional on membership in each of these subsets. We call our algorithm MVP, short for MultiValid Prediction. We give both theory and an extensive set of empirical evaluations.
    Deep Optimal Transport for Domain Adaptation on SPD Manifolds. (arXiv:2201.05745v2 [cs.LG] UPDATED)
    The domain adaptation (DA) problem on symmetric positive definite (SPD) manifolds has attracted interest in the machine learning community because of the growing potential of SPD-matrix representations across many cross-domain scenarios. However, because the underlying space differs, previous experience with and solutions to the DA problem do not transfer directly to this new setting. This study addresses a specific DA problem: the marginal and conditional distributions differ between the source and target domains on SPD manifolds. We formalize this problem from an optimal transport perspective and derive an optimal transport framework on SPD manifolds for supervised learning. In addition, we propose a computational scheme under the optimal transport framework, Deep Optimal Transport (DOT), for general computation, using the generalized joint distribution adaptation approach and existing Riemannian-based network architectures on SPD manifolds. DOT is applied to a real-world scenario as an EEG-BCI classifier for cross-session motor-imagery classification from the calibration phase to the feedback phase. In the experiments, DOT exhibits a marked improvement in average accuracy in two highly non-stationary cross-session scenarios in EEG-BCI classification, indicating the validity of the proposed methodology.
    Weakly Supervised Representation Learning with Sparse Perturbations. (arXiv:2206.01101v1 [cs.LG])
    The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v3 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v2 [cs.LG] UPDATED)
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Counterfactual reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of several latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and help determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.
    Score-Based Generative Models Detect Manifolds. (arXiv:2206.01018v1 [stat.ML])
    Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.
    Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise. (arXiv:2206.01095v1 [math.OC])
    Stochastic first-order methods such as Stochastic Extragradient (SEG) or Stochastic Gradient Descent-Ascent (SGDA) for solving smooth minimax problems and, more generally, variational inequality problems (VIP) have been gaining a lot of attention in recent years due to the growing popularity of adversarial formulations in machine learning. However, while high-probability convergence bounds are known to reflect the actual behavior of stochastic methods more accurately, most convergence results are provided in expectation. Moreover, the only known high-probability complexity results have been derived under restrictive sub-Gaussian (light-tailed) noise and bounded domain assumptions [Juditsky et al., 2011]. In this work, we prove the first high-probability complexity results with logarithmic dependence on the confidence level for stochastic methods for solving monotone and structured non-monotone VIPs with non-sub-Gaussian (heavy-tailed) noise and unbounded domains. In the monotone case, our results match the best-known ones in the light-tails case [Juditsky et al., 2011], and are novel for structured non-monotone problems such as negative comonotone, quasi-strongly monotone, and/or star-cocoercive ones. We achieve these results by studying SEG and SGDA with clipping. In addition, we numerically validate that the gradient noise of many practical GAN formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
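    To make the idea of clipping concrete, here is a minimal sketch of SGDA with norm clipping on a toy two-dimensional minimax problem. The objective, the Student-t noise model, the step size, and the clipping level are all assumptions chosen for illustration; they are not the paper's experimental setup or its recommended parameters.

```python
import numpy as np

def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

def operator(z):
    """F(z) = (df/dx, -df/dy) for the toy objective
    f(x, y) = 0.5*x**2 + x*y - 0.5*y**2, whose saddle point is (0, 0)."""
    x, y = z
    return np.array([x + y, y - x])

def clipped_sgda(z0, step_size, lam, n_steps, seed=0):
    """SGDA where each noisy operator estimate is clipped before the step."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        # Student-t noise with 2 degrees of freedom has infinite variance,
        # used here as a stand-in for heavy-tailed gradient noise.
        noise = rng.standard_t(df=2, size=2)
        z = z - step_size * clip(operator(z) + noise, lam)
    return z

z_final = clipped_sgda([5.0, 5.0], step_size=0.05, lam=1.0, n_steps=2000)
```

    Because every update has norm at most `step_size * lam`, a single extreme noise sample cannot throw the iterate arbitrarily far, which is the intuition behind using clipping under heavy-tailed noise.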
    Auto-Lambda: Disentangling Dynamic Task Relationships. (arXiv:2202.03091v2 [cs.LG] UPDATED)
    Understanding the structure of multiple related tasks allows for multi-task learning to improve the generalisation ability of one or all of them. However, it usually requires training each pairwise combination of tasks together in order to capture task relationships, at an extremely high computational cost. In this work, we learn task relationships via an automated weighting framework, named Auto-Lambda. Unlike previous methods where task relationships are assumed to be fixed, Auto-Lambda is a gradient-based meta-learning framework which explores continuous, dynamic task relationships via task-specific weightings, and can optimise any choice of combination of tasks through the formulation of a meta-loss, where the validation loss automatically influences task weightings throughout training. We apply the proposed framework to both multi-task and auxiliary learning problems in computer vision and robotics, and show that Auto-Lambda achieves state-of-the-art performance, even when compared to optimisation strategies designed specifically for each problem and data domain. Finally, we observe that Auto-Lambda can discover interesting learning behaviors, leading to new insights in multi-task learning. Code is available at https://github.com/lorenmt/auto-lambda.
    Predictive Multiplicity in Probabilistic Classification. (arXiv:2206.01131v1 [cs.LG])
    For a prediction task, there may exist multiple models that perform almost equally well. This multiplicity complicates how we typically develop and deploy machine learning models. We study how multiplicity affects predictions -- i.e., predictive multiplicity -- in probabilistic classification. We introduce new measures for this setting and present optimization-based methods to compute these measures for convex empirical risk minimization problems like logistic regression. We apply our methodology to gain insight into why predictive multiplicity arises. We study the incidence and prevalence of predictive multiplicity in real-world risk assessment tasks. Our results emphasize the need to report multiplicity more widely.
    Progressive Purification for Instance-Dependent Partial Label Learning. (arXiv:2206.00830v1 [cs.LG])
    Partial label learning (PLL) aims to train multi-class classifiers from instances with partial labels (PLs): a PL for an instance is a set of candidate labels where a fixed but unknown candidate is the true label. In the last few years, the instance-independent generation process of PLs has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, whereas relatively less attention has been paid to the practical setting of instance-dependent PLs, namely, where the PL depends not only on the true label but also on the instance itself. In this paper, we propose a theoretically grounded and practically effective approach called PrOgressive Purification (POP) for instance-dependent PLL: in each epoch, POP updates the learning model while purifying each PL for the next epoch of model training by progressively removing false candidate labels. Theoretically, we prove that POP enlarges the region where the model is reliable appropriately fast, and eventually approximates the Bayes optimal classifier under mild assumptions; technically, POP is flexible with arbitrary losses and compatible with deep networks, so that previous advanced PLL losses can be embedded in it and the performance is often significantly improved.
    Uniqueness and Complexity of Inverse MDP Models. (arXiv:2206.01192v1 [cs.LG])
    What is the action sequence aa'a" that was likely responsible for reaching state s"' (from state s) in 3 steps? Addressing such questions is important in causal reasoning and in reinforcement learning. Inverse "MDP" models p(aa'a"|ss"') can be used to answer them. In the traditional "forward" view, transition "matrix" p(s'|sa) and policy π(a|s) uniquely determine "everything": the whole dynamics p(as'a's"a"...|s), and with it, the action-conditional state process p(s's"...|saa'a"), the multi-step inverse models p(aa'a"...|ss^i), etc. If the latter is our primary concern, a natural question, analogous to the forward case, is to what extent the 1-step inverse model p(a|ss') plus policy π(a|s) determines the multi-step inverse models or even the whole dynamics. In other words, can forward models be inferred from inverse models, or even be side-stepped? This work addresses this question and variations thereof, and also whether there are efficient decision/inference algorithms for this.
    APP: Anytime Progressive Pruning. (arXiv:2204.01640v2 [cs.LG] UPDATED)
    With the latest advances in deep learning, there has been a lot of focus on the online learning paradigm due to its relevance in practical settings. Although many methods have been investigated for optimal learning settings in scenarios where the data stream is continuous over time, sparse network training in such settings has often been overlooked. In this paper, we explore the problem of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA). We propose a novel way of progressive pruning, referred to as Anytime Progressive Pruning (APP); the proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training. Our method, for example, shows an improvement in accuracy of $\approx 7\%$ and a reduction in the generalization gap by $\approx 22\%$, while being $\approx 1/3$ the size of the dense baseline model in few-shot restricted imagenet training. We further observe interesting nonmonotonic transitions in the generalization gap in the high number of megabatches-based ALMA. The code and experiment dashboards can be accessed at https://github.com/landskape-ai/Progressive-Pruning and https://wandb.ai/landskape/APP, respectively.
    Machine Learning-based Lung and Colon Cancer Detection using Deep Feature Extraction and Ensemble Learning. (arXiv:2206.01088v1 [eess.IV])
    Cancer is a fatal disease caused by a combination of genetic diseases and a variety of biochemical abnormalities. Lung and colon cancer have emerged as two of the leading causes of death and disability in humans. The histopathological detection of such malignancies is usually the most important component in determining the best course of action. Early detection of the ailment on either front considerably decreases the likelihood of mortality. Machine learning and deep learning techniques can be utilized to speed up such cancer detection, allowing researchers to study a large number of patients in a much shorter amount of time and at a lower cost. In this research work, we introduced a hybrid ensemble feature extraction model to efficiently identify lung and colon cancer. It integrates deep feature extraction and ensemble learning with high-performance filtering for cancer image datasets. The model is evaluated on histopathological (LC25000) lung and colon datasets. According to the study findings, our hybrid model can detect lung, colon, and (lung and colon) cancer with accuracy rates of 99.05%, 100%, and 99.30%, respectively. The study's findings show that our proposed strategy outperforms existing models significantly. Thus, these models could be applicable in clinics to support the doctor in the diagnosis of cancers.
    Automated Reinforcement Learning (AutoRL): A Survey and Open Problems. (arXiv:2201.03916v2 [cs.LG] UPDATED)
    The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems and also limits its full potential. In many other areas of machine learning, AutoML has shown it is possible to automate such design choices and has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also additional challenges unique to RL that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey, we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail, and pose open problems that would be of interest to researchers going forward.
    Sequential Voting with Relational Box Fields for Active Object Detection. (arXiv:2110.11524v4 [cs.CV] UPDATED)
    A key component of understanding hand-object interactions is the ability to identify the active object -- the object that is being manipulated by the human hand. In order to accurately localize the active object, any method must reason using information encoded by each image pixel, such as whether it belongs to the hand, the object, or the background. To leverage each pixel as evidence to determine the bounding box of the active object, we propose a pixel-wise voting function. Our pixel-wise voting function takes an initial bounding box as input and produces an improved bounding box of the active object as output. The voting function is designed so that each pixel inside of the input bounding box votes for an improved bounding box, and the box with the majority vote is selected as the output. We call the collection of bounding boxes generated inside of the voting function, the Relational Box Field, as it characterizes a field of bounding boxes defined in relationship to the current bounding box. While our voting function is able to improve the bounding box of the active object, one round of voting is typically not enough to accurately localize the active object. Therefore, we repeatedly apply the voting function to sequentially improve the location of the bounding box. However, since it is known that repeatedly applying a one-step predictor (i.e., auto-regressive processing with our voting function) can cause a data distribution shift, we mitigate this issue using reinforcement learning (RL). We adopt standard RL to learn the voting function parameters and show that it provides a meaningful improvement over a standard supervised learning approach. We perform experiments on two large-scale datasets: 100DOH and MECCANO, improving AP50 performance by 8% and 30%, respectively, over the state of the art.
    Adaptive Local Neighborhood-based Neural Networks for MR Image Reconstruction from Undersampled Data. (arXiv:2206.00775v1 [eess.IV])
    Recent medical image reconstruction techniques focus on generating high-quality medical images suitable for clinical use at the lowest possible cost and with the fewest possible adverse effects on patients. Recent works have shown significant promise for reconstructing MR images from sparsely sampled k-space data using deep learning. In this work, we propose a technique that rapidly estimates deep neural networks directly at reconstruction time by fitting them on small adaptively estimated neighborhoods of a training set. In brief, our algorithm alternates between searching for neighbors in a data set that are similar to the test reconstruction, and training a local network on these neighbors followed by updating the test reconstruction. Because our reconstruction model is learned on a dataset that is structurally similar to the image being reconstructed rather than being fit on a large, diverse training set, it is more adaptive to new scans. It can also handle changes in training sets and flexible scan settings, while being relatively fast. Our approach, dubbed LONDN-MRI, was validated on the FastMRI multi-coil knee data set using deep unrolled reconstruction networks. Reconstructions were performed at four-fold and eight-fold undersampling of k-space with 1D variable-density random phase-encode undersampling masks. Our results demonstrate that our proposed locally-trained method produces higher-quality reconstructions compared to models trained globally on larger datasets.
    Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features. (arXiv:1911.05350v2 [stat.ML] UPDATED)
    Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive efficient large-scale learning algorithms. In this study, we consider solving a binary classification problem using random features and stochastic gradient descent. In recent research, an exponential convergence rate of the expected classification error under the strong low-noise condition has been shown. We extend these analyses to a random features setting, analyzing the error induced by the approximation of random features in terms of the distance between the generated hypothesis including population risk minimizers and empirical risk minimizers when using general Lipschitz loss functions, to show that an exponential convergence of the expected classification error is achieved even if random features approximation is applied. Additionally, we demonstrate that the convergence rate does not depend on the number of features and there is a significant computational benefit in using random features in classification problems because of the strong low-noise condition.
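    For readers unfamiliar with random features, the sketch below shows the standard random Fourier feature map for a Gaussian kernel and checks how closely inner products of the features match the exact kernel. The function name, bandwidth, and feature count are illustrative assumptions; the paper's SGD classifier and its convergence analysis are not reproduced.

```python
import numpy as np

def random_fourier_features(X, n_features, sigma, seed=0):
    """Map X (n_samples, d) to features whose inner products approximate
    the Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density N(0, I / sigma^2).
    W = rng.standard_normal((d, n_features)) / sigma
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))

Z = random_fourier_features(X, n_features=5000, sigma=1.0)
approx_kernel = Z @ Z.T

# Exact Gaussian kernel matrix with sigma = 1 for comparison.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
exact_kernel = np.exp(-sq_dists / 2.0)

max_abs_error = np.max(np.abs(approx_kernel - exact_kernel))
```

    A linear classifier trained by SGD on `Z` then stands in for a kernel classifier at a cost that scales with the number of features rather than the number of samples.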
    A Fair Comparison of Two Popular Flat Minima Optimizers: Stochastic Weight Averaging vs. Sharpness-Aware Minimization. (arXiv:2202.00661v3 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low loss neighborhoods, have been shown to improve upon stochastic and adaptive gradient-based optimizers for training neural networks. Two methods have received significant attention due to their impressive generalization performance and scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them. Previous work mainly evaluated SWA and SAM on different architectures and datasets. We fill this gap here by comparing the loss surfaces of the models trained with each method and through a broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover a number of surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
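    The two optimizers compared above can be summarized by their core update rules. The sketch below shows a running average of weight snapshots (the averaging step of SWA) and a single SAM step on a toy quadratic loss. Function names, the loss, and all hyperparameters are assumptions for illustration and do not reflect the paper's benchmark configuration.

```python
import numpy as np

def update_swa(swa_weights, new_weights, n_averaged):
    """Running mean of weight snapshots: the result averages
    n_averaged + 1 snapshots."""
    return swa_weights + (new_weights - swa_weights) / (n_averaged + 1)

def sam_step(w, grad_fn, lr, rho):
    """One SAM step: move to the approximate worst-case point within an
    L2 ball of radius rho, then descend using the gradient found there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * grad_fn(w + eps)

# SWA averaging over three hypothetical weight snapshots.
snapshots = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 0.0])]
swa = np.zeros(2)
for k, w in enumerate(snapshots):
    swa = update_swa(swa, w, k)

# SAM on the toy loss L(w) = 0.5 * w^T H w, whose gradient is H w.
H = np.diag([1.0, 10.0])
w_new = sam_step(np.array([1.0, 0.0]), lambda w: H @ w, lr=0.1, rho=0.1)
```

    In practice SWA averages snapshots collected late in training, while SAM changes every gradient step and needs two gradient evaluations per update.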
    Understanding Nesterov's Acceleration via Proximal Point Method. (arXiv:2005.08304v3 [math.OC] UPDATED)
    The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use the PPM to provide conceptually simple derivations along with convergence analyses of different versions of Nesterov's accelerated gradient method (AGM). The key observation is that AGM is a simple approximation of PPM, which results in an elementary derivation of the update equations and stepsizes of AGM. This view also leads to a transparent and conceptually simple analysis of AGM's convergence by using the analysis of PPM. The derivations also naturally extend to the strongly convex case. Ultimately, the results presented in this paper are of both didactic and conceptual value; they unify and explain existing variants of AGM while motivating other accelerated methods for practically relevant settings.
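    For background, the textbook form of the PPM update, and the way plain gradient descent arises from it, can be written as follows. The notation ($f$, $x_k$, step size $\lambda$) is assumed here, and the paper's specific derivation of AGM is not reproduced.

```latex
% Proximal point method for a convex function f with step size \lambda > 0:
x_{k+1} = \operatorname*{arg\,min}_{x} \left\{ f(x) + \frac{1}{2\lambda} \lVert x - x_k \rVert^2 \right\}

% For differentiable f, the optimality condition gives an implicit gradient step:
x_{k+1} = x_k - \lambda \nabla f(x_{k+1})

% Replacing f by its linearization at x_k,
%   f(x) \approx f(x_k) + \langle \nabla f(x_k), x - x_k \rangle,
% makes the subproblem solvable in closed form and yields gradient descent:
x_{k+1} = x_k - \lambda \nabla f(x_k)
```

    Gradient descent is thus the simplest approximation of PPM; the abstract's claim is that AGM can be understood as another such approximation.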
    Know Your Boundaries: The Necessity of Explicit Behavioral Cloning in Offline RL. (arXiv:2206.00695v1 [cs.LG])
    We introduce an offline reinforcement learning (RL) algorithm that explicitly clones a behavior policy to constrain value learning. In offline RL, it is often important to prevent a policy from selecting unobserved actions, since the consequence of these actions cannot be presumed without additional information about the environment. One straightforward way to implement such a constraint is to explicitly model a given data distribution via behavior cloning and directly force a policy not to select uncertain actions. However, many offline RL methods instantiate the constraint indirectly -- for example, pessimistic value estimation -- due to a concern about errors when modeling a potentially complex behavior policy. In this work, we argue that it is not only viable but beneficial to explicitly model the behavior policy for offline RL because the constraint can be realized in a stable way with the trained model. We first suggest a theoretical framework that allows us to incorporate behavior-cloned models into value-based offline RL methods, enjoying the strength of both explicit behavior cloning and value learning. Then, we propose a practical method utilizing a score-based generative model for behavior cloning. With the proposed method, we show state-of-the-art performance on several datasets within the D4RL and Robomimic benchmarks and achieve competitive performance across all datasets tested.
    Sequential Bayesian Neural Subnetwork Ensembles. (arXiv:2206.00794v1 [stat.ML])
    Deep neural network ensembles that appeal to model diversity have been used successfully to improve predictive performance and model robustness in several applications. Separately, it has recently been shown that sparse subnetworks of dense models can match the performance of their dense counterparts and increase their robustness while effectively decreasing the model complexity. However, most ensembling techniques require multiple parallel and costly evaluations and have been proposed primarily with deterministic models, whereas sparsity induction has been mostly done through ad-hoc pruning. We propose sequential ensembling of dynamic Bayesian neural subnetworks that systematically reduce model complexity through sparsity-inducing priors and generate diverse ensembles in a single forward pass of the model. The ensembling strategy consists of an exploration phase that finds high-performing regions of the parameter space and multiple exploitation phases that effectively exploit the compactness of the sparse model to quickly converge to different minima in the energy landscape corresponding to high-performing subnetworks yielding diverse ensembles. We empirically demonstrate that our proposed approach surpasses the baselines of the dense frequentist and Bayesian ensemble models in prediction accuracy, uncertainty estimation, and out-of-distribution (OoD) robustness on the CIFAR10 and CIFAR100 datasets and their out-of-distribution variants, CIFAR10-C and CIFAR100-C, induced by corruptions. Furthermore, we found that our approach produced the most diverse ensembles compared to the approaches with a single forward pass and even compared to the approaches with multiple forward passes in some cases.
    DASO: Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning. (arXiv:2106.05682v2 [cs.CV] UPDATED)
    The capability of the traditional semi-supervised learning (SSL) methods is far from real-world application due to severely biased pseudo-labels caused by (1) class imbalance and (2) class distribution mismatch between labeled and unlabeled data. This paper addresses such a relatively under-explored problem. First, we propose a general pseudo-labeling framework that class-adaptively blends the semantic pseudo-label from a similarity-based classifier with the linear one from the linear classifier, after making the observation that both types of pseudo-labels have complementary properties in terms of bias. We further introduce a novel semantic alignment loss to establish balanced feature representation to reduce the biased predictions from the classifier. We term the whole framework as Distribution-Aware Semantics-Oriented (DASO) Pseudo-label. We conduct extensive experiments in a wide range of imbalanced benchmarks: CIFAR10/100-LT, STL10-LT, and large-scale long-tailed Semi-Aves with open-set class, and demonstrate that the proposed DASO framework reliably improves SSL learners with unlabeled data especially when both (1) class imbalance and (2) distribution mismatch dominate.
    Fictitious play in zero-sum stochastic games. (arXiv:2010.04223v6 [cs.GT] UPDATED)
    We present a novel variant of fictitious play dynamics combining classical fictitious play with Q-learning for stochastic games and analyze its convergence properties in two-player zero-sum stochastic games. Our dynamics involves players forming beliefs on the opponent strategy and their own continuation payoff (Q-function), and playing a greedy best response by using the estimated continuation payoffs. Players update their beliefs from observations of opponent actions. A key property of the learning dynamics is that update of the beliefs on Q-functions occurs at a slower timescale than update of the beliefs on strategies. We show both in the model-based and model-free cases (without knowledge of player payoff functions and state transition probabilities), the beliefs on strategies converge to a stationary mixed Nash equilibrium of the zero-sum stochastic game.
    Incorporating Explicit Uncertainty Estimates into Deep Offline Reinforcement Learning. (arXiv:2206.01085v1 [cs.LG])
    Most theoretically motivated work in the offline reinforcement learning setting requires precise uncertainty estimates. This requirement restricts the algorithms derived in that work to the tabular and linear settings where such estimates exist. In this work, we develop a novel method for incorporating scalable uncertainty estimates into an offline reinforcement learning algorithm called deep-SPIBB that extends the SPIBB family of algorithms to environments with larger state and action spaces. We use recent innovations in uncertainty estimation from the deep learning community to get more scalable uncertainty estimates to plug into deep-SPIBB. While these uncertainty estimates do not allow for the same theoretical guarantees as in the tabular case, we argue that the SPIBB mechanism for incorporating uncertainty is more robust and flexible than pessimistic approaches that incorporate the uncertainty as a value function penalty. We bear this out empirically, showing that deep-SPIBB outperforms pessimism based approaches with access to the same uncertainty estimates and performs at least on par with a variety of other strong baselines across several environments and datasets.
    Temporal Knowledge Graph Forecasting with Neural ODE. (arXiv:2101.05151v3 [cs.LG] UPDATED)
    There has been an increasing interest in inferring future links on temporal knowledge graphs (KG). While links on temporal KGs vary continuously over time, the existing approaches model the temporal KGs in discrete state spaces. To this end, we propose a novel continuum model by extending the idea of neural ordinary differential equations (ODEs) to multi-relational graph convolutional networks. The proposed model preserves the continuous nature of dynamic multi-relational graph data and encodes both temporal and structural information into continuous-time dynamic embeddings. In addition, a novel graph transition layer is applied to capture the transitions on the dynamic graph, i.e., edge formation and dissolution. We perform extensive experiments on five benchmark datasets for temporal KG reasoning, showing our model's superior performance on the future link forecasting task.
    Faster Rates of Convergence to Stationary Points in Differentially Private Optimization. (arXiv:2206.00846v1 [cs.LG])
    We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,\delta)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $\alpha$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq \alpha$. We provide a new efficient algorithm that finds an $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{2/3}\big)$-stationary point in the finite-sum setting, where $n$ is the number of samples. This improves on the previous best rate of $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$. We also give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk. Our construction finds a $\tilde{O}\big(\frac{1}{n^{1/3}} + \big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$-stationary point of the population risk in time linear in $n$. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is $\tilde \Theta\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\big)$. Finally, we show that our methods can be used to provide dimension-independent rates of $O\big(\frac{1}{\sqrt{n}}+\min\big(\big[\frac{\sqrt{rank}}{n\varepsilon}\big]^{2/3},\frac{1}{(n\varepsilon)^{2/5}}\big)\big)$ on population stationarity for Generalized Linear Models (GLM), where $rank$ is the rank of the design matrix, which improves upon the previous best known rate.
    DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. (arXiv:2206.00927v1 [cs.LG])
    Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from their slow sampling as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying change-of-variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.
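    As a hedged illustration of what "analytically computes the linear part of the solution" can mean, consider a generic semi-linear ODE. The symbols $a$, $N$, $s$, and $\tau$ below are assumptions chosen for this sketch and are not taken from the abstract. The standard variation-of-constants formula handles the linear term in closed form and leaves only the nonlinear term under an integral:

```latex
\frac{\mathrm{d}x}{\mathrm{d}t} = a(t)\,x + N(x, t)
\quad\Longrightarrow\quad
x(t) = e^{\int_s^t a(r)\,\mathrm{d}r}\,x(s)
     + \int_s^t e^{\int_\tau^t a(r)\,\mathrm{d}r}\,N\big(x(\tau), \tau\big)\,\mathrm{d}\tau .
```

    In this generic form only the second integral, which involves the nonlinear term (the neural network, in the diffusion setting), needs numerical approximation. The exact parameterization used by DPM-Solver may differ from this sketch.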
    Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features. (arXiv:2206.01202v1 [cs.CV])
    Recent studies show that paddings in convolutional neural networks encode absolute position information which can negatively affect the model performance for certain tasks. However, existing metrics for quantifying the strength of positional information remain unreliable and frequently lead to erroneous results. To address this issue, we propose novel metrics for measuring (and visualizing) the encoded positional information. We formally define the encoded information as PPP (Position-information Pattern from Padding) and conduct a series of experiments to study its properties as well as its formation. The proposed metrics measure the presence of positional information more reliably than the existing metrics based on PosENet and a test in F-Conv. We also demonstrate that for any extant (and proposed) padding schemes, PPP is primarily a learning artifact and is less dependent on the characteristics of the underlying padding schemes.
    Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search. (arXiv:2206.00702v1 [cs.AI])
    Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to swiftly filter out unreachable subgoals, allowing the search to focus on feasible further subgoals. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control with the shorter ones. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik's Cube, and inequality proving benchmark INT, setting new state-of-the-art on INT.
    On the Global Convergence Rates of Softmax Policy Gradient Methods. (arXiv:2005.06392v3 [cs.LG] UPDATED)
    We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
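    A minimal sketch of softmax policy gradient with the true gradient, assuming the simplest tabular case: a one-state problem (a bandit) with known rewards. The function names, the step size `lr`, and the reward vector are hypothetical choices for illustration; this is not the paper's analysis, only the update rule it studies.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_pg_step(theta, rewards, lr):
    """One exact-gradient ascent step on the expected reward.

    For a softmax policy, d(expected reward)/d(theta_a) = pi_a * (r_a - V),
    where V is the expected reward under the current policy.
    """
    pi = softmax(theta)
    value = sum(p * r for p, r in zip(pi, rewards))
    return [t + lr * p * (r - value) for t, p, r in zip(theta, pi, rewards)]


def run(rewards, steps, lr=0.4):
    """Run `steps` updates from the uniform policy and return the final policy."""
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        theta = softmax_pg_step(theta, rewards, lr)
    return softmax(theta)


if __name__ == "__main__":
    rewards = [1.0, 0.5, 0.0]  # assumed rewards; arm 0 is optimal
    for steps in (10, 100, 1000):
        pi = run(rewards, steps)
        gap = max(rewards) - sum(p * r for p, r in zip(pi, rewards))
        print(steps, round(gap, 5))
```

    Printing the sub-optimality gap at increasing step counts gives a rough feel for how slowly the unregularized update approaches the optimal arm.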
    Core-periphery Models for Hypergraphs. (arXiv:2206.00783v1 [cs.SI])
    We introduce a random hypergraph model for core-periphery structure. By leveraging our model's sufficient statistics, we develop a novel statistical inference algorithm that is able to scale to large hypergraphs with runtime that is practically linear wrt. the number of nodes in the graph after a preprocessing step that is almost linear in the number of hyperedges, as well as a scalable sampling algorithm. Our inference algorithm is capable of learning embeddings that correspond to the reputation (rank) of a node within the hypergraph. We also give theoretical bounds on the size of the core of hypergraphs generated by our model. We experiment with hypergraph data that range to $\sim 10^5$ hyperedges mined from the Microsoft Academic Graph, Stack Exchange, and GitHub and show that our model outperforms baselines wrt. producing good fits.
    ORC: Network Group-based Knowledge Distillation using Online Role Change. (arXiv:2206.01186v1 [cs.LG])
    In knowledge distillation, since a single, omnipotent teacher network cannot solve all problems, multiple teacher-based knowledge distillations have been studied recently. However, sometimes their improvements are not as good as expected because some immature teachers may transfer false knowledge to the student. In this paper, to overcome this limitation and exploit the efficacy of the multiple networks, we divide the multiple networks into teacher and student groups, respectively. That is, the student group is a set of immature networks that require learning the teacher's knowledge, while the teacher group consists of the selected networks that have performed well. Furthermore, according to our online role change strategy, the top-ranked networks in the student group can be promoted to the teacher group at every iteration and vice versa. After training the teacher group using the error images of the student group to refine the teacher group's knowledge, we transfer the collective knowledge from the teacher group to the student group successfully. We verify the superiority of the proposed method on CIFAR-10 and CIFAR-100, which achieves high performance. We further show the generality of our method with various backbone architectures such as resnet, wrn, vgg, mobilenet, and shufflenet.
    Graph Autoencoders for Embedding Learning in Brain Networks and Major Depressive Disorder Identification. (arXiv:2107.12838v2 [q-bio.NC] UPDATED)
    Brain functional connectivity (FC) reveals biomarkers for identification of various neuropsychiatric disorders. Recent application of deep neural networks (DNNs) to connectome-based classification mostly relies on traditional convolutional neural networks using input connectivity matrices on a regular Euclidean grid. We propose a graph deep learning framework to incorporate the non-Euclidean information about graph structure for classifying functional magnetic resonance imaging (fMRI)-derived brain networks in major depressive disorder (MDD). We design a novel graph autoencoder (GAE) architecture based on the graph convolutional networks (GCNs) to embed the topological structure and node content of large-sized fMRI networks into low-dimensional latent representations. In network construction, we employ the Ledoit-Wolf (LDW) shrinkage method to estimate the high-dimensional FC metrics efficiently from fMRI data. We consider both supervised and unsupervised approaches for the graph embedding learning. The learned embeddings are then used as feature inputs for a deep fully-connected neural network (FCNN) to discriminate MDD from healthy controls. Evaluated on two resting-state fMRI (rs-fMRI) MDD datasets, results show that the proposed GAE-FCNN model significantly outperforms several state-of-the-art methods for brain connectome classification, achieving the best accuracy using the LDW-FC edges as node features. The graph embeddings of fMRI FC networks learned by the GAE also reveal apparent group differences between MDD and HC. Our new framework demonstrates feasibility of learning graph embeddings on brain networks to provide discriminative information for diagnosis of brain disorders.
    DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. (arXiv:2110.02711v5 [cs.CV] UPDATED)
    Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts. However, their applications to diverse real images are still difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulties in reconstructing images with novel poses, views, and highly variable contents compared to the training data, altering object identity, or producing unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Based on full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains and takes another step towards general application by manipulating images from a widely varying ImageNet dataset. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirmed robust and superior manipulation performance of our methods compared to the existing baselines. Code is available at https://github.com/gwang-kim/DiffusionCLIP.git.
    A Log-Linear Time Sequential Optimal Calibration Algorithm for Quantized Isotonic L2 Regression. (arXiv:2206.00744v1 [cs.LG])
    We study the sequential calibration of estimations in a quantized isotonic L2 regression setting. We start by showing that the optimal calibrated quantized estimations can be acquired from the traditional isotonic L2 regression solution. We modify the traditional PAVA algorithm to create calibrators for both batch and sequential optimization of the quantized isotonic regression problem. Our algorithm can update the optimal quantized monotone mapping for the samples observed so far in linear space and logarithmic time per new unordered sample.
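    For readers unfamiliar with the traditional PAVA (pool adjacent violators) algorithm mentioned above, the following is a minimal batch sketch of classical isotonic L2 regression. It is not the paper's modified sequential or quantized algorithm; the function name `pava` and the block representation are hypothetical choices for this illustration.

```python
def pava(y, w=None):
    """Pool Adjacent Violators: weighted least-squares non-decreasing fit to y."""
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of samples].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per input sample.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


if __name__ == "__main__":
    print(pava([1, 3, 2, 4]))  # the violating pair (3, 2) is pooled to its mean
```

    Each sample is pushed once and each merge removes a block, so this batch version runs in linear time on inputs that are already ordered by the covariate.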
    Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. (arXiv:2206.01078v1 [cs.LG])
    Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
    Query Processing on Tensor Computation Runtimes. (arXiv:2203.01877v2 [cs.DB] UPDATED)
    The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how databases can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10$\times$ over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9$\times$ speedup over CPU baselines.
    Improving Diffusion Models for Inverse Problems using Manifold Constraints. (arXiv:2206.00941v1 [cs.LG])
    Recently, diffusion models have been used to solve various inverse problems in an unsupervised manner with appropriate modifications to the sampling process. However, the current solvers, which recursively apply a reverse diffusion step followed by a measurement consistency step, often produce sub-optimal results. By studying the generative sampling path, here we show that current solvers throw the sample path off the data manifold, and hence the error accumulates. To address this, we propose an additional correction term inspired by the manifold constraint, which can be used synergistically with the previous solvers to make the iterations close to the manifold. The proposed manifold constraint is straightforward to implement within a few lines of code, yet boosts the performance by a surprisingly large margin. With extensive experiments, we show that our method is superior to the previous methods both theoretically and empirically, producing promising results in many applications such as image inpainting, colorization, and sparse-view computed tomography.
    Watch Out for the Safety-Threatening Actors: Proactively Mitigating Safety Hazards. (arXiv:2206.00886v1 [cs.RO])
    Despite the successful demonstration of autonomous vehicles (AVs), such as self-driving cars, ensuring AV safety remains a challenging task. Although some actors influence an AV's driving decisions more than others, current approaches pay equal attention to each actor on the road. An actor's influence on the AV's decision can be characterized in terms of its ability to decrease the number of safe navigational choices for the AV. In this work, we propose a safety threat indicator (STI) using counterfactual reasoning to estimate the importance of each actor on the road with respect to its influence on the AV's safety. We use this indicator to (i) characterize the existing real-world datasets to identify rare hazardous scenarios as well as the poor performance of existing controllers in such scenarios; and (ii) design an RL based safety mitigation controller to proactively mitigate the safety hazards those actors pose to the AV. Our approach reduces the accident rate for the state-of-the-art AV agent(s) in rare hazardous scenarios by more than 70%.
    Self-Consistency of the Fokker-Planck Equation. (arXiv:2206.00860v1 [cs.LG])
    The Fokker-Planck equation (FPE) is the partial differential equation that governs the density evolution of the It\^o process and is of great importance to the literature of statistical physics and machine learning. The FPE can be regarded as a continuity equation where the change of the density is completely determined by a time varying velocity field. Importantly, this velocity field also depends on the current density function. As a result, the ground-truth velocity field can be shown to be the solution of a fixed-point equation, a property that we call self-consistency. In this paper, we exploit this concept to design a potential function of the hypothesis velocity fields, and prove that, if such a function diminishes to zero during the training procedure, the trajectory of the densities generated by the hypothesis velocity fields converges to the solution of the FPE in the Wasserstein-2 sense. The proposed potential function is amenable to neural-network based parameterization as the stochastic gradient with respect to the parameter can be efficiently computed. Once a parameterized model, such as a Neural Ordinary Differential Equation, is trained, we can generate the entire trajectory to the FPE.
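    To make the continuity-equation view concrete, here is a short worked identity under a simplifying assumption that is not stated in the abstract: an It\^o process $\mathrm{d}X_t = b(X_t, t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$ with constant isotropic diffusion. The drift $b$ and density $\rho$ are illustrative symbols.

```latex
\partial_t \rho
  = -\nabla \cdot (\rho\, b) + \Delta \rho
  = -\nabla \cdot \big( \rho\, ( b - \nabla \log \rho ) \big)
  = -\nabla \cdot ( \rho\, v ),
\qquad v := b - \nabla \log \rho .
```

    The second equality uses $\Delta \rho = \nabla \cdot (\rho\, \nabla \log \rho)$. The velocity field $v$ depends on $\rho$ itself, which is the dependence that the fixed-point (self-consistency) property refers to; the paper's setting may be more general than this sketch.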
    $\alpha$NAS: Neural Architecture Search using Property Guided Synthesis. (arXiv:2205.03960v2 [cs.LG] UPDATED)
    In the past few years, neural architecture search (NAS) has become an increasingly important tool within the deep learning community. Despite the many recent successes of NAS, however, most existing approaches operate within highly structured design spaces, and hence explore only a small fraction of the full search space of neural architectures while also requiring significant manual effort from domain experts. In this work, we develop techniques that enable efficient NAS in a significantly larger design space. To accomplish this, we propose to perform NAS in an abstract search space of program properties. Our key insights are as follows: (1) the abstract search space is significantly smaller than the original search space, and (2) architectures with similar program properties also have similar performance; thus, we can search more efficiently in the abstract search space. To enable this approach, we also propose a novel efficient synthesis procedure, which accepts a set of promising program properties, and returns a satisfying neural architecture. We implement our approach, $\alpha$NAS, within an evolutionary framework, where the mutations are guided by the program properties. Starting with a ResNet-34 model, $\alpha$NAS produces a model with slightly improved accuracy on CIFAR-10 but 96% fewer parameters. On ImageNet, $\alpha$NAS is able to improve over Vision Transformer (30% fewer FLOPS and parameters), ResNet-50 (23% fewer FLOPS, 14% fewer parameters), and EfficientNet (7% fewer FLOPS and parameters) without any degradation in accuracy.
    NIPQ: Noise Injection Pseudo Quantization for Automated DNN Optimization. (arXiv:2206.00820v1 [cs.LG])
    The optimization of neural networks in terms of computation cost and memory footprint is crucial for their practical deployment on edge devices. In this work, we propose a novel quantization-aware training (QAT) scheme called noise injection pseudo quantization (NIPQ). NIPQ is implemented based on pseudo quantization noise (PQN) and has several advantages. First, both activation and weight can be quantized based on a unified framework. Second, the hyper-parameters of quantization (e.g., layer-wise bit-width and quantization interval) are automatically tuned. Third, after QAT, the network has robustness against quantization, thereby making it easier to deploy in practice. To validate the superiority of the proposed algorithm, we provide extensive analysis and conduct diverse experiments for various vision applications. Our comprehensive experiments validate the outstanding performance of the proposed algorithm in several aspects.
    Hyperspherical Consistency Regularization. (arXiv:2206.00845v1 [cs.LG])
    Recent advances in contrastive learning have inspired diverse applications across various semi-supervised fields. Jointly training supervised learning and unsupervised learning with a shared feature encoder has become a common scheme. Though it benefits from taking advantage of both feature-dependent information from self-supervised learning and label-dependent information from supervised learning, this scheme still suffers from bias of the classifier. In this work, we systematically explore the relationship between self-supervised learning and supervised learning, and study how self-supervised learning helps robust data-efficient deep learning. We propose hyperspherical consistency regularization (HCR), a simple yet effective plug-and-play method, to regularize the classifier using feature-dependent information and thus avoid bias from labels. Specifically, HCR first projects logits from the classifier and feature projections from the projection head on the respective hypersphere, then it enforces data points on hyperspheres to have similar structures by minimizing binary cross entropy of pairwise distances' similarity metrics. Extensive experiments on semi-supervised and weakly-supervised learning demonstrate the effectiveness of our method, by showing superior performance with HCR.
    Defense Against Gradient Leakage Attacks via Learning to Obscure Data. (arXiv:2206.00769v1 [cs.LG])
    Federated learning is considered as an effective privacy-preserving learning mechanism that separates the client's data and model training process. However, federated learning is still under the risk of privacy leakage because of the existence of attackers who deliberately conduct gradient leakage attacks to reconstruct the client data. Recently, popular strategies such as gradient perturbation methods and input encryption methods have been proposed to defend against gradient leakage attacks. Nevertheless, these defenses can either greatly sacrifice the model performance, or be evaded by more advanced attacks. In this paper, we propose a new defense method to protect the privacy of clients' data by learning to obscure data. Our defense method can generate synthetic samples that are totally distinct from the original samples, but they can also maximally preserve their predictive features and guarantee the model performance. Furthermore, our defense strategy makes the gradient leakage attack and its variants extremely difficult to reconstruct the client data. Through extensive experiments, we show that our proposed defense method obtains better privacy protection while preserving high accuracy compared with state-of-the-art methods.
    Cascaded Video Generation for Videos In-the-Wild. (arXiv:2206.00735v1 [cs.CV])
    Videos can be created by first outlining a global view of the scene and then adding local details. Inspired by this idea we propose a cascaded model for video generation which follows a coarse to fine approach. First our model generates a low resolution video, establishing the global scene structure, which is then refined by subsequent cascade levels operating at larger resolutions. We train each cascade level sequentially on partial views of the videos, which reduces the computational complexity of our model and makes it scalable to high-resolution videos with many frames. We empirically validate our approach on UCF101 and Kinetics-600, for which our model is competitive with the state-of-the-art. We further demonstrate the scaling capabilities of our model and train a three-level model on the BDD100K dataset which generates 256x256 pixels videos with 48 frames.
    Revisiting the General Identifiability Problem. (arXiv:2206.01081v1 [cs.LG])
    We revisit the problem of general identifiability originally introduced in [Lee et al., 2019] for causal inference and note that it is necessary to add a positivity assumption on the observational distribution to the original definition of the problem. We show that without such an assumption the rules of do-calculus and consequently the proposed algorithm in [Lee et al., 2019] are not sound. Moreover, adding the assumption will cause the completeness proof in [Lee et al., 2019] to fail. Under the positivity assumption, we present a new algorithm that is provably both sound and complete. A nice property of this new algorithm is that it establishes a connection between general identifiability and classical identifiability by Pearl [1995] through decomposing the general identifiability problem into a series of classical identifiability sub-problems.
    Federated Learning with a Sampling Algorithm under Isoperimetry. (arXiv:2206.00920v1 [cs.LG])
    Federated learning uses a set of techniques to efficiently distribute the training of a machine learning algorithm across several devices, which own the training data. These techniques critically rely on reducing the communication cost -- the main bottleneck -- between the devices and a central server. Federated learning algorithms usually take an optimization approach: they are algorithms for minimizing the training loss subject to communication (and other) constraints. In this work, we instead take a Bayesian approach for the training task, and propose a communication-efficient variant of the Langevin algorithm to sample a posteriori. The latter approach is more robust and provides more knowledge of the \textit{a posteriori} distribution than its optimization counterpart. We analyze our algorithm without assuming that the target distribution is strongly log-concave. Instead, we assume the weaker log Sobolev inequality, which allows for nonconvexity.
    On the Effectiveness of Knowledge Graph Embeddings: a Rule Mining Approach. (arXiv:2206.00983v1 [cs.LG])
    We study the effectiveness of Knowledge Graph Embeddings (KGE) for knowledge graph (KG) completion with rule mining. More specifically, we mine rules from KGs before and after they have been completed by a KGE to compare possible differences in the rules extracted. We apply this method to classical KGE approaches, in particular, TransE, DistMult and ComplEx. Our experiments indicate that there can be huge differences between the extracted rules, depending on the KGE approach for KG completion. In particular, after the TransE completion, several spurious rules were extracted.
    Causal Structure Learning: a Combinatorial Perspective. (arXiv:2206.01152v1 [stat.ME])
    In this review, we discuss approaches for learning causal structure from data, also called causal discovery. In particular, we focus on approaches for learning directed acyclic graphs (DAGs) and various generalizations which allow for some variables to be unobserved in the available data. We devote special attention to two fundamental combinatorial aspects of causal structure learning. First, we discuss the structure of the search space over causal graphs. Second, we discuss the structure of equivalence classes over causal graphs, i.e., sets of graphs which represent what can be learned from observational data alone, and how these equivalence classes can be refined by adding interventional data.
    Generating Sparse Counterfactual Explanations For Multivariate Time Series. (arXiv:2206.00931v1 [cs.LG])
    Since neural networks play an increasingly important role in critical sectors, explaining network predictions has become a key research topic. Counterfactual explanations can help to understand why classifier models decide for particular class assignments and, moreover, how the respective input samples would have to be modified such that the class prediction changes. Previous approaches mainly focus on image and tabular data. In this work we propose SPARCE, a generative adversarial network (GAN) architecture that generates SPARse Counterfactual Explanations for multivariate time series. Our approach provides a custom sparsity layer and regularizes the counterfactual loss function in terms of similarity, sparsity, and smoothness of trajectories. We evaluate our approach on real-world human motion datasets as well as a synthetic time series interpretability benchmark. Although we make significantly sparser modifications than other approaches, we achieve comparable or better performance on all metrics. Moreover, we demonstrate that our approach predominantly modifies salient time steps and features, leaving non-salient inputs untouched.
    DPar2: Fast and Scalable PARAFAC2 Decomposition for Irregular Dense Tensors. (arXiv:2203.12798v2 [cs.LG] UPDATED)
    Given an irregular dense tensor, how can we efficiently analyze it? An irregular tensor is a collection of matrices whose columns have the same size and rows have different sizes from each other. PARAFAC2 decomposition is a fundamental tool to deal with an irregular tensor in applications including phenotype discovery and trend analysis. Although several PARAFAC2 decomposition methods exist, their efficiency is limited for irregular dense tensors due to the expensive computations involved with the tensor. In this paper, we propose DPar2, a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. DPar2 achieves high efficiency by effectively compressing each slice matrix of a given irregular tensor, careful reordering of computations with the compression results, and exploiting the irregularity of the tensor. Extensive experiments show that DPar2 is up to 6.0x faster than competitors on real-world irregular tensors while achieving comparable accuracy. In addition, DPar2 is scalable with respect to the tensor size and target rank.
    Dataset Distillation using Neural Feature Regression. (arXiv:2206.00719v1 [cs.LG])
    Dataset distillation aims to learn a small synthetic dataset that preserves most of the information from the original dataset. Dataset distillation can be formulated as a bi-level meta-learning problem where the outer loop optimizes the meta-dataset and the inner loop trains a model on the distilled data. Meta-gradient computation is one of the key challenges in this formulation, as differentiating through the inner loop learning procedure introduces significant computation and memory costs. In this paper, we address these challenges using neural Feature Regression with Pooling (FRePo), achieving the state-of-the-art performance with an order of magnitude less memory requirement and two orders of magnitude faster training than previous methods. The proposed algorithm is analogous to truncated backpropagation through time with a pool of models to alleviate various types of overfitting in dataset distillation. FRePo significantly outperforms the previous methods on CIFAR100, Tiny ImageNet, and ImageNet-1K. Furthermore, we show that high-quality distilled data can greatly improve various downstream applications, such as continual learning and membership inference defense.
    Federated Learning under Distributed Concept Drift. (arXiv:2206.00799v1 [cs.LG])
    Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multi-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.
    Learning code summarization from a small and local dataset. (arXiv:2206.00804v1 [cs.SE])
    Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary and other phenomena vary substantially from project to project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python.
    Collaboration Equilibrium in Federated Learning. (arXiv:2108.07926v3 [cs.LG] UPDATED)
    Federated learning (FL) refers to the paradigm of learning models over a collaborative research network involving multiple clients without sacrificing privacy. Recently, there have been rising concerns about the distributional discrepancies across different clients, which could even cause counterproductive consequences when collaborating with others. Since collaborating with all clients does not necessarily achieve the best performance, in this paper we study a rational collaboration called "collaboration equilibrium" (CE), where smaller collaboration coalitions are formed. Each client collaborates with certain members who maximally improve the model learning and isolates the others who make little contribution. We propose the concept of a benefit graph, which describes how each client can benefit from collaborating with other clients, and advance a Pareto optimization approach to identify the optimal collaborators. Then we theoretically prove that we can reach a CE from the benefit graph through an iterative graph operation. Our framework provides a new way of setting up collaborations in a research network. Experiments on both synthetic and real world data sets are provided to demonstrate the effectiveness of our method.
    Residual Multiplicative Filter Networks for Multiscale Reconstruction. (arXiv:2206.00746v1 [cs.CV])
    Coordinate networks like Multiplicative Filter Networks (MFNs) and BACON offer some control over the frequency spectrum used to represent continuous signals such as images or 3D volumes. Yet, they are not readily applicable to problems for which coarse-to-fine estimation is required, including various inverse problems in which coarse-to-fine optimization plays a key role in avoiding poor local minima. We introduce a new coordinate network architecture and training scheme that enables coarse-to-fine optimization with fine-grained control over the frequency support of learned reconstructions. This is achieved with two key innovations. First, we incorporate skip connections so that structure at one scale is preserved when fitting finer-scale structure. Second, we propose a novel initialization scheme to provide control over the model frequency spectrum at each stage of optimization. We demonstrate how these modifications enable multiscale optimization for coarse-to-fine fitting to natural images. We then evaluate our model on synthetically generated datasets for the problem of single-particle cryo-EM reconstruction. We learn high resolution multiscale structures, on par with the state of the art.
    Availability Attacks Create Shortcuts. (arXiv:2111.00898v2 [cs.LG] UPDATED)
    Availability attacks, which poison the training data with imperceptible perturbations, can make the data \emph{not exploitable} by machine learning algorithms so as to prevent unauthorized use of data. In this work, we investigate why these perturbations work in principle. We are the first to unveil an important population property of the perturbations of these attacks: they are almost \textbf{linearly separable} when assigned with the target labels of the corresponding samples, which hence can work as \emph{shortcuts} for the learning objective. We further verify that linear separability is indeed the workhorse for availability attacks. We synthesize linearly-separable perturbations as attacks and show that they are as powerful as the deliberately crafted attacks. Moreover, such synthetic perturbations are much easier to generate. For example, previous attacks need dozens of hours to generate perturbations for ImageNet while our algorithm only needs several seconds. Our finding also suggests that the \emph{shortcut learning} is more widely present than previously believed as deep models would rely on shortcuts even if they are of an imperceptible scale and mixed together with the normal features. Our source code is published at \url{https://github.com/dayu11/Availability-Attacks-Create-Shortcuts}.
    Shortest Path Networks for Graph Property Prediction. (arXiv:2206.01003v1 [cs.LG])
    Most graph neural network models rely on a particular message passing paradigm, where the idea is to iteratively propagate node representations of a graph to each node in the direct neighborhood. While very prominent, this paradigm leads to information propagation bottlenecks, as information is repeatedly compressed at intermediary node representations, which causes loss of information, making it practically impossible to gather meaningful signals from distant nodes. To address this issue, we propose shortest path message passing neural networks, where the node representations of a graph are propagated to each node in the shortest path neighborhoods. In this setting, nodes can directly communicate between each other even if they are not neighbors, breaking the information bottleneck and hence leading to more adequately learned representations. Theoretically, our framework generalizes message passing neural networks, resulting in provably more expressive models. Empirically, we verify the capacity of a basic model of this framework on dedicated synthetic experiments, and on real-world graph classification and regression benchmarks, obtaining several state-of-the-art results.
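    To make the notion of a shortest path neighborhood concrete, here is a small sketch that groups the nodes of an unweighted graph by their shortest-path distance from a source node using breadth-first search. The adjacency-dict format, the function name, and the `max_dist` cutoff are assumptions for illustration; the paper's actual aggregation over these neighborhoods is not reproduced here.

```python
from collections import deque


def shortest_path_neighborhoods(adj, source, max_dist):
    """Group nodes by shortest-path distance from source, for distances 1..max_dist."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the cutoff
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    shells = {d: [] for d in range(1, max_dist + 1)}
    for node, d in dist.items():
        if d > 0:
            shells[d].append(node)
    return {d: sorted(nodes) for d, nodes in shells.items()}


# Hypothetical example: a path graph 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
shells = shortest_path_neighborhoods(path_graph, 0, 2)
print(shells)
```

    In a standard message passing layer only the distance-1 shell would feed into the update of node 0; the shortest path variant described above would also draw on the farther shells directly.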
    The Phenomenon of Policy Churn. (arXiv:2206.00730v1 [cs.LG])
    We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
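    One simple way to quantify the churn described above, sketched under the assumption of tabular action values (the paper studies deep RL agents, and its exact metric may differ), is the fraction of states whose greedy action differs between two snapshots of the value function. The function names and the toy Q-tables are hypothetical.

```python
def greedy_actions(q_values):
    """Index of the highest-valued action in each state (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_values]


def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action changed between two snapshots."""
    before = greedy_actions(q_before)
    after = greedy_actions(q_after)
    changed = sum(a != b for a, b in zip(before, after))
    return changed / len(before)


# Hypothetical Q-tables for 4 states and 2 actions, before and after an update.
q_before = [[1.0, 0.5], [0.2, 0.9], [0.3, 0.1], [0.0, 1.0]]
q_after = [[0.4, 0.6], [0.1, 0.8], [0.3, 0.2], [0.5, 0.2]]
churn = policy_churn(q_before, q_after)
print(churn)
```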
    Towards real-world navigation with deep differentiable planners. (arXiv:2108.05713v2 [cs.RO] UPDATED)
    We train embodied neural networks to plan and navigate unseen complex 3D environments, emphasising real-world deployment. Rather than requiring prior knowledge of the agent or environment, the planner learns to model the state transitions and rewards. To avoid the potentially hazardous trial-and-error of reinforcement learning, we focus on differentiable planners such as Value Iteration Networks (VIN), which are trained offline from safe expert demonstrations. Although they work well in small simulations, we address two major limitations that hinder their deployment. First, we observed that current differentiable planners struggle to plan long-term in environments with a high branching complexity. While they should ideally learn to assign low rewards to obstacles to avoid collisions, we posit that the constraints imposed on the network are not strong enough to guarantee the network to learn sufficiently large penalties for every possible collision. We thus impose a structural constraint on the value iteration, which explicitly learns to model any impossible actions. Secondly, we extend the model to work with a limited perspective camera under translation and rotation, which is crucial for real robot deployment. Many VIN-like planners assume a 360 degrees or overhead view without rotation. In contrast, our method uses a memory-efficient lattice map to aggregate CNN embeddings of partial observations, and models the rotational dynamics explicitly using a 3D state-space grid (translation and rotation). Our proposals significantly improve semantic navigation and exploration on several 2D and 3D environments, succeeding in settings that are otherwise challenging for this class of methods. As far as we know, we are the first to successfully perform differentiable planning on the difficult Active Vision Dataset, consisting of real images captured from a robot.
    Offline Reinforcement Learning with Differential Privacy. (arXiv:2206.00810v1 [cs.LG])
    The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.
    Approximate Network Motif Mining Via Graph Learning. (arXiv:2206.01008v1 [cs.LG])
    Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many graph datasets. However, the high computational complexity of identifying motif sets in arbitrary datasets (motif mining) has limited their use in many real-world datasets. By automatically leveraging statistical properties of datasets, machine learning approaches have shown promise in several tasks with combinatorial complexity and are therefore a promising candidate for network motif mining. In this work we seek to facilitate the development of machine learning approaches aimed at motif mining. We propose a formulation of the motif mining problem as a node labelling task. In addition, we build benchmark datasets and evaluation metrics which test the ability of models to capture different aspects of motif discovery such as motif number, size, topology, and scarcity. Next, we propose MotiFiesta, a first attempt at solving this problem in a fully differentiable manner with promising results on challenging baselines. Finally, we demonstrate through MotiFiesta that this learning setting can be applied simultaneously to general-purpose data mining and interpretable feature extraction for graph classification tasks.
    Training privacy-preserving video analytics pipelines by suppressing features that reveal information about private attributes. (arXiv:2203.02635v2 [cs.CV] UPDATED)
    Deep neural networks are increasingly deployed for scene analytics, including to evaluate the attention and reaction of people exposed to out-of-home advertisements. However, the features extracted by a deep neural network that was trained to predict a specific, consensual attribute (e.g. emotion) may also encode and thus reveal information about private, protected attributes (e.g. age or gender). In this work, we focus on such leakage of private information at inference time. We consider an adversary with access to the features extracted by the layers of a deployed neural network and use these features to predict private attributes. To prevent the success of such an attack, we modify the training of the network using a confusion loss that encourages the extraction of features that make it difficult for the adversary to accurately predict private attributes. We validate this training approach on image-based tasks using a publicly available dataset. Results show that, compared to the original network, the proposed PrivateNet can reduce the leakage of private information of a state-of-the-art emotion recognition classifier by 2.88% for gender and by 13.06% for age group, with a minimal effect on task accuracy.
    SolarGAN: Synthetic Annual Solar Irradiance Time Series on Urban Building Facades via Deep Generative Networks. (arXiv:2206.00747v1 [cs.LG])
    Building Integrated Photovoltaics (BIPV) is a promising technology to decarbonize urban energy systems via harnessing solar energy available on building envelopes. While methods to assess solar irradiation, especially on rooftops, are well established, the assessment on building facades usually involves a higher effort due to more complex urban features and obstructions. The drawback of existing physics-based simulation programs is that they require significant manual modelling effort and computing time for generating time resolved deterministic results. Yet, solar irradiation is highly intermittent and representing its inherent uncertainty may be required for designing robust BIPV energy systems. Targeting on these drawbacks, this paper proposes a data-driven model based on Deep Generative Networks (DGN) to efficiently generate high-fidelity stochastic ensembles of annual hourly solar irradiance time series on building facades with uncompromised spatiotemporal resolution at the urban scale. The only input required is easily obtainable, simple fisheye images as categorical shading masks captured from 3D models. In principle, even actual photographs of urban contexts can be utilized, given they are semantically segmented. Our validations exemplify the high fidelity of the generated time series when compared to the physics-based simulator. To demonstrate the model's relevance for urban energy planning, we showcase its potential for generative design by parametrically altering characteristic features of the urban environment and producing corresponding time series on building facades under different climatic contexts in real-time.
    Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport. (arXiv:2006.12938v2 [cs.LG] UPDATED)
    The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a latent representation invariant between source and target domains, we exploit the diversity of source distributions by tuning their weights to the target task at hand. Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real-life datasets.
    RoCourseNet: Distributionally Robust Training of a Prediction Aware Recourse Model. (arXiv:2206.00700v1 [cs.LG])
    Counterfactual (CF) explanations for machine learning (ML) models are preferred by end-users, as they explain the predictions of ML models by providing a recourse case to individuals who are adversely impacted by predicted outcomes. Existing CF explanation methods generate recourses under the assumption that the underlying target ML model remains stationary over time. However, due to commonly occurring distributional shifts in training data, ML models constantly get updated in practice, which might render previously generated recourses invalid and diminish end-users' trust in our algorithmic framework. To address this problem, we propose RoCourseNet, a training framework that jointly optimizes for predictions and robust recourses to future data shifts. We have three main contributions: (i) We propose a novel virtual data shift (VDS) algorithm to find worst-case shifted ML models by explicitly considering the worst-case data shift in the training dataset. (ii) We leverage adversarial training to solve a novel tri-level optimization problem inside RoCourseNet, which simultaneously generates predictions and corresponding robust recourses. (iii) Finally, we evaluate RoCourseNet's performance on three real-world datasets and show that RoCourseNet outperforms state-of-the-art baselines by 10% in generating robust CF explanations.
    Invertible Neural Networks for Graph Prediction. (arXiv:2206.01163v1 [stat.ML])
    In this work, we address conditional generation using deep invertible neural networks. This is a type of problem where one aims to infer the most probable inputs $X$ given outcomes $Y$. We call our method invertible graph neural network (iGNN) due to the primary focus on generating node features on graph data. A notable feature of our proposed method is that during network training, we revise the typically-used loss objective in normalizing flow and consider Wasserstein-2 regularization to facilitate the training process. Algorithmically, we adopt an end-to-end training approach since our objective is to address prediction and generation in the forward and backward processes at once through a single model. Theoretically, we characterize the conditions for identifiability of a true mapping, the existence and invertibility of the mapping, and the expressiveness of iGNN in learning the mapping. Experimentally, we verify the performance of iGNN on both simulated and real datasets. We demonstrate through extensive numerical experiments that iGNN shows clear improvement over competing conditional generation benchmarks on high-dimensional and/or non-convex data.
    Regularized Nonlinear Regression for Simultaneously Selecting and Estimating Key Model Parameters. (arXiv:2104.11426v2 [stat.ME] UPDATED)
    In system identification, estimating parameters of a model using limited observations results in poor identifiability. To cope with this issue, we propose a new method to simultaneously select and estimate sensitive parameters as key model parameters and fix the remaining parameters to a set of typical values. Our method is formulated as a nonlinear least squares estimator with L1-regularization on the deviation of parameters from a set of typical values. First, we provide consistency and oracle properties of the proposed estimator as a theoretical foundation. Second, we provide a novel approach based on Levenberg-Marquardt optimization to numerically find the solution to the formulated problem. Third, to show the effectiveness, we present an application identifying a biomechanical parametric model of a head position tracking task for 10 human subjects from limited data. In a simulation study, the variances of estimated parameters are decreased by 96.1% as compared to that of the estimated parameters without L1-regularization. In an experimental study, our method improves the model interpretation by reducing the number of parameters to be estimated while maintaining variance accounted for (VAF) at above 82.5%. Moreover, the variances of estimated parameters are reduced by 71.1% as compared to that of the estimated parameters without L1-regularization. Our method is 54 times faster than the standard simplex-based optimization to solve the regularized nonlinear regression.
    Why Did This Model Forecast This Future? Closed-Form Temporal Saliency Towards Causal Explanations of Probabilistic Forecasts. (arXiv:2206.00679v1 [cs.LG])
    Forecasting tasks surrounding the dynamics of low-level human behavior are of significance to multiple research domains. In such settings, methods for explaining specific forecasts can enable domain experts to gain insights into the predictive relationships between behaviors. In this work, we introduce and address the following question: given a probabilistic forecasting model how can we identify observed windows that the model considers salient when making its forecasts? We build upon a general definition of information-theoretic saliency grounded in human perception and extend it to forecasting settings by leveraging a crucial attribute of the domain: a single observation can result in multiple valid futures. We propose to express the saliency of an observed window in terms of the differential entropy of the resulting predicted future distribution. In contrast to existing methods that either require explicit training of the saliency mechanism or access to the internal states of the forecasting model, we obtain a closed-form solution for the saliency map for commonly used density functions in probabilistic forecasting. We empirically demonstrate how our framework can recover salient observed windows from head pose features for the sample task of speaking-turn forecasting using a synthesized conversation dataset.
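    The closed-form character of this approach rests on the fact that common forecast densities have an analytic differential entropy. As a hedged illustration, the sketch below computes the entropy of a Gaussian with diagonal covariance and compares two hypothetical observed windows; the window names and variances are invented, and how entropy is mapped to a saliency score is defined in the paper, not here.

```python
import math


def gaussian_entropy(variances):
    """Differential entropy in nats of a Gaussian with diagonal covariance."""
    d = len(variances)
    log_det = sum(math.log(v) for v in variances)
    return 0.5 * (d * math.log(2.0 * math.pi * math.e) + log_det)


# Hypothetical per-dimension predictive variances produced by a forecasting
# model when conditioned on two different observed windows.
predicted_variances = {
    "window_a": [1.0, 1.0],
    "window_b": [0.25, 0.25],
}
entropies = {name: gaussian_entropy(v) for name, v in predicted_variances.items()}
for name, h in entropies.items():
    print(f"{name}: {h:.4f} nats")
```

    Because the entropy depends only on the predicted distribution's parameters, no access to the forecasting model's internal states is needed to evaluate it.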
    Walk for Learning: A Random Walk Approach for Federated Learning from Heterogeneous Data. (arXiv:2206.00737v1 [cs.LG])
    We consider the problem of a Parameter Server (PS) that wishes to learn a model that fits data distributed on the nodes of a graph. We focus on Federated Learning (FL) as a canonical application. One of the main challenges of FL is the communication bottleneck between the nodes and the parameter server. A popular solution in the literature is to allow each node to do several local updates on the model in each iteration before sending it back to the PS. While this mitigates the communication bottleneck, the statistical heterogeneity of the data owned by the different nodes has proven to delay convergence and bias the model. In this work, we study random walk (RW) learning algorithms for tackling the communication and data heterogeneity problems. The main idea is to leverage available direct connections among the nodes themselves, which are typically "cheaper" than the communication to the PS. In a random walk, the model is thought of as a "baton" that is passed from a node to one of its neighbors after being updated in each iteration. The challenge in designing the RW is the data heterogeneity and the uncertainty about the data distributions. Ideally, we would want to visit more often nodes that hold more informative data. We cast this problem as a sleeping multi-armed bandit (MAB) to design a near-optimal node sampling strategy that achieves variance-reduced gradient estimates and approaches sub-linearly the optimal sampling strategy. Based on this framework, we present an adaptive random walk learning algorithm. We provide theoretical guarantees on its convergence. Our numerical results validate our theoretical findings and show that our algorithm outperforms existing random walk algorithms.
    A Communication-efficient Algorithm with Linear Convergence for Federated Minimax Learning. (arXiv:2206.01132v1 [cs.LG])
    In this paper, we study a large-scale multi-agent minimax optimization problem, which models many interesting applications in statistical learning and game theory, including Generative Adversarial Networks (GANs). The overall objective is a sum of agents' private local objective functions. We first analyze an important special case, empirical minimax problem, where the overall objective approximates a true population minimax risk by statistical samples. We provide generalization bounds for learning with this objective through Rademacher complexity analysis. Then, we focus on the federated setting, where agents can perform local computation and communicate with a central server. Most existing federated minimax algorithms either require communication per iteration or lack performance guarantees with the exception of Local Stochastic Gradient Descent Ascent (SGDA), a multiple-local-update descent ascent algorithm which guarantees convergence under a diminishing stepsize. By analyzing Local SGDA under the ideal condition of no gradient noise, we show that generally it cannot guarantee exact convergence with constant stepsizes and thus suffers from slow rates of convergence. To tackle this issue, we propose FedGDA-GT, an improved Federated (Fed) Gradient Descent Ascent (GDA) method based on Gradient Tracking (GT). When local objectives are Lipschitz smooth and strongly-convex-strongly-concave, we prove that FedGDA-GT converges linearly with a constant stepsize to global $\epsilon$-approximation solution with $\mathcal{O}(\log (1/\epsilon))$ rounds of communication, which matches the time complexity of centralized GDA method. Finally, we numerically show that FedGDA-GT outperforms Local SGDA.
    Split-kl and PAC-Bayes-split-kl Inequalities. (arXiv:2206.00706v1 [stat.ML])
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality combines the combinatorial power of the kl inequality with the ability to exploit low variance. While for Bernoulli random variables the kl inequality is tighter than the Empirical Bernstein, for random variables taking values inside a bounded interval and having low variance the Empirical Bernstein inequality is tighter than the kl. The proposed split-kl inequality yields the best of both worlds. We discuss an application of the split-kl inequality to bounding excess losses. We also derive a PAC-Bayes-split-kl inequality and use a synthetic example and several UCI datasets to compare it with the PAC-Bayes-kl, PAC-Bayes Empirical Bernstein, PAC-Bayes Unexpected Bernstein, and PAC-Bayes Empirical Bennett inequalities.
    Leveraging Systematic Knowledge of 2D Transformations. (arXiv:2206.00893v1 [cs.CV])
    Existing deep learning models suffer from an out-of-distribution (o.o.d.) performance drop in computer vision tasks. In comparison, humans have a remarkable ability to interpret images, even if the scenes in the images are rare, thanks to the systematicity of acquired knowledge. This work focuses on 1) the acquisition of systematic knowledge of 2D transformations, and 2) architectural components that can leverage the learned knowledge in image classification tasks in an o.o.d. setting. With a new training methodology based on synthetic datasets that are constructed under the causal framework, the deep neural networks acquire knowledge from semantically different domains (e.g. even from noise), and exhibit a certain level of systematicity in parameter estimation experiments. Based on this, a novel architecture is devised consisting of a classifier, an estimator and an identifier (abbreviated as "CED"). By emulating the "hypothesis-verification" process in human visual perception, CED improves the classification accuracy significantly on test sets under covariate shift.
    Dataset Condensation via Efficient Synthetic-Data Parameterization. (arXiv:2205.14959v2 [cs.LG] UPDATED)
    The great success of machine learning with massive amounts of data comes at a price of huge computation costs and storage for training and tuning. Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset. However, the existing approaches have fundamental limitations in optimization due to the limited representability of synthetic datasets without considering any data regularity characteristics. To this end, we propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity. We further analyze the shortcomings of the existing gradient matching-based condensation methods and develop an effective optimization technique for improving the condensation of training data information. We propose a unified algorithm that drastically improves the quality of condensed data against the current state-of-the-art on CIFAR-10, ImageNet, and Speech Commands.
    Neural Decoding with Optimization of Node Activations. (arXiv:2206.00786v1 [cs.IT])
    The problem of maximum likelihood decoding with a neural decoder for error-correcting codes is considered. It is shown that the neural decoder can be improved with two novel loss terms on the node's activations. The first loss term imposes a sparse constraint on the node's activations, whereas the second loss term tries to mimic the node's activations from a teacher decoder which has better performance. The proposed method has the same run time complexity and model size as the neural Belief Propagation decoder, while improving the decoding performance by up to $1.1$ dB on BCH codes.
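    The two loss terms described in the abstract above can be sketched as follows. The abstract does not give their exact form, so this sketch assumes an L1 penalty for the sparsity term and a mean squared error for the teacher-mimicking term. The function name `activation_losses` and both weights are hypothetical.

```python
def activation_losses(student, teacher, sparse_weight=0.1, mimic_weight=1.0):
    """Combine two auxiliary loss terms on node activations.

    student, teacher: equal-length lists of activation values.
    The L1 and MSE forms below are assumptions made for illustration;
    the abstract only states that one term is sparse and one mimics a teacher.
    """
    n = len(student)
    # Sparsity term: mean absolute activation of the student decoder.
    sparsity = sum(abs(a) for a in student) / n
    # Mimic term: mean squared gap between student and teacher activations.
    mimic = sum((s - t) ** 2 for s, t in zip(student, teacher)) / n
    return sparse_weight * sparsity + mimic_weight * mimic

print(activation_losses([0.5, -0.2, 0.0], [0.4, 0.0, 0.0]))
```

    In a real training loop this value would be added to the decoder's main loss, with the two weights tuned as hyperparameters.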
    Feature Space Particle Inference for Neural Network Ensembles. (arXiv:2206.00944v1 [cs.LG])
    Ensembles of deep neural networks demonstrate improved performance over single models. For enhancing the diversity of ensemble members while keeping their performance, particle-based inference methods offer a promising approach from a Bayesian perspective. However, the best way to apply these methods to neural networks is still unclear: seeking samples from the weight-space posterior suffers from inefficiency due to the over-parameterization issues, while seeking samples directly from the function-space posterior often results in serious underfitting. In this study, we propose optimizing particles in the feature space where the activation of a specific intermediate layer lies to address the above-mentioned difficulties. Our method encourages each member to capture distinct features, which is expected to improve ensemble prediction robustness. Extensive evaluation on real-world datasets shows that our model significantly outperforms the gold-standard Deep Ensembles on various metrics, including accuracy, calibration, and robustness. Code is available at https://github.com/DensoITLab/featurePI .
    DNN-assisted Particle-based Bayesian Joint Synchronization and Localization. (arXiv:2110.02771v2 [cs.IT] UPDATED)
    In this work, we propose a Deep neural network-assisted Particle Filter-based (DePF) approach to address the Mobile User (MU) joint synchronization and localization (sync\&loc) problem in ultra dense networks. In particular, DePF deploys an asymmetric time-stamp exchange mechanism between the MUs and the Access Points (APs), which, traditionally, provides us with information about the MUs' clock offset and skew. However, information about the distance between an AP and an MU is also intrinsic to the propagation delay experienced by exchanged time-stamps. In addition, to estimate the angle of arrival of the received synchronization packet, DePF draws on the multiple signal classification algorithm that is fed by the Channel Impulse Response (CIR) experienced by the sync packets. The CIR is also leveraged to determine the link condition, i.e. Line-of-Sight (LoS) or Non-LoS. Finally, to perform joint sync\&loc, DePF capitalizes on particle Gaussian mixtures that allow for a hybrid particle-based and parametric Bayesian Recursive Filtering (BRF) fusion of the aforementioned pieces of information and thus jointly estimate the position and clock parameters of the MUs. The simulation results verify the superiority of the proposed algorithm over the state-of-the-art schemes, especially that of Extended Kalman filter- and linearized BRF-based joint sync\&loc. In particular, only drawing on the synchronization time-stamp exchange and CIRs, for 90$\%$ of the cases, the absolute position and clock offset estimation errors remain below 1 meter and 2 nanoseconds, respectively.
    On the reversibility of adversarial attacks. (arXiv:2206.00772v1 [cs.LG])
    Adversarial attacks modify images with perturbations that change the prediction of classifiers. These modified images, known as adversarial examples, expose the vulnerabilities of deep neural network classifiers. In this paper, we investigate the predictability of the mapping between the classes predicted for original images and for their corresponding adversarial examples. This predictability relates to the possibility of retrieving the original predictions and hence reversing the induced misclassification. We refer to this property as the reversibility of an adversarial attack, and quantify reversibility as the accuracy in retrieving the original class or the true class of an adversarial example. We present an approach that reverses the effect of an adversarial attack on a classifier using a prior set of classification results. We analyse the reversibility of state-of-the-art adversarial attacks on benchmark classifiers and discuss the factors that affect the reversibility.
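    The reversibility measure defined in the abstract above (accuracy in retrieving the original class of an adversarial example from a prior set of classification results) can be sketched with a simple lookup table. The majority-vote mapping used here is an assumption for illustration, not necessarily the paper's approach, and the names `fit_reverse_map` and `reversibility` are hypothetical.

```python
from collections import Counter, defaultdict

def fit_reverse_map(prior_results):
    """Build a map from adversarial prediction to most likely original prediction.

    prior_results: iterable of (original_class, adversarial_class) pairs
    collected beforehand for a given attack and classifier.
    """
    counts = defaultdict(Counter)
    for original, adversarial in prior_results:
        counts[adversarial][original] += 1
    # For each adversarial class, keep the original class seen most often.
    return {adv: c.most_common(1)[0][0] for adv, c in counts.items()}

def reversibility(reverse_map, test_results):
    """Fraction of test pairs whose original class is recovered by the map."""
    hits = sum(1 for original, adversarial in test_results
               if reverse_map.get(adversarial) == original)
    return hits / len(test_results)

prior = [("cat", "dog"), ("cat", "dog"), ("bird", "dog"), ("dog", "cat")]
reverse_map = fit_reverse_map(prior)
print(reversibility(reverse_map, [("cat", "dog"), ("bird", "dog")]))
```

    An attack that always sends each original class to one fixed adversarial class would score 1.0 under this measure, while an attack with an unpredictable class mapping would score close to chance.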
    Anarchic Federated Learning. (arXiv:2108.09875v2 [cs.LG] UPDATED)
    Present-day federated learning (FL) systems deployed over edge networks consist of a large number of workers with high degrees of heterogeneity in data and/or computing capabilities, which call for flexible worker participation in terms of timing, effort, data heterogeneity, etc. To satisfy the need for flexible worker participation, we consider a new FL paradigm called "Anarchic Federated Learning" (AFL) in this paper. In stark contrast to conventional FL models, each worker in AFL has the freedom to choose i) when to participate in FL, and ii) the number of local steps to perform in each round based on its current situation (e.g., battery level, communication channels, privacy concerns). However, such chaotic worker behaviors in AFL pose many new open questions in algorithm design. In particular, it remains unclear whether one could develop convergent AFL training algorithms, and if yes, under what conditions and how fast the achievable convergence speed is. Toward this end, we propose two Anarchic Federated Averaging (AFA) algorithms with two-sided learning rates for both cross-device and cross-silo settings, which are named AFA-CD and AFA-CS, respectively. Somewhat surprisingly, we show that, under mild anarchic assumptions, both AFL algorithms achieve the best known convergence rate as the state-of-the-art algorithms for conventional FL. Moreover, they retain the highly desirable {\em linear speedup effect} with respect to both the number of workers and local steps in the new AFL paradigm. We validate the proposed algorithms with extensive experiments on real-world datasets.
    Leveraging Non-uniformity in First-order Non-convex Optimization. (arXiv:2105.06072v3 [cs.LG] UPDATED)
    Classical global convergence results for first-order methods rely on uniform smoothness and the \L{}ojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to \emph{Non-uniform Smoothness} (NS) and \emph{Non-uniform \L{}ojasiewicz inequality} (N\L{}). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to $O(e^{-t})$ while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings.
    Combining Machine Learning and Agent-Based Modeling to Study Biomedical Systems. (arXiv:2206.01092v1 [q-bio.QM])
    Agent-based modeling (ABM) is a well-established paradigm for simulating complex systems via interactions between constituent entities. Machine learning (ML) refers to approaches whereby statistical algorithms 'learn' from data on their own, without imposing a priori theories of system behavior. Biological systems -- from molecules, to cells, to entire organisms -- consist of vast numbers of entities, governed by complex webs of interactions that span many spatiotemporal scales and exhibit nonlinearity, stochasticity and intricate coupling between entities. The macroscopic properties and collective dynamics of such systems are difficult to capture via continuum modelling and mean-field formalisms. ABM takes a 'bottom-up' approach that obviates these difficulties by enabling one to easily propose and test a set of well-defined 'rules' to be applied to the individual entities (agents) in a system. Evaluating a system and propagating its state over discrete time-steps effectively simulates the system, allowing observables to be computed and system properties to be analyzed. Because the rules that govern an ABM can be difficult to abstract and formulate from experimental data, there is an opportunity to use ML to help infer optimal, system-specific ABM rules. Once such rule-sets are devised, ABM calculations can generate a wealth of data, and ML can be applied there too -- e.g., to probe statistical measures that meaningfully describe a system's stochastic properties. As an example of synergy in the other direction (from ABM to ML), ABM simulations can generate realistic datasets for training ML algorithms (e.g., for regularization, to mitigate overfitting). In these ways, one can envision various synergistic ABM$\rightleftharpoons$ML loops. This review summarizes how ABM and ML have been integrated in contexts that span spatial scales from the cellular to population-level scale epidemiology.
    Learning to Untangle Genome Assembly with Graph Convolutional Networks. (arXiv:2206.00668v1 [q-bio.GN])
    A quest to determine the complete sequence of human DNA from telomere to telomere started three decades ago and was finally completed in 2021. This accomplishment was a result of a tremendous effort of numerous experts who engineered various tools and performed laborious manual inspection to achieve the first gapless genome sequence. However, such a method can hardly be used as a general approach to assemble different genomes, especially when the assembly speed is critical given the large amount of data. In this work, we explore a different approach to the central part of the genome assembly task that consists of untangling a large assembly graph from which a genomic sequence needs to be reconstructed. Our main motivation is to reduce human-engineered heuristics and use deep learning to develop more generalizable reconstruction techniques. Precisely, we introduce a new learning framework to train a graph convolutional network to resolve assembly graphs by finding a correct path through them. The training is supervised with a dataset generated from the resolved CHM13 human sequence and tested on assembly graphs built using real human PacBio HiFi reads. Experimental results show that a model, trained on simulated graphs generated solely from a single chromosome, is able to remarkably resolve all other chromosomes. Moreover, the model outperforms hand-crafted heuristics from a state-of-the-art \textit{de novo} assembler on the same graphs. Chromosomes reconstructed with graph networks are more accurate on the nucleotide level and report a lower number of contigs, a higher genome reconstructed fraction, and higher NG50/NGA50 assessment metrics.
    Graph Kernels Based on Multi-scale Graph Embeddings. (arXiv:2206.00979v1 [cs.LG])
    Graph kernels are conventional methods for computing graph similarities. However, most of the R-convolution graph kernels face two challenges: 1) They cannot compare graphs at multiple different scales, and 2) they do not consider the distributions of substructures when computing the kernel matrix. These two challenges limit their performances. To mitigate the two challenges, we propose a novel graph kernel called the Multi-scale Path-pattern Graph kernel (MPG), at the heart of which is the multi-scale path-pattern node feature map. Each element of the path-pattern node feature map is the number of occurrences of a path-pattern around a node. A path-pattern is constructed by the concatenation of all the node labels in a path of a truncated BFS tree rooted at each node. Since the path-pattern node feature map can only compare graphs at local scales, we incorporate into it the multiple different scales of the graph structure, which are captured by the truncated BFS trees of different depth. We use the Wasserstein distance to compute the similarity between the multi-scale path-pattern node feature maps of two graphs, considering the distributions of substructures. We empirically validate MPG on various benchmark graph datasets and demonstrate that it achieves state-of-the-art performance.
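    The path-pattern node feature map described in the abstract above can be sketched as follows: for one root node, build a BFS tree truncated at a given depth and count the label strings read along each root-to-node path. This is a minimal sketch under assumptions: the adjacency-dict graph format, the `-` separator, counting the root's own label as a pattern, and the function name `path_pattern_features` are all illustrative choices, and the Wasserstein comparison step is not shown.

```python
from collections import Counter, deque

def path_pattern_features(adj, labels, root, depth):
    """Count path-patterns in the BFS tree rooted at `root`, truncated at `depth`.

    adj: dict mapping each node to an iterable of its neighbours.
    labels: dict mapping each node to a string label.
    A path-pattern is the concatenation of node labels on the tree path
    from the root to a node.
    """
    patterns = Counter()
    pattern_of = {root: labels[root]}  # pattern of the tree path ending at each node
    level = {root: 0}                  # BFS depth of each visited node
    queue = deque([root])
    while queue:
        u = queue.popleft()
        patterns[pattern_of[u]] += 1
        if level[u] == depth:
            continue  # truncate the tree at the requested depth
        for v in sorted(adj[u]):  # sorted only to make the traversal deterministic
            if v not in level:
                level[v] = level[u] + 1
                pattern_of[v] = pattern_of[u] + "-" + labels[v]
                queue.append(v)
    return patterns

# A three-node path graph 0 - 1 - 2 with labels A, B, A.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "A", 1: "B", 2: "A"}
print(dict(path_pattern_features(adj, labels, root=1, depth=1)))
```

    Calling the function with several values of `depth` for the same root gives the multiple scales the abstract refers to.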
    Introducing One Sided Margin Loss for Solving Classification Problems in Deep Networks. (arXiv:2206.01002v1 [cs.LG])
    This paper introduces a new loss function, OSM (One-Sided Margin), to solve maximum-margin classification problems effectively. Unlike the hinge loss, in OSM the margin is explicitly determined with corresponding hyperparameters and then the classification problem is solved. In experiments, we observe that using OSM loss leads to faster training speeds and better accuracies than binary and categorical cross-entropy in several commonly used deep models for classification and optical character recognition problems. OSM has consistently shown better classification accuracies over cross-entropy and hinge losses for small to large neural networks. It has also led to a more efficient training procedure. We achieved state-of-the-art accuracies for small networks on several benchmark datasets: CIFAR10 (98.82\%), CIFAR100 (91.56\%), Flowers (98.04\%), and Stanford Cars (93.91\%), with considerable improvements over other loss functions. Moreover, the accuracies are better than those of cross-entropy and hinge loss for large networks. Therefore, we strongly believe that OSM is a powerful alternative to hinge and cross-entropy losses to train deep neural networks on classification tasks.
    Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers. (arXiv:2009.01367v3 [cs.LG] UPDATED)
    While neural network binary classifiers are often evaluated on metrics such as Accuracy and $F_1$-Score, they are commonly trained with a cross-entropy objective. How can this training-evaluation gap be addressed? While specific techniques have been adopted to optimize certain confusion matrix based metrics, it is challenging or impossible in some cases to generalize the techniques to other metrics. Adversarial learning approaches have also been proposed to optimize networks via confusion matrix based metrics, but they tend to be much slower than common training methods. In this work, we propose a unifying approach to training neural network binary classifiers that combines a differentiable approximation of the Heaviside function with a probabilistic view of the typical confusion matrix values using soft sets. Our theoretical analysis shows the benefit of using our method to optimize for a given evaluation metric, such as $F_1$-Score, with soft sets, and our extensive experiments show the effectiveness of our approach in several domains.
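    The idea in the abstract above, replacing the hard threshold with a differentiable approximation of the Heaviside function so that confusion-matrix entries become soft counts, can be sketched for the $F_1$-Score. The abstract does not say which approximation the paper uses; this sketch assumes a steep sigmoid, and the names `soft_heaviside` and `soft_f1` and the `sharpness` parameter are hypothetical.

```python
import math

def soft_heaviside(z, sharpness=10.0):
    """Smooth stand-in for the Heaviside step (a sigmoid; an assumption here)."""
    return 1.0 / (1.0 + math.exp(-sharpness * z))

def soft_f1(scores, labels, threshold=0.5):
    """F1 computed from soft confusion-matrix entries.

    scores: predicted probabilities in [0, 1]; labels: 0 or 1.
    Each example contributes fractionally to TP, FP and FN according to
    its soft membership in the predicted-positive set.
    """
    tp = fp = fn = 0.0
    for s, y in zip(scores, labels):
        p = soft_heaviside(s - threshold)  # soft "predicted positive" indicator
        tp += p * y
        fp += p * (1 - y)
        fn += (1 - p) * y
    return 2 * tp / (2 * tp + fp + fn + 1e-12)  # small constant avoids 0/0

print(soft_f1([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

    Because every operation is smooth in the scores, an automatic differentiation framework could use the negative of this quantity as a training loss.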
    Learning to Solve PDE-constrained Inverse Problems with Graph Networks. (arXiv:2206.00711v1 [cs.LG])
    Learned graph neural networks (GNNs) have recently been established as fast and accurate alternatives for principled solvers in simulating the dynamics of physical systems. In many application domains across science and engineering, however, we are not only interested in a forward simulation but also in solving inverse problems with constraints defined by a partial differential equation (PDE). Here we explore GNNs to solve such PDE-constrained inverse problems. Given a sparse set of measurements, we are interested in recovering the initial condition or parameters of the PDE. We demonstrate that GNNs combined with autodecoder-style priors are well-suited for these tasks, achieving more accurate estimates of initial conditions or physical parameters than other learned approaches when applied to the wave equation or Navier-Stokes equations. We also demonstrate computational speedups of up to 90x using GNNs compared to principled solvers. Project page: https://cyanzhao42.github.io/LearnInverseProblem
    Vygotskian Autotelic Artificial Intelligence: Language and Culture Internalization for Human-Like AI. (arXiv:2206.01134v1 [cs.AI])
    Building autonomous artificial agents able to grow open-ended repertoires of skills is one of the fundamental goals of AI. To that end, a promising developmental approach recommends the design of intrinsically motivated agents that learn new skills by generating and pursuing their own goals - autotelic agents. However, existing algorithms still show serious limitations in terms of goal diversity, exploration, generalization or skill composition. This perspective calls for the immersion of autotelic agents into rich socio-cultural worlds. We focus on language especially, and how its structure and content may support the development of new cognitive functions in artificial agents, just like it does in humans. Indeed, most of our skills could not be learned in isolation. Formal education teaches us to reason systematically, books teach us history, and YouTube might teach us how to cook. Crucially, our values, traditions, norms and most of our goals are cultural in essence. This knowledge, and some argue, some of our cognitive functions such as abstraction, compositional imagination or relational thinking, are formed through linguistic and cultural interactions. Inspired by the work of Vygotsky, we suggest the design of Vygotskian autotelic agents able to interact with others and, more importantly, able to internalize these interactions to transform them into cognitive tools supporting the development of new cognitive functions. This perspective paper proposes a new AI paradigm in the quest for artificial lifelong skill discovery. It justifies the approach by uncovering examples of new artificial cognitive functions emerging from interactions between language and embodiment in recent works at the intersection of deep reinforcement learning and natural language processing. Looking forward, it highlights future opportunities and challenges for Vygotskian Autotelic AI research.
    From Cities to Series: Complex Networks and Deep Learning for Improved Spatial and Temporal Analytics*. (arXiv:2206.01176v1 [cs.LG])
    Graphs have often been used to answer questions about the interaction between real-world entities by taking advantage of their capacity to represent complex topologies. Complex networks are known to be graphs that capture such non-trivial topologies; they are able to represent human phenomena such as epidemic processes, the dynamics of populations, and the urbanization of cities. The investigation of complex networks has been extrapolated to many fields of science, with particular emphasis on computing techniques, including artificial intelligence. In such a case, the analysis of the interaction between entities of interest is transposed to the internal learning of algorithms, a paradigm whose investigation is able to expand the state of the art in Computer Science. By exploring this paradigm, this thesis puts together complex networks and machine learning techniques to improve the understanding of the human phenomena observed in pandemics, pendular migration, and street networks. Accordingly, we contribute with: (i) a new neural network architecture capable of modeling dynamic processes observed in spatial and temporal data with applications in epidemics propagation, weather forecasting, and patient monitoring in intensive care units; (ii) a machine-learning methodology for analyzing and predicting links in the scope of human mobility between all the cities of Brazil; and, (iii) techniques for identifying inconsistencies in the urban planning of cities while tracking the most influential vertices, with applications over Brazilian and worldwide cities. We obtained results sustained by sound evidence of advances to the state of the art in artificial intelligence, rigorous formalisms, and ample experimentation. Our findings rely upon real-world applications in a range of domains, demonstrating the applicability of our methodologies.
    Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm. (arXiv:2206.00833v1 [cs.LG])
    Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) to achieve provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of uniform approximation power of the actor neural network to achieve global optimality in policy optimization due to distributional shift.
    Primal-dual extrapolation methods for monotone inclusions under local Lipschitz continuity with applications to variational inequality, conic constrained saddle point, and convex conic optimization problems. (arXiv:2206.00973v1 [math.OC])
    In this paper we consider a class of structured monotone inclusion (MI) problems that consist of finding a zero in the sum of two monotone operators, in which one is maximal monotone while another is locally Lipschitz continuous. In particular, we first propose a primal-dual extrapolation (PDE) method for solving a structured strongly MI problem by modifying the classical forward-backward splitting method by using a point and operator extrapolation technique, in which the parameters are adaptively updated by a backtracking line search scheme. The proposed PDE method is almost parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\log \epsilon^{-1})$, measured by the amount of fundamental operations consisting only of evaluations of one operator and resolvent of another operator, for finding an $\epsilon$-residual solution of the structured strongly MI problem. We then propose another PDE method for solving a structured non-strongly MI problem by applying the above PDE method to approximately solve a sequence of structured strongly MI problems. The resulting PDE method is parameter-free, equipped with a verifiable termination criterion, and enjoys an operation complexity of ${\cal O}(\epsilon^{-1}\log \epsilon^{-1})$ for finding an $\epsilon$-residual solution of the structured non-strongly MI problem. As a consequence, we apply the latter PDE method to convex conic optimization, conic constrained saddle point, and variational inequality problems, and obtain complexity results for finding an $\epsilon$-KKT or $\epsilon$-residual solution of them under local Lipschitz continuity. To the best of our knowledge, no prior studies were conducted to investigate methods with complexity guarantees for solving the aforementioned problems under local Lipschitz continuity. All the complexity results obtained in this paper are entirely new.
    Mask-Guided Divergence Loss Improves the Generalization and Robustness of Deep Neural Network. (arXiv:2206.00913v1 [cs.LG])
    A deep neural network (DNN) with dropout can be regarded as an ensemble model consisting of lots of sub-DNNs (i.e., an ensemble sub-DNN where the sub-DNN is the remaining part of the DNN after dropout), and through increasing the diversity of the ensemble sub-DNN, the generalization and robustness of the DNN can be effectively improved. In this paper, a mask-guided divergence loss function (MDL), which consists of a cross-entropy loss term and an orthogonal term, is proposed to increase the diversity of the ensemble sub-DNN by the added orthogonal term. In particular, the mask technique is introduced to assist in generating the orthogonal term for avoiding overfitting of the diversity learning. The theoretical analysis and extensive experiments on 4 datasets (i.e., MNIST, FashionMNIST, CIFAR10, and CIFAR100) show that MDL can improve the generalization and robustness of standard training and adversarial training. For CIFAR10 and CIFAR100, in standard training, the maximum improvement of accuracy is $1.38\%$ on natural data, $30.97\%$ on FGSM (i.e., Fast Gradient Sign Method) attack, and $38.18\%$ on PGD (i.e., Projected Gradient Descent) attack. In adversarial training, the maximum improvement is $1.68\%$ on natural data, $4.03\%$ on FGSM attack and $2.65\%$ on PGD attack.
    Boosting Independent Component Analysis. (arXiv:2112.06920v3 [stat.ML] UPDATED)
    Independent component analysis is intended to recover the mutually independent components from their linear mixtures. This technique has been widely used in many fields, such as data analysis, signal processing, and machine learning. To alleviate the dependency on prior knowledge concerning unknown sources, many nonparametric methods have been proposed. In this paper, we present a novel boosting-based algorithm for independent component analysis. Our algorithm consists of maximizing likelihood estimation via boosting and seeking unmixing matrix by the fixed-point method. A variety of experiments validate its performance compared with many of the presently known algorithms.
    Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates. (arXiv:2206.00832v1 [cs.LG])
    Benchmarking the tradeoff between neural network accuracy and training time is computationally expensive. Here we show how a multiplicative cyclic learning rate schedule can be used to construct a tradeoff curve in a single training run. We generate cyclic tradeoff curves for combinations of training methods such as Blurpool, Channels Last, Label Smoothing and MixUp, and highlight how these cyclic tradeoff curves can be used to evaluate the effects of algorithmic choices on network training efficiency.
    Dynamic Structure Estimation from Bandit Feedback. (arXiv:2206.00861v1 [cs.DM])
    This work presents a novel method for structure estimation of an underlying dynamical system. We tackle the problem of estimating dynamic structure from bandit feedback contaminated by sub-Gaussian noise. In particular, we focus on periodically behaved discrete dynamical systems in the Euclidean space, and carefully identify a certain obtainable subset of the full information of the periodic structure. We then derive a sample complexity bound for periodic structure estimation. Technically, asymptotic results for exponential sums are adopted to effectively average out the noise effects while preventing the information to be estimated from vanishing. For linear systems, the use of the Weyl sum further allows us to extract eigenstructures. Our theoretical claims are experimentally validated on simulations of toy examples, including Cellular Automata.
    Sampling Trade-Offs in Duty-Cycled Systems for Air Quality Low-Cost Sensors. (arXiv:2112.09072v2 [eess.SP] UPDATED)
    The use of low-cost sensors in conjunction with high-precision instrumentation for air pollution monitoring has shown promising results in recent years. One of the main challenges for these sensors has been the quality of their data, which is why the main efforts have focused on calibrating the sensors using machine learning techniques to improve the data quality. However, there is one aspect that has been overlooked, that is, these sensors are mounted on nodes that may have energy consumption restrictions if they are battery-powered. In this paper, we show the usual sensor data gathering process and we study the existing trade-offs between the sampling of such sensors, the quality of the sensor calibration, and the power consumption involved. To this end, we conduct experiments on prototype nodes measuring tropospheric ozone, nitrogen dioxide, and nitrogen monoxide at high frequency. The results show that the sensor sampling strategy directly affects the quality of the air pollution estimation and that each type of sensor may require different sampling strategies. In addition, duty cycles of 0.1 can be achieved when the sensors have response times in the order of two minutes, and duty cycles between 0.01 and 0.02 can be achieved when the sensor response times are negligible, calibrating with hourly reference values and maintaining a quality of calibrated data similar to when the node is connected to an uninterruptible power supply.
    The effective noise of Stochastic Gradient Descent. (arXiv:2112.10852v3 [cond-mat.dis-nn] UPDATED)
    Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
    Assessing the trade-off between prediction accuracy and interpretability for topic modeling on energetic materials corpora. (arXiv:2206.00773v1 [cs.CL])
    As the amount and variety of energetics research increase, machine aware topic identification is necessary to streamline future research pipelines. The makeup of an automatic topic identification process consists of creating document representations and performing classification. However, the implementation of these processes on energetics research imposes new challenges. First, energetics datasets contain many scientific terms that are necessary to understand the context of a document but may require more complex document representations. Second, the predictions from classification must be understandable and trusted by the chemists within the pipeline. In this work, we study the trade-off between prediction accuracy and interpretability by implementing three document embedding methods that vary in computational complexity. Alongside our accuracy results, we also introduce local interpretable model-agnostic explanations (LIME) of each prediction to provide a localized understanding of each prediction and to validate classifier decisions with our team of energetics experts. This study was carried out on a novel labeled energetics dataset created and validated by our team of energetics experts.
    Hard Negative Sampling Strategies for Contrastive Representation Learning. (arXiv:2206.01197v1 [cs.LG])
    One of the challenges in contrastive learning is the selection of appropriate \textit{hard negative} examples, in the absence of label information. Random sampling or importance sampling methods based on feature similarity often lead to sub-optimal performance. In this work, we introduce UnReMix, a hard negative sampling strategy that takes into account anchor similarity, model uncertainty and representativeness. Experimental results on several benchmarks show that UnReMix improves negative sample selection, and subsequently downstream performance when compared to state-of-the-art contrastive learning methods.
    How Infinitely Wide Neural Networks Benefit from Multi-task Learning -- an Exact Macroscopic Characterization. (arXiv:2112.15577v3 [cs.LG] UPDATED)
    In practice, multi-task learning (through learning features shared among tasks) is an essential property of deep neural networks (NNs). While infinite-width limits of NNs can provide a good intuition for their generalization behavior, the well-known infinite-width limits of NNs in the literature (e.g., neural tangent kernels) assume specific settings in which wide ReLU-NNs behave like shallow Gaussian Processes with a fixed kernel. Consequently, in such settings, these NNs lose their ability to benefit from multi-task learning in the infinite-width limit. In contrast, we prove that optimizing wide ReLU neural networks with at least one hidden layer using L2-regularization on the parameters enforces multi-task learning due to representation-learning - also in the limiting regime where the network width tends to infinity. We present an exact quantitative characterization of this infinite width limit in an appropriate function space that neatly describes multi-task learning.
    Dynamic Cardiac MRI Reconstruction Using Combined Tensor Nuclear Norm and Casorati Matrix Nuclear Norm Regularizations. (arXiv:2206.00831v1 [eess.IV])
    Low-rank tensor models have been applied in accelerating dynamic magnetic resonance imaging (dMRI). Recently, a new tensor nuclear norm based on t-SVD has been proposed and applied to tensor completion. Inspired by the different properties of the tensor nuclear norm (TNN) and the Casorati matrix nuclear norm (MNN), we introduce a combined TNN and Casorati MNN regularization framework to reconstruct dMRI, which we term TMNN. The proposed method simultaneously exploits the spatial structure and the temporal correlation of the dynamic MR data. The optimization problem can be efficiently solved by the alternating direction method of multipliers (ADMM). In order to further improve the computational efficiency, we develop a fast algorithm under the Cartesian sampling scenario. Numerical experiments based on cardiac cine MRI and perfusion MRI data demonstrate the performance improvement over the traditional Casorati nuclear norm regularization method.
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. (arXiv:2206.00939v1 [stat.ML])
    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
    New Riemannian preconditioned algorithms for tensor completion via polyadic decomposition. (arXiv:2101.11108v2 [math.OC] UPDATED)
    We propose new Riemannian preconditioned algorithms for low-rank tensor completion via the polyadic decomposition of a tensor. These algorithms exploit a non-Euclidean metric on the product space of the factor matrices of the low-rank tensor in the polyadic decomposition form. This new metric is designed using an approximation of the diagonal blocks of the Hessian of the tensor completion cost function, and thus has a preconditioning effect on these algorithms. We prove that the proposed Riemannian gradient descent algorithm globally converges to a stationary point of the tensor completion problem, with convergence rate estimates using the Łojasiewicz property. Numerical results on synthetic and real-world data suggest that the proposed algorithms are more efficient in memory and time compared to state-of-the-art algorithms. Moreover, the proposed algorithms display a greater tolerance for overestimated rank parameters in terms of the tensor recovery performance, thus enabling a flexible choice of the rank parameter.
    Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs. (arXiv:2206.00873v1 [cs.LG])
    This study considers online learning with general directed feedback graphs. For this problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds for adversarial environments as well as poly-logarithmic regret bounds for stochastic environments. As Alon et al. [2015] have shown, tight regret bounds depend on the structure of the feedback graph: \textit{strongly observable} graphs yield minimax regret of $\tilde{\Theta}( \alpha^{1/2} T^{1/2} )$, while \textit{weakly observable} graphs induce minimax regret of $\tilde{\Theta}( \delta^{1/3} T^{2/3} )$, where $\alpha$ and $\delta$, respectively, represent the independence number of the graph and the domination number of a certain portion of the graph. Our proposed algorithm for strongly observable graphs has a regret bound of $\tilde{O}( \alpha^{1/2} T^{1/2} ) $ for adversarial environments, as well as of $ {O} ( \frac{\alpha (\ln T)^3 }{\Delta_{\min}} ) $ for stochastic environments, where $\Delta_{\min}$ expresses the minimum suboptimality gap. This result resolves an open question raised by Erez and Koren [2021]. We also provide an algorithm for weakly observable graphs that achieves a regret bound of $\tilde{O}( \delta^{1/3}T^{2/3} )$ for adversarial environments and poly-logarithmic regret for stochastic environments. The proposed algorithms are based on the follow-the-perturbed-leader approach combined with newly designed update rules for learning rates.
    Applied Federated Learning: Architectural Design for Robust and Efficient Learning in Privacy Aware Settings. (arXiv:2206.00807v1 [cs.LG])
    The classical machine learning paradigm requires the aggregation of user data in a central location where machine learning practitioners can preprocess data, calculate features, tune models and evaluate performance. The advantages of this approach include leveraging high performance hardware (such as GPUs) and the ability of machine learning practitioners to do in-depth data analysis to improve model performance. However, these advantages may come at a cost to data privacy. User data is collected, aggregated, and stored on centralized servers for model development. Centralization of data poses risks, including a heightened risk of internal and external security incidents as well as accidental data misuse. Federated learning with differential privacy is designed to avoid the server-side centralization pitfall by bringing the ML learning step to users' devices. Learning is done in a federated manner where each mobile device runs a training loop on a local copy of a model. Updates from on-device models are sent to the server via encrypted communication and through differential privacy to improve the global model. In this paradigm, users' personal data remains on their devices. Surprisingly, model training in this manner comes at a fairly minimal degradation in model performance. However, federated learning comes with many other challenges due to its distributed nature, heterogeneous compute environments and lack of data visibility. This paper explores those challenges and outlines an architectural design solution we are exploring and testing to productionize federated learning at Meta scale.
    Robustness to Label Noise Depends on the Shape of the Noise Distribution in Feature Space. (arXiv:2206.01106v1 [cs.LG])
    Machine learning classifiers have been demonstrated, both empirically and theoretically, to be robust to label noise under certain conditions -- notably the typical assumption is that label noise is independent of the features given the class label. We provide a theoretical framework that generalizes beyond this typical assumption by modeling label noise as a distribution over feature space. We show that both the scale and the shape of the noise distribution influence the posterior likelihood; and the shape of the noise distribution has a stronger impact on classification performance if the noise is concentrated in feature space where the decision boundary can be moved. For the special case of uniform label noise (independent of features and the class label), we show that the Bayes optimal classifier for $c$ classes is robust to label noise until the ratio of noisy samples goes above $\frac{c-1}{c}$ (e.g. 90% for 10 classes), which we call the tipping point. However, for the special case of class-dependent label noise (independent of features given the class label), the tipping point can be as low as 50%. Most importantly, we show that when the noise distribution targets decision boundaries (label noise is directly dependent on feature space), classification robustness can drop off even at a small scale of noise. Even when evaluating recent label-noise mitigation methods we see reduced accuracy when label noise is dependent on features. These findings explain why machine learning often handles label noise well if the noise distribution is uniform in feature space; yet they also point to the difficulty of overcoming label noise when it is concentrated in a region of feature space where a decision boundary can move.
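    The $\frac{c-1}{c}$ tipping point quoted above can be checked with a short calculation. The sketch below is an illustration, not the paper's framework: it assumes that each noisy sample has its label replaced uniformly by one of the other $c-1$ classes, so the true label stays the most likely observed label only while $1-\rho > \rho/(c-1)$, where $\rho$ is the noise ratio. The function names are hypothetical.

```python
def observed_label_probabilities(num_classes, noise_ratio):
    """Return (P(observed = true label), P(observed = one given wrong label)).

    Assumed noise model: a fraction `noise_ratio` of samples is relabeled
    uniformly at random with one of the other `num_classes - 1` classes.
    """
    p_true = 1.0 - noise_ratio
    p_each_wrong = noise_ratio / (num_classes - 1)
    return p_true, p_each_wrong


def tipping_point(num_classes):
    """Noise ratio at which the true label stops being the most likely one."""
    return (num_classes - 1) / num_classes


for ratio in (0.85, 0.95):
    p_true, p_wrong = observed_label_probabilities(10, ratio)
    print(ratio, "true label still most likely:", p_true > p_wrong)
print("tipping point for 10 classes:", tipping_point(10))
```

    Solving $1-\rho = \rho/(c-1)$ for $\rho$ gives $\rho = (c-1)/c$, which matches the 90% figure for 10 classes.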
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v2 [cs.LG] UPDATED)
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. (arXiv:2205.07320v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis (LTH) has attracted attention because it can explain why over-parameterized models often show high generalization ability. It is known that when we use iterative magnitude pruning (IMP), which is an algorithm to find sparse networks with high generalization ability that can be trained from the initial weights independently, called winning tickets, the initial large learning rate does not work well in deep neural networks such as ResNet. However, since the initial large learning rate generally helps the optimizer to converge to flatter minima, we hypothesize that the winning tickets have relatively sharp minima, which is considered a disadvantage in terms of generalization ability. In this paper, we confirm this hypothesis and show that the PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer the PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets. Finally, we revisit existing algorithms for finding winning tickets from a PAC-Bayesian perspective and provide new insights into these methods.
    Robust Longitudinal Control for Vehicular Autonomous Platoons Using Deep Reinforcement Learning. (arXiv:2206.01175v1 [eess.SY])
    In the last few years, researchers have applied machine learning strategies in the context of vehicular platoons to increase the safety and efficiency of cooperative transportation. Reinforcement Learning methods have been employed in the longitudinal spacing control of Cooperative Adaptive Cruise Control systems, but to date, none of those studies have addressed problems of disturbance rejection in such scenarios. Characteristics such as uncertain parameters in the model and external interferences may prevent agents from reaching null-spacing errors when traveling at cruising speed. On the other hand, complex communication topologies lead to specific training processes that cannot be generalized to other contexts, demanding re-training every time the configuration changes. Therefore, in this paper, we propose an approach to generalize the training process of a vehicular platoon, such that the acceleration command of each agent becomes independent of the network topology. Also, we have modeled the acceleration input as a term with integral action, such that the Convolutional Neural Network is capable of learning corrective actions when the states are disturbed by unknown effects. We illustrate the effectiveness of our proposal with experiments using different network topologies, uncertain parameters, and external forces. Comparative analyses, in terms of the steady-state error and overshoot response, were conducted against the state-of-the-art literature. The findings offer new insights concerning generalization and robustness of using Reinforcement Learning in the control of autonomous platoons.
    Policy Gradient Algorithms with Monte-Carlo Tree Search for Non-Markov Decision Processes. (arXiv:2206.01011v1 [cs.LG])
    Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. Given a well-parameterized policy model, such as a neural network model, with appropriate initial parameters, the PG algorithms work well even when the environment does not have the Markov property. Otherwise, they can be trapped on a plateau or suffer from peakiness effects. As another successful RL approach, algorithms based on Monte-Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the board game playing domain. They are also suitable for non-Markov decision processes. However, since the standard MCTS does not have the ability to learn state representation, the size of the tree-search space can be too large to search. In this work, we examine a mixture policy of PG and MCTS so that each compensates for the other's difficulties while retaining the advantages of both. We derive conditions for asymptotic convergence with results of a two-timescale stochastic approximation and propose an algorithm that satisfies these conditions. The effectiveness of the proposed methods is verified through numerical experiments on non-Markov decision processes.
    Deep Learning on Implicit Neural Datasets. (arXiv:2206.01178v1 [cs.LG])
    Implicit neural representations (INRs) have become fast, lightweight tools for storing continuous data, but to date there is no general method for learning directly with INRs as a data representation. We introduce a principled deep learning framework for learning and inference directly with INRs of any type without reverting to grid-based features or operations. Our INR-Nets evaluate INRs on a low discrepancy sequence, enabling quasi-Monte Carlo (QMC) integration throughout the network. We prove INR-Nets are universal approximators on a large class of maps between $L^2$ functions. Additionally, INR-Nets have convergent gradients under the empirical measure, enabling backpropagation. We design INR-Nets as a continuous generalization of discrete networks, enabling them to be initialized with pre-trained models. We demonstrate learning of INR-Nets on classification (INR$\to$label) and segmentation (INR$\to$INR) tasks.
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v2 [stat.ML] UPDATED)
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
    Rare Gems: Finding Lottery Tickets at Initialization. (arXiv:2202.12002v2 [cs.LG] UPDATED)
    Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work by Frankle et al. and Su et al. presents concrete evidence that current algorithms for finding trainable networks at initialization fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner, which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive with or better than Iterative Magnitude Pruning (IMP), and does so up to $19\times$ faster.
    Merlin-Arthur Classifiers: Formal Interpretability with Interactive Black Boxes. (arXiv:2206.00759v1 [cs.LG])
    We present a new theoretical framework for making black box classifiers such as Neural Networks interpretable, basing our work on clear assumptions and guarantees. In our setting, which is inspired by the Merlin-Arthur protocol from Interactive Proof Systems, two functions cooperate to achieve a classification together: the \emph{prover} selects a small set of features as a certificate and presents it to the \emph{classifier}. Including a second, adversarial prover allows us to connect a game-theoretic equilibrium to information-theoretic guarantees on the exchanged features. We define notions of completeness and soundness that enable us to lower bound the mutual information between features and class. To demonstrate good agreement between theory and practice, we support our framework by providing numerical experiments for Neural Network classifiers, explicitly calculating the mutual information of features with respect to the class.
    On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting. (arXiv:2206.00761v1 [cs.LG])
    The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM first makes explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability and sample efficiency.
    BayesFormer: Transformer with Uncertainty Estimation. (arXiv:2206.00826v1 [cs.CL])
    The Transformer has become ubiquitous due to its dominant performance in various NLP and image processing tasks. However, there is a lack of understanding of how to generate mathematically grounded uncertainty estimates for transformer architectures. Models equipped with such uncertainty estimates can typically improve predictive performance, make networks robust, avoid over-fitting, and be used as acquisition functions in active learning. In this paper, we introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory. We propose a new theoretical framework to extend the approximate variational inference-based dropout to Transformer-based architectures. Through extensive experiments, we validate the proposed architecture in four paradigms and show improvements across the board: language modeling and classification, long-sequence understanding, machine translation and acquisition function for active learning.
    Comparing Conventional and Deep Feature Models for Classifying Fundus Photography of Hemorrhages. (arXiv:2206.01118v1 [eess.IV])
    Diabetic retinopathy is an eye-related pathology creating abnormalities and causing visual impairment, proper treatment of which requires identifying irregularities. This research uses a hemorrhage detection method and compares classification of conventional and deep features. In particular, the method identifies hemorrhages that are connected with blood vessels or reside at the retinal border, which are reported to be challenging. Initially, adaptive brightness adjustment and contrast enhancement rectify degraded images. Prospective locations of hemorrhages are estimated by a Gaussian matched filter, entropy thresholding, and morphological operation. Hemorrhages are segmented by a novel technique based on regional variance of intensities. Features are then extracted by conventional methods and deep models for training support vector machines, and the results are evaluated. Evaluation metrics for each model are promising, but findings suggest that, comparatively, deep models are more effective than conventional features.
    RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v4 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting areas, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that stability-constrained `networks of networks' can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.
    Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization. (arXiv:2206.00743v1 [math.OC])
    Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such a mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balances the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
    Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation. (arXiv:2206.01137v1 [cs.CL])
    General translation models often still struggle to generate accurate translations in specialized domains. To guide machine translation practitioners and characterize the effectiveness of domain adaptation methods under different data availability scenarios, we conduct an in-depth empirical exploration of monolingual and parallel data approaches to domain adaptation of pre-trained, third-party NMT models in settings where architecture change is impractical. We compare data-centric adaptation methods in isolation and in combination. We study method effectiveness in very low resource (8k parallel examples) and moderately low resource (46k parallel examples) conditions and propose an ensemble approach to alleviate reductions in original domain translation quality. Our work includes three domains (consumer electronics, clinical, and biomedical) and spans four language pairs: Zh-En, Ja-En, Es-En, and Ru-En. We also make concrete recommendations for achieving high in-domain performance, release our consumer electronics and medical domain datasets for all languages, and make our code publicly available.
    Compositional Coding Capsule Network with K-Means Routing for Text Classification. (arXiv:1810.09177v5 [cs.LG] UPDATED)
    Text classification is a challenging problem which aims to identify the category of texts. During training, word embeddings account for a large share of the parameters. When computing resources are limited, this indirectly restricts the design of the subsequent network. In order to reduce the number of parameters, the compositional coding mechanism has been proposed recently. Based on this, this paper further explores compositional coding and proposes a compositional weighted coding method. We also apply a capsule network to model the relationship between word embeddings, and propose a new routing algorithm, based on k-means clustering theory, to fully mine this relationship. Combining our compositional weighted coding method and the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.
    On the Difficulty of Defending Self-Supervised Learning against Model Extraction. (arXiv:2205.07890v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim's stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
    Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. (arXiv:2202.13884v2 [q-bio.GN] UPDATED)
    Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from them, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign them to the most likely gene from which they originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
    Uncalibrated Models Can Improve Human-AI Collaboration. (arXiv:2202.05983v2 [cs.AI] UPDATED)
    In many practical applications of AI, an AI model is used as a decision aid for human users. The AI provides advice that a human (sometimes) incorporates into their decision-making process. The AI advice is often presented with some measure of "confidence" that the human can use to calibrate how much they depend on or trust the advice. In this paper, we demonstrate that human-AI performance can be improved by calibrating this confidence to the humans using the advice. In practice, this means presenting calibrated AI models as more or less confident than they actually are. We show empirically that this can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks--dealing with images, text and tabular data--involving hundreds of human participants. We further support our findings with simulation analysis. Our findings suggest the importance of and a framework for jointly optimizing the human-AI system in contrast to the standard paradigm of optimizing the AI model alone.
    Sparse Mixed Linear Regression with Guarantees: Taming an Intractable Problem with Invex Relaxation. (arXiv:2206.01167v1 [cs.LG])
    In this paper, we study the problem of sparse mixed linear regression on an unlabeled dataset that is generated from linear measurements from two different regression parameter vectors. Since the data is unlabeled, our task is not only to figure out a good approximation of the regression parameter vectors but also to label the dataset correctly. In its original form, this problem is NP-hard. The most popular algorithms to solve this problem (such as Expectation-Maximization) have a tendency to get stuck at local minima. We provide a novel invex relaxation for this intractable problem which leads to a solution with provable theoretical guarantees. This relaxation enables exact recovery of data labels. Furthermore, we recover a close approximation of the regression parameter vectors which match the true parameter vectors in support and sign. Our formulation uses a carefully constructed primal dual witnesses framework for the invex problem. Furthermore, we show that the sample complexity of our method is only logarithmic in terms of the dimension of the regression parameter vectors.
    Composition of Relational Features with an Application to Explaining Black-Box Predictors. (arXiv:2206.00738v1 [cs.LG])
    Relational machine learning programs like those developed in Inductive Logic Programming (ILP) offer several advantages: (1) The ability to model complex relationships amongst data instances; (2) The use of domain-specific relations during model construction; and (3) The models constructed are human-readable, which is often one step closer to being human-understandable. However, these ILP-like methods have not been able to capitalise fully on the rapid hardware, software and algorithmic developments fuelling current developments in deep neural networks. In this paper, we treat relational features as functions and use the notion of generalised composition of functions to derive complex functions from simpler ones. We formulate the notion of a set of $\text{M}$-simple features in a mode language $\text{M}$ and identify two composition operators ($\rho_1$ and $\rho_2$) from which all possible complex features can be derived. We use these results to implement a form of "explainable neural network" called Compositional Relational Machines, or CRMs, which are labelled directed-acyclic graphs. The vertex-label for any vertex $j$ in the CRM contains a feature-function $f_j$ and a continuous activation function $g_j$. If $j$ is a "non-input" vertex, then $f_j$ is the composition of features associated with vertices in the direct predecessors of $j$. Our focus is on CRMs in which input vertices (those without any direct predecessors) all have $\text{M}$-simple features in their vertex-labels. We provide a randomised procedure for constructing and learning such CRMs. Using a notion of explanations based on the compositional structure of features in a CRM, we provide empirical evidence on synthetic data of the ability to identify appropriate explanations; and demonstrate the use of CRMs as 'explanation machines' for black-box models that do not provide explanations for their predictions.
    Combining machine learning with physics: A framework for tracking and sorting multiple dark solitons. (arXiv:2111.04881v2 [cond-mat.quant-gas] UPDATED)
    In ultracold-atom experiments, data often comes in the form of images which suffer information loss inherent in the techniques used to prepare and measure the system. This is particularly problematic when the processes of interest are complicated, such as interactions among excitations in Bose-Einstein condensates (BECs). In this paper, we describe a framework combining machine learning (ML) models with physics-based traditional analyses to identify and track multiple solitonic excitations in images of BECs. We use an ML-based object detector to locate the solitonic excitations and develop a physics-informed classifier to sort solitonic excitations into physically motivated subcategories. Lastly, we introduce a quality metric quantifying the likelihood that a specific feature is a longitudinal soliton. Our trained implementation of this framework, SolDet, is publicly available as an open-source python package. SolDet is broadly applicable to feature identification in cold-atom images when trained on a suitable user-provided dataset.
    Dynamic Privacy Budget Allocation Improves Data Efficiency of Differentially Private Gradient Descent. (arXiv:2101.07413v2 [cs.LG] UPDATED)
    Protecting privacy in learning while maintaining model performance has become increasingly critical in many applications that involve sensitive data. A popular private learning framework is differentially private learning, composed of many gradient iterations privatized by noising and clipping. Under the privacy constraint, it has been shown that dynamic policies can improve the final iterate loss, namely the quality of published models. In this talk, we will introduce these dynamic techniques for learning rate, batch size, noise magnitude and gradient clipping. We also discuss how a dynamic policy can change the convergence bounds, which provides further insight into the impact of dynamic methods.
    SanitAIs: Unsupervised Data Augmentation to Sanitize Trojaned Neural Networks. (arXiv:2109.04566v3 [cs.LG] UPDATED)
    Self-supervised learning (SSL) methods have resulted in broad improvements to neural network performance by leveraging large, untapped collections of unlabeled data to learn generalized underlying structure. In this work, we harness unsupervised data augmentation (UDA), an SSL technique, to mitigate backdoor or Trojan attacks on deep neural networks. We show that UDA is more effective at removing trojans than current state-of-the-art methods for both feature space and point triggers, over a range of model architectures, trojans, and data quantities provided for trojan removal. These results demonstrate that UDA is both an effective and practical approach to mitigating the effects of backdoors on neural networks.
    On the Generalization of Neural Combinatorial Optimization Heuristics. (arXiv:2206.00787v1 [cs.LG])
    Neural Combinatorial Optimization approaches have recently leveraged the expressiveness and flexibility of deep neural networks to learn efficient heuristics for hard Combinatorial Optimization (CO) problems. However, most of the current methods lack generalization: for a given CO problem, heuristics which are trained on instances with certain characteristics underperform when tested on instances with different characteristics. While some previous works have focused on varying the training instances' properties, we postulate that a one-size-fits-all model is out of reach. Instead, we formalize solving a CO problem over a given instance distribution as a separate learning task and investigate meta-learning techniques to learn a model on a variety of tasks, in order to optimize its capacity to adapt to new tasks. Through extensive experiments, on two CO problems, using both synthetic and realistic instances, we show that our proposed meta-learning approach significantly improves the generalization of two state-of-the-art models.
    DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks. (arXiv:2206.00843v1 [cs.LG])
    Efficient deep neural network (DNN) models equipped with compact operators (e.g., depthwise convolutions) have shown great potential in reducing DNNs' theoretical complexity (e.g., the total number of weights/operations) while maintaining a decent model accuracy. However, existing efficient DNNs are still limited in fulfilling their promise in boosting real-hardware efficiency, due to their commonly adopted compact operators' low hardware utilization. In this work, we open up a new compression paradigm for developing real-hardware efficient DNNs, leading to boosted hardware efficiency while maintaining model accuracy. Interestingly, we observe that while some DNN layers' activation functions help DNNs' training optimization and achievable accuracy, they can be properly removed after training without compromising the model accuracy. Inspired by this observation, we propose a framework dubbed DepthShrinker, which develops hardware-friendly compact networks via shrinking the basic building blocks of existing efficient DNNs that feature irregular computation patterns into dense ones with much improved hardware utilization and thus real-hardware efficiency. Excitingly, our DepthShrinker framework delivers hardware-friendly compact networks that outperform both state-of-the-art efficient DNNs and compression techniques, e.g., a 3.06\% higher accuracy and 1.53$\times$ throughput on Tesla V100 over SOTA channel-wise pruning method MetaPruning. Our codes are available at: https://github.com/RICE-EIC/DepthShrinker.
    Collaborative Learning of Distributions under Heterogeneity and Communication Constraints. (arXiv:2206.00707v1 [stat.ML])
    In modern machine learning, users often have to collaborate to learn distributions that generate the data. Communication can be a significant bottleneck. Prior work has studied homogeneous users -- i.e., whose data follow the same discrete distribution -- and has provided optimal communication-efficient methods. However, these methods rely heavily on homogeneity, and are less applicable in the common case when users' discrete distributions are heterogeneous. Here we consider a natural and tractable model of heterogeneity, where users' discrete distributions only vary sparsely, on a small number of entries. We propose a novel two-stage method named SHIFT: First, the users collaborate by communicating with the server to learn a central distribution; relying on methods from robust statistics. Then, the learned central distribution is fine-tuned to estimate the individual distributions of users. We show that SHIFT is minimax optimal in our model of heterogeneity and under communication constraints. Further, we provide experimental results using both synthetic data and $n$-gram frequency estimation in the text domain, which corroborate its efficiency.  ( 2 min )
    Indeterminacy in Latent Variable Models: Characterization and Strong Identifiability. (arXiv:2206.00801v1 [stat.ML])
    Most modern latent variable and probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Recent applications of such models have indicated the need for \textit{strongly} identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, most notably by the iVAE (arXiv:1907.04809 [stat.ML]), which excludes many -- but not all -- indeterminacies. We construct a full theoretical framework for analyzing the indeterminacies of latent variable models, and characterize them precisely in terms of properties of the generator functions and the latent variable prior distributions. To illustrate, we apply the framework to better understand the structure of recent identifiability results. We then investigate how we might specify strongly identifiable latent variable models, and construct two such classes of models. One is a straightforward modification of iVAE; the other uses ideas from optimal transport and leads to novel models and connections to recent work.  ( 2 min )
    Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning. (arXiv:2206.00796v1 [cs.LG])
    The $Q$-learning algorithm is a simple and widely-used stochastic approximation scheme for reinforcement learning, but the basic protocol can exhibit instability in conjunction with function approximation. Such instability can be observed even with linear function approximation. In practice, tools such as target networks and experience replay appear to be essential, but the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state of the art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error. The algorithm also exhibits a form of instance-dependence, in that its performance depends on the "effective" feature dimension.  ( 2 min )
    Bayesian Inference of Stochastic Dynamical Networks. (arXiv:2206.00858v1 [stat.ML])
    Network inference has been extensively studied in several fields, such as systems biology and social sciences. Learning network topology and internal dynamics is essential to understand mechanisms of complex systems. In particular, sparse topologies and stable dynamics are fundamental features of many real-world continuous-time networks. Given that usually only a subset of nodes can be observed, in this paper we consider linear continuous-time systems to depict networks, since they can model unmeasured nodes via transfer functions. Additionally, measurements tend to be noisy, with low and varying sampling frequencies. For this reason, we consider continuous-time (CT) models, since discrete-time approximations often require fine-grained measurements and uniform sampling steps. The developed method applies dynamical structure functions (DSFs) derived from linear stochastic differential equations (SDEs) to describe networks of measured nodes. Further, a numerical sampling method, preconditioned Crank-Nicolson (pCN), is used to refine coarse-grained trajectories to improve inference accuracy. Simulations conducted on random and ring networks and on a synthetic biological network illustrate that our method achieves state-of-the-art performance compared with group sparse Bayesian learning (GSBL), BINGO, kernel-based methods, dynGENIE3, GENIE3 and ARNI. In particular, these are challenging networks, suggesting that the developed method can be applied under a wide range of contexts.  ( 2 min )
    Coordinated Double Machine Learning. (arXiv:2206.00885v1 [stat.ML])
    Double machine learning is a statistical method for leveraging complex black-box models to construct approximately unbiased treatment effect estimates given observational data with high-dimensional covariates, under the assumption of a partially linear model. The idea is to first fit on a subset of the samples two non-linear predictive models, one for the continuous outcome of interest and one for the observed treatment, and then to estimate a linear coefficient for the treatment using the remaining samples through a simple orthogonalized regression. While this methodology is flexible and can accommodate arbitrary predictive models, typically trained independently of one another, this paper argues that a carefully coordinated learning algorithm for deep neural networks may reduce the estimation bias. The improved empirical performance of the proposed method is demonstrated through numerical experiments on both simulated and real data.  ( 2 min )
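    The two-stage procedure described in this abstract can be sketched as follows. This is only a minimal illustration of standard sample-split double machine learning, not the coordinated method the paper proposes: the polynomial regressor `fit_poly`, the function `dml_estimate`, and the synthetic data are all assumptions made for the example, with a polynomial fit standing in for an arbitrary black-box predictor.

```python
import numpy as np

def fit_poly(x, y, degree=3):
    """Least-squares polynomial fit; a stand-in for any black-box predictor."""
    coeffs = np.polyfit(x, y, degree)
    return lambda z: np.polyval(coeffs, z)

def dml_estimate(x, t, y, split=0.5, degree=3):
    """Estimate the linear treatment coefficient in y = theta * t + g(x) + noise."""
    k = int(len(y) * split)
    # Stage 1: fit one model for the outcome and one for the treatment
    # on the first subset of the samples.
    outcome_model = fit_poly(x[:k], y[:k], degree)
    treatment_model = fit_poly(x[:k], t[:k], degree)
    # Stage 2: on the remaining samples, regress outcome residuals
    # on treatment residuals (the orthogonalized regression).
    res_y = y[k:] - outcome_model(x[k:])
    res_t = t[k:] - treatment_model(x[k:])
    return float(res_t @ res_y / (res_t @ res_t))

# Synthetic partially linear data with a true coefficient of 2.0 (illustrative only).
rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1.0, 1.0, n)
t = np.sin(2.0 * x) + 0.5 * rng.normal(size=n)
y = 2.0 * t + x**2 + 0.5 * rng.normal(size=n)
theta_hat = dml_estimate(x, t, y)
```

    On this synthetic data the estimate should land close to the true coefficient; in the baseline shown here the two nuisance models are trained independently of one another, which is the part the paper argues can be improved by coordinating them.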
    (Machine) Learning What Policies Value. (arXiv:2206.00727v1 [econ.GN])
    When a policy prioritizes one person over another, is it because they benefit more, or because they are preferred? This paper develops a method to uncover the values consistent with observed allocation decisions. We use machine learning methods to estimate how much each individual benefits from an intervention, and then reconcile its allocation with (i) the welfare weights assigned to different people; (ii) heterogeneous treatment effects of the intervention; and (iii) weights on different outcomes. We demonstrate this approach by analyzing Mexico's PROGRESA anti-poverty program. The analysis reveals that while the program prioritized certain subgroups -- such as indigenous households -- the fact that those groups benefited more implies that they were in fact assigned a lower welfare weight. The PROGRESA case illustrates how the method makes it possible to audit existing policies, and to design future policies that better align with values.  ( 2 min )
    Federated Learning in Non-IID Settings Aided by Differentially Private Synthetic Data. (arXiv:2206.00686v1 [cs.LG])
    Federated learning (FL) is a privacy-promoting framework that enables a potentially large number of clients to collaboratively train machine learning models. In an FL system, a server coordinates the collaboration by collecting and aggregating clients' model updates while the clients' data remains local and private. A major challenge in federated learning arises when the local data is heterogeneous -- the setting in which performance of the learned global model may deteriorate significantly compared to the scenario where the data is identically distributed across the clients. In this paper we propose FedDPMS (Federated Differentially Private Means Sharing), an FL algorithm in which clients deploy variational auto-encoders to augment local datasets with data synthesized using differentially private means of latent data representations communicated by a trusted server. Such augmentation ameliorates effects of data heterogeneity across the clients without compromising privacy. Our experiments on deep image classification tasks demonstrate that FedDPMS outperforms competing state-of-the-art FL methods specifically designed for heterogeneous data settings.  ( 2 min )
    How Biased is Your Feature?: Computing Fairness Influence Functions with Global Sensitivity Analysis. (arXiv:2206.00667v1 [cs.LG])
    Fairness in machine learning has attained significant focus due to the widespread application of machine learning in high-stake decision-making tasks. Unless regulated with a fairness objective, machine learning classifiers might demonstrate unfairness/bias towards certain demographic populations in the data. Thus, the quantification and mitigation of the bias induced by classifiers have become a central concern. In this paper, we aim to quantify the influence of different features on the bias of a classifier. To this end, we propose a framework of Fairness Influence Function (FIF), and compute it as a scaled difference of conditional variances in the prediction of the classifier. We also instantiate an algorithm, FairXplainer, that uses variance decomposition among the subset of features and a local regressor to compute FIFs accurately, while also capturing the intersectional effects of the features. Our experimental analysis validates that FairXplainer captures the influences of both individual features and higher-order feature interactions, estimates the bias more accurately than existing local explanation methods, and detects the increase/decrease in bias due to affirmative/punitive actions in the classifier.  ( 2 min )
    Bayesian Learning to Discover Mathematical Operations in Governing Equations of Dynamic Systems. (arXiv:2206.00669v1 [cs.LG])
    Discovering governing equations from data is critical for diverse scientific disciplines as they can provide insights into the underlying phenomenon of dynamic systems. This work presents a new representation for governing equations by designing the Mathematical Operation Network (MathONet) with a deep neural network-like hierarchical structure. Specifically, the MathONet is built by stacking layers of unary operations (e.g., sin, cos, log) and binary operations (e.g., +, -). An initialized MathONet is typically regarded as a super-graph with a redundant structure, a sub-graph of which can yield the governing equation. We develop a sparse group Bayesian learning algorithm to extract the sub-graph by employing structurally constructed priors over the redundant mathematical operations. Using the chaotic Lorenz system, the Lotka-Volterra system, and the Kolmogorov-Petrovsky-Piskunov system as demonstrations, we show that the proposed method can discover the ordinary differential equations (ODEs) and partial differential equations (PDEs) from the observations given limited mathematical operations, without any prior knowledge of the possible expressions of the ODEs and PDEs.  ( 2 min )

  • Open

    Can anyone help me with this brainstormed idea? I'm very new to RL.
    Hello all, As the title states, I am really new to RL. I have been working on one project and I'll need to create a custom environment. The environment will not be made from an image with pixels. Instead, the environment will be constructed out of nodes and edges--a network. They will be defined by their relationship to each other (i.e., edge 1 connects nodes A and B). The agent will travel from node to node along edges. How might I get that to work? I already have the data. submitted by /u/professorDissociate [link] [comments]  ( 1 min )
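    One way the node-and-edge environment described in this post could be set up is sketched below. It is a plain-Python illustration, not tied to any RL library; the class name `GraphEnv`, the goal node, the step limit, and the reward values are all hypothetical choices made for the example.

```python
class GraphEnv:
    """Toy environment: the state is the current node, an action picks an adjacent edge."""

    def __init__(self, edges, start, goal, max_steps=20):
        # Build an undirected adjacency list from (node, node) pairs.
        self.adj = {}
        for u, v in edges:
            self.adj.setdefault(u, []).append(v)
            self.adj.setdefault(v, []).append(u)
        self.start = start
        self.goal = goal
        self.max_steps = max_steps
        self.node = start
        self.steps = 0

    def reset(self):
        """Put the agent back on the start node and return that node."""
        self.node = self.start
        self.steps = 0
        return self.node

    def actions(self):
        """Valid actions are indices into the current node's neighbor list."""
        return list(range(len(self.adj[self.node])))

    def step(self, action):
        """Move along the chosen edge; return (next_node, reward, done)."""
        self.node = self.adj[self.node][action]
        self.steps += 1
        reached = self.node == self.goal
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else -0.1  # hypothetical reward scheme
        return self.node, reward, done

# Example: a path graph A - B - C - D with the goal at D.
env = GraphEnv([("A", "B"), ("B", "C"), ("C", "D")], start="A", goal="D")
state = env.reset()
```

    Because the number of valid actions changes from node to node, an agent would need to query `actions()` at each step, or the environment would need to mask invalid moves; which of those fits best depends on the learning algorithm used.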
    Does env.reset openai gym randomly reinitialize the environment?
    submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    Anyone know any accessible guides to using TF-agents for bandit based problems?
    I know TensorFlow has some well documented tutorials but I am getting confused and stuck with developing for my side project. Ideally I would like to chat with someone on Discord to help me out. Thanks! submitted by /u/WirrryWoo [link] [comments]  ( 1 min )
    PyBullet objects act abnormally and I don't know why
    Hello, I am fairly new to PyBullet and I am having an issue with my simulation. Basically, there is a cube on a table and as soon as a robot tries to touch it the cube acts abnormally and "enters" the table. I will attach a small movie so you can see it. https://reddit.com/link/v2wtcz/video/hixholvlx3391/player Link for .urdf and .obj of TABLE Link for .urdf and .obj of CUBE I am using pybullet in python, let me know if you would want me to share more information. Thanks! submitted by /u/gabrigoo [link] [comments]  ( 1 min )
    Where do you intern?
    I am an RL guy, and I have found it's hard to get an RL internship. Only a few really big companies like Microsoft, NVidia, Google, Tesla, etc. offer them. Are there any other opportunities at not-so-big companies where I could find an RL internship? submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
  • Open

    Amazon SageMaker Notebook Instances now support configuring and restricting IMDS versions
    Today, we’re excited to announce that Amazon SageMaker now supports the ability to configure Instance Metadata Service Version 2 (IMDSv2) for Notebook Instances, and for administrators to control the minimum version with which end-users create new Notebook Instances. You can now choose IMDSv2 only for your new and existing SageMaker Notebook Instances to take advantage […]  ( 6 min )
    Reimagine search on GitHub repositories with the power of the Amazon Kendra GitHub connector
    Amazon Kendra offers highly accurate semantic and natural language search powered by machine learning (ML). Many organizations use GitHub as a code hosting platform for version control and to redefine collaboration of open-source software projects. A GitHub account repository might include many content types, such as files, issues, issue comments, issue comment attachments, pull requests, […]  ( 8 min )
    Merge cells and column headers in Amazon Textract tables
    Financial documents such as bank, loan, or mortgage statements are often formatted to be visually appealing and easy to read for the human eye. These same features can also make automated processing challenging at times. For instance, in the following sample statement, merging rows or columns in a table helps reduce information redundancy, but it […]  ( 5 min )
    Detect financial transaction fraud using a Graph Neural Network with Amazon SageMaker
    Fraud plagues many online businesses and costs them billions of dollars each year. Financial fraud, counterfeit reviews, bot attacks, account takeovers, and spam are all examples of online fraud and malicious behaviors. Although many businesses take approaches to combat online fraud, these existing approaches can have severe limitations. First, many existing methods aren’t sophisticated or […]  ( 10 min )
  • Open

    [D] Looking for recommended papers on document Key-Value extraction models
    Looking for recommended papers on general document Key-value extraction. My searches are coming up mostly with papers that rely on highly domain-specific heuristics. Any good places to start my search would be appreciated. submitted by /u/piccalillihighlands [link] [comments]  ( 1 min )
    [P] What if A/B testing is impossible to set up? I wrote a blog post on measuring impact using back-door adjustment, a type of causal analysis
    To ensure that every feature has a measurable impact on the broader platform my team will set up and run A/B testing on each new feature or product change, but what happens when a new feature needs to be released quickly and there is not enough time for a traditional testing approach? To make sure that these quick changes could still be measured I found a way to perform accurate pre-post analysis using a back-door adjustment of causal analysis. I wanted to share my findings with the community as it was able to help my team at DoorDash make quick bug fixes and still be able to measure the impact. Please check out the article to get the technical details and provide any feedback on my approach. https://doordash.engineering/2022/06/02/using-back-door-adjustment-causal-analysis-to-measure-pre-post-effects/ submitted by /u/tripleespresso7 [link] [comments]  ( 1 min )
    [D] Building an AI training cluster
    Hey there. We are about to buy hardware for training models. I'm eyeing an HP DL580 G8 because of a good deal on it. We also plan to scale up and add more servers to create a compute cluster. The server comes with 4 CPUs and 512 GB of RAM; this will certainly not all be used to train one model. The question then becomes what the best solution is: install VMware, Proxmox, or TrueNAS for virtualisation and pass the GPU through to the VM (does this limit the training in any way?), or rather just install Ubuntu Server on it? Any recommendations on what hypervisor/OS to use, or resources about how other people set up their servers for machine learning? Thanks in advance. submitted by /u/Joytimmermans [link] [comments]  ( 1 min )
    [R] Towards artificial general intelligence via a multimodal foundation model (Nature)
    This is published in Nature, so supposedly more notable than yet another multimodal experiment. But the way the article presents the results, leaves me confused about how this compares and contrasts to e.g. DeepMind Gato? https://www.nature.com/articles/s41467-022-30761-2 submitted by /u/valdanylchuk [link] [comments]  ( 2 min )
    [D] Recommendation System based on DNN (Softmax Model)
    So, I was going through Google's recommendation systems Colab and couldn't understand why, when training the matrix factorization model, we pass both user embeddings and movie embeddings to the CFModel class, but in the case of the Softmax model we pass only movie embeddings. Why not user embeddings too? I am including only the relevant part of the code. The full code can be seen in the link. Matrix Factorization Model: embeddings = { "user_id": U, "movie_id": V } return CFModel(embeddings, train_loss, [metrics]) Softmax Model: metrics = ( {"train_loss": train_loss, "test_loss": test_loss}, {"test_precision_at_10": test_precision_at_10} ) embeddings = {"movie_id": movie_embeddings} return CFModel(embeddings, train_loss, metrics) submitted by /u/Expensive_build [link] [comments]  ( 1 min )
    [N] FAIR gets "decentralized"
    Announcement: https://ai.facebook.com/blog/building-with-ai-across-all-of-meta/ In the new model we will distribute the ownership of these AI systems back to Meta’s product groups. But we do so with the caveat that they must invest in a balanced portfolio that supports existing systems while also advancing the state of the art in AI. submitted by /u/MassivePellfish [link] [comments]  ( 1 min )
    [D] Pretrained SOTA Medical Classification Models for Download
    I'm doing some research that requires a state of the art medical image classifier for comparison. Can anyone point me in the direction of a performant classifier that has been trained on medical images (any type of medical image e.g. CT scan, x-ray, histopathology, etc.)? submitted by /u/i_wasserman [link] [comments]
    [P] Live Video Inference
    Hi everyone, I have a dataset of annotated videos for binary classification. I.e. each video just has one label: positive or negative. Right now I have a trained model in place for taking a complete video as input and outputting a classification. So far so good. Now the clients I'm working with want a feature for inference on live videos as well. I'm sort of at a loss on how to achieve this with the dataset I'm given. Since live video inference would entail inferring frame by frame and I don't actually have any labelled frames (only the entire video is given a label). So what I'm thinking is that I have to somehow figure out which frames in the positively-labelled videos are the ones to actually make the video positive. Is there a theory for this kind of problem? Here's what I've considered so far: 1) Use an anomaly detection algorithm to figure out which frames are anomalous to the others, and use those frames as positive frames for training the frame-by-frame model. 2) This is a long shot but you take out one frame at a time for the positive videos and infer on the new video with that frame missing. Then see which frame(s) makes the confidence score for positivity drop the most after being taken out. I'm sure there are better ideas out there. Anyone have any suggestions or papers I could look into about this kind of problem? submitted by /u/diningeachox [link] [comments]  ( 1 min )
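Idea 2 in the post above (remove one frame at a time and watch the positive-class score) can be sketched as follows. This is a minimal illustration under stated assumptions, not a tested solution: `frame_importance` and `toy_score` are hypothetical names, frames are represented as small feature lists, and `toy_score` stands in for the trained video-level classifier.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # hypothetical stand-in for a frame's features


def frame_importance(
    video: List[Frame], score: Callable[[List[Frame]], float]
) -> List[float]:
    """Return, for each frame, the drop in score when that frame is removed."""
    base = score(video)
    drops = []
    for i in range(len(video)):
        ablated = video[:i] + video[i + 1:]  # the video without frame i
        drops.append(base - score(ablated))
    return drops


def toy_score(video: List[Frame]) -> float:
    """Hypothetical video-level model: the largest first feature over all frames."""
    return max((frame[0] for frame in video), default=0.0)


video = [[0.1], [0.2], [0.9], [0.15]]
drops = frame_importance(video, toy_score)
most_important = max(range(len(drops)), key=drops.__getitem__)
```

One caveat with this design: when neighbouring frames are near-duplicates, removing a single frame may barely change the score, so removing a short window of consecutive frames instead may give a clearer signal.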
    [N] [P] Machine Learning in dbt DAG
    Today, we are super excited to announce our open-source Layer DBT Bigquery Adapter, which runs ML pipelines inside a dbt DAG with BigQuery (more DWH support coming soon...) as the backing data warehouse. It's at a very early stage, but we wanted to share it with you to get your feedback. SELECT id, layer.predict("layer/clothing/models/objectdetection", ARRAY[image]) FROM {{ ref("products") }} With the Layer dbt Adapter you can: Score your data with a machine learning model from Layer with SQL. Train an AutoML model with your data [coming soon...] Train a custom machine learning model with your data [coming soon...] Dive into dbt examples: Predicting survivals of the Titanic - End-to-end ML pipeline (feature extraction + scoring) which predicts the survivals of the Titanic disaster. Sentiment analysis of product reviews - An example that shows how to do multi-language sentiment analysis. Object detection in product images - Detects clothes in product images using a pretrained computer vision model. We would love your feedback on our new open-source tool! It will highly influence our roadmap. Thank you! submitted by /u/mehmetecevit [link] [comments]  ( 1 min )
    [P] A domain adaptation library that I wrote: PyTorch Adapt
    I wrote a library for domain adaptation, which is a type of machine learning that repurposes existing models to work in new domains. A toy example is adapting a model trained on MNIST for use on colored digits. The library is modular, so you can drop an algorithm into your training for-loop like this: hook = DANNHook(optimizers) for data in tqdm(dataloader): data = batch_to_device(data, device) # Optimization is done inside the hook. # The returned loss is for logging. _, loss = hook({**models, **data}) One challenge is that domain adaptation algorithms come in many different forms, making it difficult to write hooks that can be re-used across multiple algorithms. Suppose you want to add a loss function, "Foo", to the DANN algorithm. You want to apply Foo to logits from both source an…  ( 1 min )
    [Project] BFLOAT16 on ALL hardware (>= 2009), up to 2000x faster ML algos, 50% less RAM usage for all old/new hardware - Hyperlearn Reborn.
    Hello everyone!! It's been a while!! Years back I released Hyperlearn https://github.com/danielhanchen/hyperlearn. It has 1.2K Github stars, where I made tonnes of algos faster. I was a bit busy back at NVIDIA and my startup, and I've been casually developing some algos. The question is are people still interested in fast algorithms? Does anyone want to collaborate on reviving Hyperlearn? (Or making a NEW package?) Note the current package is ahhh A MESSS... I'm fixing it - sit tight!! NEW algos for release: PCA with 50% less memory usage with ZERO data corruption!! (Maths tricks :)) (ie no need to do X - X.mean()!!!)) How you may ask???! Randomized PCA with 50% less memory usage (ie no need to do X - X.mean()). Linear Regression is EVEN faster with now Pivoted Cholesky making algo …  ( 4 min )
    [P] TinyML: Slope control for Robots with Arduino and Neuton AI
    Today smart household appliances can be found in almost every home as they greatly simplify our daily routine. A vivid example of such a gadget is a vacuum cleaner robot, representing a concentration of technology: a complex embedded system composed of some microcontrollers, many sensors, and a lot of… software! But how many times does your little helper behave stupidly and get stuck on obstacles? I found a solution to this problem and implemented an inclination estimator system based on an accelerometer using a TinyML model on Arduino's Nicla Sense ME. In my tutorial, I'll provide step-by-step guidelines on how to set up the Nicla and approach the task in two ways (by building regression and multiclass classification models) using a free TinyML framework that automatically builds neural networks and deploys them on small computing devices, such as the Nicla. Check out the full version of my experiment here: https://www.hackster.io/leonardocavagnis/tinyml-slope-control-for-robots-with-arduino-pro-485061 submitted by /u/Leo_Cav [link] [comments]  ( 1 min )
    [P] How do I do preprocessing on a flutter app
    So me and my team are making an audio classification app for android. We used a python backend connected to the flutter app for the actual classification part, but now we want to get rid of that and do it all in flutter with tflite. Problem is, we relied on librosa for our data preprocessing (getting a mel spectrogram) and we can't find any libraries to get mel spectrograms in flutter. Does anyone here know of one? Or can recommend another way to preprocess for our tflite model? submitted by /u/initiald-ejavu [link] [comments]  ( 2 min )
    [D] Inputs on scalable cost effective pipeline
    Hi, all. I have multiple deep learning/machine learning / naive based tasks that I want to deploy online through an API. I have been trying to figure out the best way to do it for some time, but I am overwhelmed by the number of different frameworks and packages available on AWS and GCP. Multiple tools on both platforms seem to have overlapping responsibilities with unclear limitations, making it hard to choose. I want to obtain a scalable pipeline that saves as much money as possible (using, for example, spot pricing) and is easily expandable with new components. My idea was to use Celery and create a task for each data processing method I have. The APIs would simply add an entry to the Celery queue, and the workers would take care of the rest. Scaling the pipeline up or down would then be just a matter of adding or removing Celery workers. How would you approach the problem? Do you know of any resources worth reading to build an architecture like this? Is there any particular tool on AWS or GCP that would allow me to easily take care of this task? submitted by /u/assassin_canederlo [link] [comments]  ( 2 min )
    [D] Can LLMs be updated, e.g. to follow the daily news? Are there any such regularly updated models publicly known yet?
    Some context can be provided in the prompt, but for the bigger picture it is insufficient. I understand companies will not release anything like it until they solve the bias/censorship issues somehow, but did anyone mention an internal demo, or a project in progress? Or is there a smaller-scale open-source experimental project? It would be so much fun to get some summaries or Q&A on current events, the latest science/tech developments, etc. Edit (thanks u/adt): There are projects trying to connect a language model to the Internet and/or some add-on memory for facts. For example, WebGPT (which might be on its way to a product launch), BlenderBot 2.0 by Meta, and Jurassic-X by AI21. submitted by /u/valdanylchuk [link] [comments]  ( 2 min )
    [D] Uncertainty quantification
    Hi all, here is a link to an article I wrote on uncertainty in machine learning and I explain how to quantify uncertainty using uq360 metamodels. https://medium.com/total-digital-factory/how-to-add-confidence-to-your-machine-learning-models-b1228217858e Feel free to reply and suggest any changes submitted by /u/islem75ds [link] [comments]
  • Open

    Subreddits for text to image discussion? (Bees playing volleyball at the beach)
    submitted by /u/meromachin [link] [comments]
    In this article we show you how to use Bert transformer with spacy3 to train joint entities and relation extraction classifier
    submitted by /u/UBIAI [link] [comments]
    Is the Shazam AI an example of supervised or unsupervised learning?
    I think supervised, but I'm not sure. submitted by /u/JustinFieldsBurner [link] [comments]
    Neuralink Update – May 2022 (Part 2)
    submitted by /u/1024cities [link] [comments]
    Best books for CV/ CV applications in Retail
    Experts here, please point me to the best books on computer vision applications in retail. I'm planning to get a feel for the capabilities of the current state of CV technology. Thanks folks! submitted by /u/blazingkrishna [link] [comments]
    Have an .EDU address? Free Deep Learning Cert from Nvidia!
    Hey fellow Redditors, Just thought we would share a cool invite from Nvidia! During the Month of June, Nvidia is offering all students & educators a free Deep Learning workshop with a certificate of completion. NVIDIA DEEP LEARNING INSTITUTE submitted by /u/PSSC-Labs [link] [comments]
    "Eat bug." - Increasingly Verbose bot with GPT-3
    Original input: Eat bug. Final output: Did you know that eating bugs is actually really good for you? You may think that sounds gross, but bugs are packed with nutrients and are a great source of protein. Plus, they are sustainable and environmentally-friendly. So why not give it a try? There are lots of ways to eat bugs, like roasted, in a soup, or even in a smoothie. Trust me, you won't even taste the difference. Eating bugs is a great way to get your protein intake up without harming the environment. Bugs are a sustainable source of food that are packed with nutrients. You can eat them roasted, in a soup, or even in a smoothie. Trust me, you won't even taste the difference. So why not give it a try? Roasting bugs is a great way to make them crispy and delicious. Just toss them in some olive oil and sea salt, and roast them in the oven at 400 degrees for 20 minutes. Then, add them to your favorite salad or enjoy them as a snack. If you're looking for a heartier meal, try making a soup with bugs. This is a great way to get all the nutrients and protein that bugs have to offer. Just add some bugs to your favorite soup recipe, or try out a new one. You can even add bugs to your morning smoothie. This is a great way to get all the nutrients from bugs without even tasting them. Just add a handful of bugs to your favorite smoothie recipe and blend away. You won't even know they're there! Eating bugs is a great way to get your protein intake up without harming the environment. Bugs are a sustainable source of food that are packed with nutrients. So why not give it a try? https://youtu.be/TxYSUwJjmYQ https://github.com/daveshap/IncreasinglyVerbose submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    is there any ai I can use for free to simulate hypothetical wars ??
    Like a downloadable program or something online? submitted by /u/Warm_Lingonberry_204 [link] [comments]  ( 1 min )
    All roads lead to AGI
    submitted by /u/bendee983 [link] [comments]
    Crystals: GPT-3 Explanation Narrated and Visualized in [4K 60 FPS] w/ VQGAN + CLIP
    submitted by /u/MLInsights [link] [comments]
    9+ Best Computer Vision Books for Beginners & advance to read in 2022
    submitted by /u/Lakshmireddys [link] [comments]
    MLEM - The First Open, Git-based Machine Learning Model Deployment and Management Tool Introduced
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Fun with the InspiroBot generator
    ​ https://preview.redd.it/tr9a2ejdz3391.jpg?width=650&format=pjpg&auto=webp&s=073144ff240dd12944739dff14be3f40c5c95089 https://preview.redd.it/0v8ckocez3391.jpg?width=650&format=pjpg&auto=webp&s=7b5fab132a6b0ef325d18f7a464d3b0e5f79ef80 https://preview.redd.it/hg8w7cbfz3391.jpg?width=650&format=pjpg&auto=webp&s=36691b1817f1d0b7423c14bfb52b5132215570cf https://preview.redd.it/g6qv1htbz3391.jpg?width=650&format=pjpg&auto=webp&s=87c75d36b0fda947f5ef0a088634a269358761ca https://preview.redd.it/i12cthhbz3391.jpg?width=650&format=pjpg&auto=webp&s=0d92f86351cc2087cda012c23bc4da29b7085682 https://preview.redd.it/3wel9r5hz3391.jpg?width=650&format=pjpg&auto=webp&s=5843e4cc06648f9c244a02bafbe7b6b5b3b356da https://preview.redd.it/590vgluhz3391.jpg?width=650&format=pjpg&auto=webp&s=07a58029b2fe61b0e17c733adc35526981bd6e89 submitted by /u/Difficult-Reality103 [link] [comments]
    Mandala - GPT-3 explains
    submitted by /u/MLInsights [link] [comments]
  • Open

    Recommendation system using Deep neural network based on Softmax Model
    So, I was going through Google's recommendation system Colab and couldn't understand why, when training the matrix factorization model, we give both user embeddings and movie embeddings to the CFModel class, but in the case of the softmax model we give only movie embeddings. Why not user embeddings too? I am including only the relevant part of the code; the full code can be seen in the link. Matrix Factorization Model: embeddings = { "user_id": U, "movie_id": V } return CFModel(embeddings, train_loss, [metrics]) Softmax Model: metrics = ( {"train_loss": train_loss, "test_loss": test_loss}, {"test_precision_at_10": test_precision_at_10} ) embeddings = {"movie_id": movie_embeddings} return CFModel(embeddings, train_loss, metrics) submitted by /u/Expensive_build [link] [comments]  ( 1 min )
    How to Make the Universe Think for Us: Physicists are building neural networks out of vibrations, voltages and lasers
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    Best Practices for Deploying Language Models
    Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models. Computers that can read and write are here, and they have the potential to fundamentally impact daily life. The future of human–machine interaction is full  ( 3 min )
  • Open

    JONA-ROBOT THE PROPHETESS, WHALE LABORATORY & HOPE
    On Techno-Prophets in The Service of Humanity Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 1 min )
  • Open

    GFN Thursday Jumps Into June With 25 New Games Coming This Month
    Celebrate the onset of summer this GFN Thursday with 25 more games joining the GeForce NOW library, including seven additions this week. Because why would you ever go outside? Looking to spend the summer months in Space Marine armor? Games Workshop is kicking off its Warhammer Skulls event for its sixth year, with great discounts Read article > The post GFN Thursday Jumps Into June With 25 New Games Coming This Month appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    DisPFL: Towards Communication-Efficient Personalized Federated Learning via Decentralized Sparse Training. (arXiv:2206.00187v1 [cs.LG])
    Personalized federated learning is proposed to handle the data heterogeneity problem amongst clients by learning dedicated tailored local models for each user. However, existing works are often built in a centralized way, leading to high communication pressure and high vulnerability when a failure or an attack on the central server occurs. In this work, we propose a novel personalized federated learning framework in a decentralized (peer-to-peer) communication protocol named Dis-PFL, which employs personalized sparse masks to customize sparse local models on the edge. To further save the communication and computation cost, we propose a decentralized sparse training technique, which means that each local model in Dis-PFL only maintains a fixed number of active parameters throughout the whole local training and peer-to-peer communication process. Comprehensive experiments demonstrate that Dis-PFL significantly relieves the communication bottleneck for the busiest node among all clients and, at the same time, achieves higher model accuracy with less computation cost and fewer communication rounds. Furthermore, we demonstrate that our method can easily adapt to heterogeneous local clients with varying computation complexities and achieves better personalized performance.  ( 2 min )
    On Gap-dependent Bounds for Offline Reinforcement Learning. (arXiv:2206.00177v1 [cs.LG])
    This paper presents a systematic study on gap-dependent sample complexity in offline reinforcement learning. Prior work showed when the density ratio between an optimal policy and the behavior policy is upper bounded (the optimal policy coverage assumption), then the agent can achieve an $O\left(\frac{1}{\epsilon^2}\right)$ rate, which is also minimax optimal. We show under the optimal policy coverage assumption, the rate can be improved to $O\left(\frac{1}{\epsilon}\right)$ when there is a positive sub-optimality gap in the optimal $Q$-function. Furthermore, we show when the visitation probabilities of the behavior policy are uniformly lower bounded for states where an optimal policy's visitation probabilities are positive (the uniform optimal policy coverage assumption), the sample complexity of identifying an optimal policy is independent of $\frac{1}{\epsilon}$. Lastly, we present nearly-matching lower bounds to complement our gap-dependent upper bounds.  ( 2 min )
    Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts. (arXiv:2206.00586v1 [cs.LG])
    There is a rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users where their preferences can be collected during the listening of the entire playlist, we study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward collected over time. We show that our algorithms provide better asymptotical regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely alpha-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.  ( 2 min )
    Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data. (arXiv:2206.00524v1 [cs.CL])
    Society needs to develop a system to detect hate and offense to build a healthy and safe environment. However, current research in this field still faces four major shortcomings: deficient pre-processing techniques, indifference to data imbalance issues, modest model performance, and a lack of practical applications. This paper focuses on developing an intelligent system capable of addressing these shortcomings. Firstly, we propose an efficient pre-processing technique to clean comments collected from Vietnamese social media. Secondly, a novel hate speech detection (HSD) model, which combines a pre-trained PhoBERT model and a Text-CNN model, is proposed for solving tasks in Vietnamese. Thirdly, EDA techniques are applied to deal with imbalanced data to improve the performance of classification models. Besides, various experiments were conducted as baselines to compare and investigate the proposed model's performance against state-of-the-art methods. The experimental results show that the proposed PhoBERT-CNN model outperforms SOTA methods and achieves F1-scores of 67.46% and 98.45% on two benchmark datasets, ViHSD and HSD-VLSP, respectively. Finally, we also built a streaming HSD application to demonstrate the practicality of our proposed system.  ( 2 min )
    Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus. (arXiv:2206.00159v1 [cs.LG])
    This paper considers offline multi-agent reinforcement learning. We propose the strategy-wise concentration principle which directly builds a confidence interval for the joint strategy, in contrast to the point-wise concentration principle that builds a confidence interval for each point in the joint action space. For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose a computationally efficient algorithm whose sample complexity enjoys a better dependency on the number of actions than the prior methods based on the point-wise bonus. Furthermore, for offline multi-agent general-sum Markov games, based on the strategy-wise bonus and a novel surrogate function, we give the first algorithm whose sample complexity only scales with $\sum_{i=1}^m A_i$, where $A_i$ is the action size of the $i$-th player and $m$ is the number of players. In sharp contrast, the sample complexity of methods based on the point-wise bonus would scale with the size of the joint action space $\prod_{i=1}^m A_i$ due to the curse of multiagents. Lastly, all of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy that is close to the best strategy in $\Pi$. In this setting, the sample complexity only scales with $\log |\Pi|$ instead of $\sum_{i=1}^m A_i$.  ( 2 min )
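To make the gap between the sum and the product of per-player action sizes concrete, here is a small arithmetic sketch. The action sizes are hypothetical and not taken from the paper; `action_space_sizes` is an illustrative helper name.

```python
import math


def action_space_sizes(action_sizes):
    """Return (sum of per-player action sizes, size of the joint action space)."""
    total = sum(action_sizes)       # corresponds to sum_i A_i
    joint = math.prod(action_sizes)  # corresponds to prod_i A_i
    return total, joint


# Hypothetical example: 5 players with 10 actions each.
total, joint = action_space_sizes([10] * 5)
print(total, joint)  # 50 100000
```

With these toy numbers the sum grows linearly in the number of players while the joint action space grows exponentially, which is the contrast the abstract draws.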
    Graph Machine Learning for Design of High-Octane Fuels. (arXiv:2206.00619v1 [cs.LG])
    Fuels with high-knock resistance enable modern spark-ignition engines to achieve high efficiency and thus low CO2 emissions. Identification of molecules with desired autoignition properties indicated by a high research octane number and a high octane sensitivity is therefore of great practical relevance and can be supported by computer-aided molecular design (CAMD). Recent developments in the field of graph machine learning (graph-ML) provide novel, promising tools for CAMD. We propose a modular graph-ML CAMD framework that integrates generative graph-ML models with graph neural networks and optimization, enabling the design of molecules with desired ignition properties in a continuous molecular space. In particular, we explore the potential of Bayesian optimization and genetic algorithms in combination with generative graph-ML models. The graph-ML CAMD framework successfully identifies well-established high-octane components. It also suggests new candidates, one of which we experimentally investigate and use to illustrate the need for further auto-ignition training data.  ( 2 min )
    A Simple Structure For Building A Robust Model. (arXiv:2204.11596v2 [cs.CV] UPDATED)
    As deep learning applications, especially computer vision programs, are increasingly deployed in our lives, we have to think more urgently about the security of these applications. One effective way to improve the security of deep learning models is to perform adversarial training, which allows the model to be compatible with samples that are deliberately created for use in attacking the model. Based on this, we propose a simple architecture to build a model with a certain degree of robustness, which improves the robustness of the trained network by adding an adversarial sample detection network for cooperative training. At the same time, we design a new data sampling strategy that incorporates multiple existing attacks, allowing the model to adapt to many different adversarial attacks with a single training. We conducted some experiments to test the effectiveness of this design based on the Cifar10 dataset, and the results indicate that it has some degree of positive effect on the robustness of the model. Our code can be found at https://github.com/dowdyboy/simple_structure_for_robust_model .  ( 2 min )
    Temporal Multiresolution Graph Neural Networks For Epidemic Prediction. (arXiv:2205.14831v2 [cs.LG] UPDATED)
    In this paper, we introduce Temporal Multiresolution Graph Neural Networks (TMGNN), the first architecture that both learns to construct the multiscale and multiresolution graph structures and incorporates the time-series signals to capture the temporal changes of the dynamic graphs. We have applied our proposed model to the task of predicting the future spread of epidemics and pandemics based on historical time-series data collected from the actual COVID-19 pandemic and chickenpox epidemic in several European countries, and have obtained competitive results in comparison to previous state-of-the-art temporal architectures and graph learning algorithms. We have shown that capturing the multiscale and multiresolution structures of graphs is important to extract either local or global information that plays a critical role in understanding the dynamics of a global pandemic such as COVID-19, which started from a local city and spread to the whole world. Our work opens a promising research direction in forecasting and mitigating future epidemics and pandemics.  ( 2 min )
    Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top. (arXiv:2206.00529v1 [cs.LG])
    Byzantine-robustness has been gaining a lot of attention due to the growth of the interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA - a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively. At the same time, communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA outperforming previous state-of-the-art for general non-convex and Polyak-Lojasiewicz loss functions. Unlike the concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.  ( 2 min )
    Taming Continuous Posteriors for Latent Variational Dialogue Policies. (arXiv:2205.07633v2 [cs.CL] UPDATED)
    Utilizing amortized variational inference for latent-action reinforcement learning (RL) has been shown to be an effective approach in Task-oriented Dialogue (ToD) systems for optimizing dialogue success. Until now, categorical posteriors have been argued to be one of the main drivers of performance. In this work we revisit Gaussian variational posteriors for latent-action RL and show that they can yield even better performance than categoricals. We achieve this by simplifying the training procedure and propose ways to regularize the latent dialogue policy to retain good response coherence. Using continuous latent representations our model achieves state of the art dialogue success rate on the MultiWOZ benchmark, and also compares well to categorical latent methods in response coherence.  ( 2 min )
    A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin. (arXiv:2206.00648v1 [q-fin.ST])
    Bitcoin, with its ever-growing popularity, has demonstrated extreme price volatility since its origin. This volatility, together with its decentralised nature, makes Bitcoin highly subject to speculative trading as compared to more traditional assets. In this paper, we propose a multimodal model for predicting extreme price fluctuations. This model takes as input a variety of correlated assets, technical indicators, as well as Twitter content. In an in-depth study, we explore whether social media discussions from the general public on Bitcoin have predictive power for extreme price movements. A dataset of 5,000 tweets per day containing the keyword 'Bitcoin' was collected from 2015 to 2021. This dataset, called PreBit, is made available online. In our hybrid model, we use sentence-level FinBERT embeddings, pretrained on financial lexicons, so as to capture the full contents of the tweets and feed them to the model in an understandable way. By combining these embeddings with a Convolutional Neural Network, we built a predictive model for significant market movements. The final multimodal ensemble model includes this NLP model together with a model based on candlestick data, technical indicators and correlated asset prices. In an ablation study, we explore the contribution of the individual modalities. Finally, we propose and backtest a trading strategy based on the predictions of our models with varying prediction thresholds and show that it can be used to build a profitable trading strategy with reduced risk compared to a 'hold' or moving average strategy.  ( 2 min )
    Contextual Bandits with Knapsacks for a Conversion Model. (arXiv:2206.00314v1 [cs.LG])
    We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d. context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a_t,\mathbf{x}_t)$ is gained and vector costs $c(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016] but we show that the techniques introduced in this article may also be applied to the latter case. Namely, the adaptive policies exhibited solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order $(\mathrm{OPT}/B)\sqrt{T}$, where $B$ is the total budget allowed, OPT is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.  ( 2 min )
    Federated Learning in Satellite Constellations. (arXiv:2206.00307v1 [cs.IT])
    Distributed machine learning (DML) results from the synergy between machine learning and connectivity. Federated learning (FL) is a prominent instance of DML in which intermittently connected mobile clients contribute to the training of a common learning model. This paper presents the new context brought to FL by satellite constellations where the connectivity patterns are significantly different from the ones assumed in terrestrial FL. We provide a taxonomy of different types of satellite connectivity relevant for FL and show how the distributed training process can overcome the slow convergence due to long offline times of clients by taking advantage of the predictable intermittency of the satellite communication links.
    Realistic Deep Learning May Not Fit Benignly. (arXiv:2206.00501v1 [cs.LG])
    Studies on benign overfitting provide insights for the success of overparameterized deep learning models. In this work, we examine the benign overfitting phenomena in real-world settings. We found that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the heavy overparameterization setting, benign overfitting can now fail in the presence of label noise. Our study explains our empirical observations, and naturally leads to a simple technique known as self-training that can boost the model's generalization performance. Furthermore, our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.
    Higher-Order Attention Networks. (arXiv:2206.00606v1 [cs.LG])
    This paper introduces higher-order attention networks (HOANs), a novel class of attention-based neural networks defined on a generalized higher-order domain called a combinatorial complex (CC). Similar to hypergraphs, CCs admit arbitrary set-like relations between a collection of abstract entities. Simultaneously, CCs permit the construction of hierarchical higher-order relations analogous to those supported by cell complexes. Thus, CCs effectively generalize both hypergraphs and cell complexes and combine their desirable characteristics. By exploiting the rich combinatorial nature of CCs, HOANs define a new class of message-passing attention-based networks that unifies higher-order neural networks. Our evaluation on tasks related to mesh shape analysis and graph learning demonstrates that HOANs attain competitive, and in some examples superior, predictive performance in comparison to state-of-the-art neural networks.
    Control of Two-way Coupled Fluid Systems with Differentiable Solvers. (arXiv:2206.00342v1 [cs.LG])
    We investigate the use of deep neural networks to control complex nonlinear dynamical systems, specifically the movement of a rigid body immersed in a fluid. We solve the Navier-Stokes equations with two-way coupling, which gives rise to nonlinear perturbations that make the control task very challenging. Neural networks are trained in an unsupervised way to act as controllers with desired characteristics through a process of learning from a differentiable simulator. Here we introduce a set of physically interpretable loss terms to let the networks learn robust and stable interactions. We demonstrate that controllers trained in a canonical setting with quiescent initial conditions reliably generalize to varied and challenging environments such as previously unseen inflow conditions and forcing, although they do not have any fluid information as input. Further, we show that controllers trained with our approach outperform a variety of classical and learned alternatives in terms of evaluation metrics and generalization capabilities.
    Decentralized Competing Bandits in Non-Stationary Matching Markets. (arXiv:2206.00120v1 [stat.ML])
    Understanding complex dynamics of two-sided online matching markets, where the demand-side agents compete to match with the supply-side (arms), has recently received substantial interest. To that end, in this paper, we introduce the framework of decentralized two-sided matching markets under non-stationary (dynamic) environments. We adhere to the serial dictatorship setting, where the demand-side agents have unknown and different preferences over the supply-side (arms), but the arms have fixed and known preferences over the agents. We propose and analyze a decentralized and asynchronous learning algorithm, namely Decentralized Non-stationary Competing Bandits (DNCB), where the agents play (restrictive) successive elimination type learning algorithms to learn their preferences over the arms. The complexity in understanding such a system stems from the fact that the competing bandits choose their actions in an asynchronous fashion, and the lower-ranked agents only get to learn from a set of arms not dominated by the higher-ranked agents, which leads to forced exploration. With carefully defined complexity parameters, we characterize this forced exploration and obtain sub-linear (logarithmic) regret of DNCB. Furthermore, we validate our theoretical findings via experiments.
    MAD-EN: Microarchitectural Attack Detection through System-wide Energy Consumption. (arXiv:2206.00101v1 [cs.CR])
    Microarchitectural attacks have become more threatening to hardware security than before with the increasing diversity of attacks such as Spectre and Meltdown. Vendor patches cannot keep up with the pace of the new threats, which makes the need for dynamic anomaly detection tools more evident than before. Unfortunately, previous studies utilize hardware performance counters that lead to high performance overhead and profile a limited number of microarchitectural attacks due to the small number of counters that can be profiled concurrently. This renders those detection tools inefficient in real-world scenarios. In this study, we introduce MAD-EN, a dynamic detection tool that leverages system-wide energy consumption traces collected from a generic Intel RAPL tool to detect ongoing anomalies in a system. In our experiments, we show that CNN-based MAD-EN can detect 10 different microarchitectural attacks with a total of 15 variants with the highest F1 score of 0.999, which makes our tool the most generic attack detection tool so far. Moreover, individual attacks can be distinguished with a 98% accuracy after an anomaly is detected in a system. We demonstrate that MAD-EN introduces 69.3% less performance overhead compared to performance counter-based detection mechanisms.
    PAGER: Progressive Attribute-Guided Extendable Robust Image Generation. (arXiv:2206.00162v1 [cs.CV])
    This work presents a generative modeling approach based on successive subspace learning (SSL). Unlike most generative models in the literature, our method does not utilize neural networks to analyze the underlying source distribution and synthesize images. The resulting method, called the progressive attribute-guided extendable robust image generative (PAGER) model, has advantages in mathematical transparency, progressive content generation, lower training time, robust performance with fewer training samples, and extendibility to conditional image generation. PAGER consists of three modules: core generator, resolution enhancer, and quality booster. The core generator learns the distribution of low-resolution images and performs unconditional image generation. The resolution enhancer increases image resolution via conditional generation. Finally, the quality booster adds finer details to generated images. Extensive experiments on the MNIST, Fashion-MNIST, and CelebA datasets are conducted to demonstrate the generative performance of PAGER.
    Self-supervised Learning for Label Sparsity in Computational Drug Repositioning. (arXiv:2206.00262v1 [cs.LG])
    Computational drug repositioning aims to discover new uses for marketed drugs, which can accelerate the drug development process and play an important role in the existing drug discovery system. However, the number of validated drug-disease associations is scarce compared to the number of drugs and diseases in the real world. Too few labeled samples will make the classification model unable to learn effective latent factors of drugs, resulting in poor generalization performance. In this work, we propose a multi-task self-supervised learning framework for computational drug repositioning. The framework tackles label sparsity by learning a better drug representation. Specifically, we take the drug-disease association prediction problem as the main task, and the auxiliary task is to use data augmentation strategies and contrastive learning to mine the internal relationships of the original drug features, so as to automatically learn a better drug representation without supervised labels. Joint training ensures that the auxiliary task can improve the prediction accuracy of the main task. More precisely, the auxiliary task improves the drug representation and serves as additional regularization to improve generalization. Furthermore, we design a multi-input decoding network to improve the reconstruction ability of the autoencoder model. We evaluate our model using three real-world datasets. The experimental results demonstrate the effectiveness of the multi-task self-supervised learning framework, and its predictive ability is superior to the state-of-the-art model.
    Strongly Augmented Contrastive Clustering. (arXiv:2206.00380v1 [cs.LG])
    Deep clustering has attracted increasing attention in recent years due to its capability of joint representation learning and clustering via deep neural networks. In its latest developments, contrastive learning has emerged as an effective technique to substantially enhance deep clustering performance. However, the existing contrastive-learning-based deep clustering algorithms mostly focus on some carefully designed augmentations (often with limited transformations to preserve the structure), referred to as weak augmentations, but cannot go beyond the weak augmentations to explore the further opportunities offered by stronger augmentations (with more aggressive transformations or even severe distortions). In this paper, we present an end-to-end deep clustering approach termed strongly augmented contrastive clustering (SACC), which extends the conventional two-augmentation-view paradigm to multiple views and jointly leverages strong and weak augmentations for strengthened deep clustering. Particularly, we utilize a backbone network with triply-shared weights, where a strongly augmented view and two weakly augmented views are incorporated. Based on the representations produced by the backbone, the weak-weak view pair and the strong-weak view pairs are simultaneously exploited for the instance-level contrastive learning (via an instance projector) and the cluster-level contrastive learning (via a cluster projector), which, together with the backbone, can be jointly optimized in a purely unsupervised manner. Experimental results on five challenging image datasets have shown the superior performance of the proposed SACC approach over the state-of-the-art.
    DM$^2$: Distributed Multi-Agent Reinforcement Learning for Distribution Matching. (arXiv:2206.00233v1 [cs.MA])
    Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to explicit coordination schemes. The proposed algorithm (DM$^2$) leverages distribution matching to facilitate independent agents' coordination. Each individual agent matches a target distribution of concurrently sampled trajectories from a joint expert policy. The theoretical analysis shows that under some conditions, if each agent optimizes their individual distribution matching objective, the agents increase a lower bound on the objective of matching the joint expert policy, allowing convergence to the joint expert policy. Further, if the distribution matching objective is aligned with a joint task, a combination of environment reward and distribution matching reward leads to the same equilibrium. Experimental validation on the StarCraft domain shows that combining the reward for distribution matching with the environment reward allows agents to outperform a fully distributed baseline. Additional experiments probe the conditions under which expert demonstrations need to be sampled in order to outperform the fully distributed baseline.
    Evaluating Gaussian Grasp Maps for Generative Grasping Models. (arXiv:2206.00432v1 [cs.RO])
    Generalising robotic grasping to previously unseen objects is a key task in general robotic manipulation. The current method for training many antipodal generative grasping models relies on a binary ground truth grasp map generated from the centre thirds of correctly labelled grasp rectangles. However, these binary maps do not accurately reflect the positions in which a robotic arm can correctly grasp a given object. We propose a continuous Gaussian representation of annotated grasps to generate ground truth training data which achieves a higher success rate on a simulated robotic grasping benchmark. Three modern generative grasping networks are trained with either binary or Gaussian grasp maps, along with recent advancements from the robotic grasping literature, such as discretisation of grasp angles into bins and an attentional loss function. Despite negligible difference according to the standard rectangle metric, Gaussian maps better reproduce the training data and therefore improve success rates when tested on the same simulated robot arm by avoiding collisions with the object: achieving 87.94% accuracy. Furthermore, the best performing model is shown to operate with a high success rate when transferred to a real robotic arm, at high inference speeds, without the need for transfer learning. The system is then shown to be capable of performing grasps on an antagonistic physical object dataset benchmark.
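To make the contrast between binary and Gaussian ground-truth maps concrete, here is a minimal NumPy sketch. It is not the authors' construction: the axis-aligned box, the function names, and the `half_extent` and `sigma` parameters are illustrative assumptions, whereas the paper works with annotated (generally rotated) grasp rectangles.

```python
import numpy as np

def binary_grasp_map(shape, center, half_extent):
    """Hypothetical binary map: 1 inside an axis-aligned box around the grasp centre, 0 elsewhere."""
    rows, cols = np.indices(shape)
    inside = (np.abs(rows - center[0]) <= half_extent[0]) & (
        np.abs(cols - center[1]) <= half_extent[1]
    )
    return inside.astype(float)

def gaussian_grasp_map(shape, center, sigma):
    """Hypothetical continuous map: unnormalised 2D Gaussian that peaks at the grasp centre."""
    rows, cols = np.indices(shape)
    sq_dist = ((rows - center[0]) / sigma[0]) ** 2 + ((cols - center[1]) / sigma[1]) ** 2
    return np.exp(-0.5 * sq_dist)

# Toy 9x9 image with a single grasp annotated at pixel (4, 4); all values are assumptions.
binary_map = binary_grasp_map((9, 9), center=(4, 4), half_extent=(1, 2))
gaussian_map = gaussian_grasp_map((9, 9), center=(4, 4), sigma=(1.0, 2.0))
```

The binary map scores every pixel in the box equally, while the Gaussian map decays smoothly away from the centre, so a network regressing it is pushed towards the middle of the annotated grasp.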
    Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer. (arXiv:2206.00481v1 [cs.CV])
    Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple but still effective self-supervised learning (SSL) strategy to train ViTs that, without any external annotation, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step. We investigated our proposed methods on several image benchmarks, finding that RelViT improves the SSL state-of-the-art methods by a large margin, especially on small datasets.
    A Kernelised Stein Statistic for Assessing Implicit Generative Models. (arXiv:2206.00149v1 [stat.ML])
    Synthetic data generation has become a key ingredient for training machine learning procedures, addressing tasks such as data augmentation, analysing privacy-sensitive data, or visualising representative samples. Assessing the quality of such synthetic data generators hence has to be addressed. As (deep) generative models for synthetic data often do not admit explicit probability distributions, classical statistical procedures for assessing model goodness-of-fit may not be applicable. In this paper, we propose a principled procedure to assess the quality of a synthetic data generator. The procedure is a kernelised Stein discrepancy (KSD)-type test which is based on a non-parametric Stein operator for the synthetic data generator of interest. This operator is estimated from samples which are obtained from the synthetic data generator and hence can be applied even when the model is only implicit. In contrast to classical testing, the sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed. Experimental results on synthetic distributions and trained generative models on synthetic and real datasets illustrate that the method shows improved power performance compared to existing approaches.
    CoNSoLe: Convex Neural Symbolic Learning. (arXiv:2206.00257v1 [cs.LG])
    Learning the underlying equation from data is a fundamental problem in many disciplines. Recent advances rely on Neural Networks (NNs) but do not provide theoretical guarantees in obtaining the exact equations owing to the non-convexity of NNs. In this paper, we propose Convex Neural Symbolic Learning (CoNSoLe) to seek convexity under mild conditions. The main idea is to decompose the recovering process into two steps and convexify each step. In the first step of searching for the right symbols, we convexify deep Q-learning. The key is to maintain double convexity for both the negative Q-function and the negative reward function in each iteration, leading to provable convexity of the negative optimal Q function to learn the true symbol connections. Conditioned on the exact searching result, we construct a Locally Convex equation Learner (LoCaL) neural network to convexify the estimation of symbol coefficients. With such a design, we quantify a large region with strict convexity in the loss surface of LoCaL for commonly used physical functions. Finally, we demonstrate the superior performance of the CoNSoLe framework over the state-of-the-art on a diverse set of datasets.
    Convergence of Stein Variational Gradient Descent under a Weaker Smoothness Condition. (arXiv:2206.00508v1 [math.ST])
    Stein Variational Gradient Descent (SVGD) is an important alternative to the Langevin-type algorithms for sampling from probability distributions of the form $\pi(x) \propto \exp(-V(x))$. In the existing theory of Langevin-type algorithms and SVGD, the potential function $V$ is often assumed to be $L$-smooth. However, this restrictive condition excludes a large class of potential functions such as polynomials of degree greater than $2$. Our paper studies the convergence of the SVGD algorithm for distributions with $(L_0,L_1)$-smooth potentials. This relaxed smoothness assumption was introduced by Zhang et al. [2019a] for the analysis of gradient clipping algorithms. With the help of trajectory-independent auxiliary conditions, we provide a descent lemma establishing that the algorithm decreases the $\mathrm{KL}$ divergence at each iteration and prove a complexity bound for SVGD in the population limit in terms of the Stein Fisher information.
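For readers unfamiliar with the algorithm being analysed, the following is a minimal finite-particle SVGD sketch in NumPy. It is only illustrative: the RBF kernel with a fixed bandwidth, the step size, and the standard Gaussian target (potential $V(x) = \|x\|^2/2$, which is $L$-smooth) are assumptions chosen for simplicity, not the population-limit setting or the $(L_0,L_1)$-smooth potentials studied in the paper.

```python
import numpy as np

def svgd_step(x, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One SVGD update for particles x of shape (n, d) with an RBF kernel (assumed choice)."""
    diff = x[:, None, :] - x[None, :, :]              # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))     # k[i, j] = k(x_i, x_j), symmetric
    # Driving term: sum_j k(x_j, x_i) * grad log p(x_j)
    attract = k @ grad_log_p(x)
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = sum_j k[i, j] * (x_i - x_j) / h^2
    repulse = np.sum(k[:, :, None] * diff, axis=1) / bandwidth ** 2
    return x + step_size * (attract + repulse) / x.shape[0]

# Toy run: target pi(x) proportional to exp(-V(x)) with V(x) = ||x||^2 / 2, so grad log pi(x) = -x.
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=1.0, size=(50, 1))  # start far from the target mean
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)
final_mean = float(particles.mean())
```

The driving term pulls particles towards high-density regions of the target, while the kernel-gradient term keeps them spread out rather than collapsing onto the mode.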
    Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization. (arXiv:2206.00176v1 [cs.LG])
    Discovering governing equations of complex dynamical systems directly from data is a central problem in scientific machine learning. In recent years, the sparse identification of nonlinear dynamics (SINDy) framework, powered by heuristic sparse regression methods, has become a dominant tool for learning parsimonious models. We propose an exact formulation of the SINDy problem using mixed-integer optimization (MIO) to solve the sparsity constrained regression problem to provable optimality in seconds. On a large number of canonical ordinary and partial differential equations, we illustrate the dramatic improvement of our approach in accurate model discovery while being more sample efficient, robust to noise, and flexible in accommodating physical constraints.
    Open Environment Machine Learning. (arXiv:2206.00423v1 [cs.LG])
    Conventional machine learning studies generally assume closed-world scenarios where important factors of the learning process hold invariant. With the great success of machine learning, more and more practical tasks are now being presented to the community, particularly those involving open-world scenarios where important factors are subject to change, called open environment machine learning (Open ML) in this article. Evidently, turning from the closed world to the open world is a grand challenge for machine learning. It becomes even more challenging since, in various big data tasks, data are usually accumulated with time, like streams, while it is hard to train the machine learning model after collecting all data as in conventional studies. This article briefly introduces some advances in this line of research, focusing on techniques concerning emerging new classes, decremental/incremental features, changing data distributions, and varied learning objectives, and discusses some theoretical issues.
    Fairness Transferability Subject to Bounded Distribution Shift. (arXiv:2206.00129v1 [cs.LG])
    Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shift, a phenomenon frequently caused by user adaptation to a deployed model or a dynamic environment. Herein, we develop a bound characterizing such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violation as our primary result. We then develop bounds for specific worked examples, adopting two commonly used fairness definitions (i.e., demographic parity and equalized odds) for two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift as well as real-world data.
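As a concrete reference for the two fairness definitions named in the abstract, here is a small NumPy sketch that measures violations of demographic parity and equalized odds for a binary classifier and a binary group attribute. The function names, the toy data, and the choice to report the equalized-odds violation as the maximum gap over the two true labels are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between group 0 and group 1."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in positive-prediction rate, conditioned on the true label."""
    y_true, y_pred, group = np.asarray(y_true), np.asarray(y_pred), np.asarray(group)
    gaps = []
    for label in (0, 1):  # label 0 gives the false-positive-rate gap, label 1 the true-positive-rate gap
        mask = y_true == label
        rate_0 = y_pred[mask & (group == 0)].mean()
        rate_1 = y_pred[mask & (group == 1)].mean()
        gaps.append(abs(rate_0 - rate_1))
    return max(gaps)

# Hypothetical toy data: eight individuals, four in each group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

dp_gap = demographic_parity_gap(y_pred, group)      # 0.25 vs 0.75 positive rate
eo_gap = equalized_odds_gap(y_true, y_pred, group)  # both conditional gaps equal 0.5 here
```

Evaluating the same two quantities on a shifted target sample is the kind of comparison the transferability bounds are meant to control.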
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v1 [stat.ML])
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. The Bayesian neural network captures the uncertainty of prediction by placing a prior distribution on the parameters of the model and computing the posterior distribution. In this paper, we show that the Bayesian neural network using a spike-and-slab prior is consistent with a nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds, and thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior and show that it has the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback. (arXiv:2205.07217v2 [cs.LG] UPDATED)
    Motivated by applications to online learning in sparse estimation and Bayesian optimization, we consider the problem of online unconstrained nonsubmodular minimization with delayed costs in both full information and bandit feedback settings. In contrast to previous works on online unconstrained submodular minimization, we focus on a class of nonsubmodular functions with special structure, and prove regret guarantees for several variants of the online and approximate online bandit gradient descent algorithms in static and delayed scenarios. We derive bounds for the agent's regret in the full information and bandit feedback settings, even if the delay between choosing a decision and receiving the incurred cost is unbounded. Key to our approach is the notion of $(\alpha, \beta)$-regret and the extension of the generic convex relaxation model from \citet{El-2020-Optimal}, the analysis of which is of independent interest. We conduct and showcase several simulation studies to demonstrate the efficacy of our algorithms.
    AVIDA: Alternating method for Visualizing and Integrating Data. (arXiv:2206.00135v1 [q-bio.QM])
    High-dimensional multimodal data arises in many scientific fields. The integration of multimodal data becomes challenging when there is no known correspondence between the samples and the features of different datasets. To tackle this challenge, we introduce AVIDA, a framework for simultaneously performing data alignment and dimension reduction. In the numerical experiments, Gromov-Wasserstein optimal transport and t-distributed stochastic neighbor embedding are used as the alignment and dimension reduction modules respectively. We show that AVIDA correctly aligns high-dimensional datasets without common features with four synthesized datasets and two real multimodal single-cell datasets. Compared to several existing methods, we demonstrate that AVIDA better preserves structures of individual datasets, especially distinct local structures in the joint low-dimensional visualization, while achieving comparable alignment performance. Such a property is important in multimodal single-cell data analysis as some biological processes are uniquely captured by one of the datasets. In general applications, other methods can be used for the alignment and dimension reduction modules.
    Stochastic Gradient Methods with Preconditioned Updates. (arXiv:2206.00285v1 [math.OC])
    This work considers non-convex finite sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner that is based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient-based methods to give new 'scaled' algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition are assumed. Because our adaptively scaled methods use approximate partial second-order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.
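Hutchinson's approach rests on the identity $\mathrm{diag}(H) = \mathbb{E}[z \odot (Hz)]$ for a random vector $z$ with independent $\pm 1$ (Rademacher) entries. The NumPy sketch below illustrates only this estimator, not the Scaled SARAH or Scaled L-SVRG updates; the function name, the `hvp` callback, and the toy matrix are assumptions introduced for illustration.

```python
import numpy as np

def hutchinson_diagonal(hvp, dim, num_samples=100, rng=None):
    """Estimate diag(H) by averaging z * (H z) over Rademacher probe vectors z.

    `hvp` is any callable returning the Hessian-vector product H @ v, so the
    Hessian itself never has to be formed explicitly.
    """
    rng = np.random.default_rng() if rng is None else rng
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        estimate += z * hvp(z)
    return estimate / num_samples

# Toy check on a small symmetric matrix standing in for a Hessian (hypothetical values).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
diag_estimate = hutchinson_diagonal(lambda v: A @ v, dim=3, num_samples=5000,
                                    rng=np.random.default_rng(0))
```

Each probe costs one Hessian-vector product, which automatic differentiation can supply at roughly the cost of a gradient; the off-diagonal contributions average out because the entries of $z$ are independent with zero mean.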
    A robust and lightweight deep attention multiple instance learning algorithm for predicting genetic alterations. (arXiv:2206.00455v1 [q-bio.QM])
    Deep-learning models based on whole-slide digital pathology images (WSIs) have become increasingly popular for predicting molecular biomarkers. Instance-based models have been the mainstream strategy for predicting genetic alterations using WSIs, although bag-based models along with self-attention mechanism-based algorithms have been proposed for other digital pathology applications. In this paper, we propose a novel Attention-based Multiple Instance Mutation Learning (AMIML) model for predicting gene mutations. AMIML comprises successive 1-D convolutional layers, a decoder, and a residual weight connection to facilitate further integration of a lightweight attention mechanism to detect the most predictive image patches. Using data for 24 clinically relevant genes from four cancer cohorts in The Cancer Genome Atlas (TCGA) studies (UCEC, BRCA, GBM and KIRC), we compared AMIML with one popular instance-based model and four recently published bag-based models (e.g., CHOWDER, HE2RNA, etc.). AMIML demonstrated excellent robustness, not only outperforming all five baseline algorithms in the vast majority of the tested genes (17 out of 24), but also providing near-best performance for the other seven genes. Conversely, the performance of the baseline published algorithms varied across different cancers/genes. In addition, compared to the published models for genetic alterations, AMIML provided a significant improvement for predicting a wide range of genes (e.g., KMT2C, TP53, and SETD2 for KIRC; ERBB2, BRCA1, and BRCA2 for BRCA; JAK1, POLE, and MTOR for UCEC) and produced outstanding predictive models for other clinically relevant gene mutations, which have not been reported in the current literature. Furthermore, with the flexible and interpretable attention-based MIL pooling mechanism, AMIML could further zero in on and detect predictive image patches.
    Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training. (arXiv:2205.10471v2 [cs.CL] UPDATED)
    Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text. Despite its recent flourishing, keyphrase generation for non-English languages has not been extensively investigated. In this paper, we call attention to a new setting named multilingual keyphrase generation and we contribute two new datasets, EcommerceMKP and AcademicMKP, covering six languages. Technically, we propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages. The retrieval-augmented model leverages keyphrase annotations in English datasets to facilitate generating keyphrases in low-resource languages. Given a non-English passage, a cross-lingual dense passage retrieval module finds relevant English passages. Then the associated English keyphrases serve as external knowledge for keyphrase generation in the current language. Moreover, we develop a retriever-generator iterative training algorithm to mine pseudo parallel passage pairs to strengthen the cross-lingual passage retriever. Comprehensive experiments and ablations show that the proposed approach outperforms all baselines.
    Predicting Political Ideology from Digital Footprints. (arXiv:2206.00397v1 [econ.GN])
    This paper proposes a new method to predict individual political ideology from digital footprints on one of the world's largest online discussion forums. We compiled a unique data set from the online discussion forum Reddit that contains information on the political ideology of around 91,000 users as well as records of their comment frequency and the comments' text corpus in over 190,000 different subforums of interest. Applying a set of statistical learning approaches, we show that information about activity in non-political discussion forums alone can very accurately predict a user's political ideology. Depending on the model, we are able to predict the economic dimension of ideology with an accuracy of up to 90.63% and the social dimension with an accuracy of up to 82.02%. In comparison, using the textual features from actual comments does not improve predictive accuracy. Our paper highlights the importance of revealed digital behaviour to complement stated preferences from digital communication when analysing human preferences and behaviour using online data.
    Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation. (arXiv:2206.00270v1 [cs.LG])
    We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks. We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks, which may be adaptively chosen based on the agent's past behaviors. Remarkably, our algorithm uses only a sublinear number of planning calls, which means that the agent eventually learns a policy that is near optimal for multiple tasks (seen or unseen) without the need for deliberate planning. A key to this property is a new structural assumption that enables computation sharing across tasks during exploration. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound of $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ using $\mathcal{O}(dH\log(K))$ planning calls, where $d$ and $d^\prime$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to accumulate experiences and learn to rapidly solve new tasks.
    Physical Modeling using Recurrent Neural Networks with Fast Convolutional Layers. (arXiv:2204.10125v2 [cs.SD] UPDATED)
    Discrete-time modeling of acoustic, mechanical and electrical systems is a prominent topic in the musical signal processing literature. Such models are mostly derived by discretizing a mathematical model, given in terms of ordinary or partial differential equations, using established techniques. Recent work has applied the techniques of machine-learning to construct such models automatically from data for the case of systems which have lumped states described by scalar values, such as electrical circuits. In this work, we examine how similar techniques are able to construct models of systems which have spatially distributed rather than lumped states. We describe several novel recurrent neural network structures, and show how they can be thought of as an extension of modal techniques. As a proof of concept, we generate synthetic data for three physical systems and show that the proposed network structures can be trained with this data to reproduce the behavior of these systems.
    Multi-block Min-max Bilevel Optimization with Applications in Multi-task Deep AUC Maximization. (arXiv:2206.00260v1 [math.OC])
    In this paper, we study multi-block min-max bilevel optimization problems, where the upper level is a non-convex strongly-concave minimax objective and the lower level is a strongly convex objective, and there are multiple blocks of dual variables and lower level problems. Due to the intertwined multi-block min-max bilevel structure, the computational cost at each iteration could be prohibitively high, especially with a large number of blocks. To tackle this challenge, we present a single-loop randomized stochastic algorithm, which requires updates for only a constant number of blocks at each iteration. Under some mild assumptions on the problem, we establish its sample complexity of $\mathcal{O}(1/\epsilon^4)$ for finding an $\epsilon$-stationary point. This matches the optimal complexity for solving stochastic nonconvex optimization under a general unbiased stochastic oracle model. Moreover, we provide two applications of the proposed method in multi-task deep AUC (area under ROC curve) maximization and multi-task deep partial AUC maximization. Experimental results validate our theory and demonstrate the effectiveness of our method on problems with hundreds of tasks.
    Semantic Probabilistic Layers for Neuro-Symbolic Learning. (arXiv:2206.00426v1 [cs.LG])
    We design a predictive layer for structured-output prediction (SOP) that can be plugged into any neural network guaranteeing its predictions are consistent with a set of predefined symbolic constraints. Our Semantic Probabilistic Layer (SPL) can model intricate correlations, and hard constraints, over a structured output space all while being amenable to end-to-end learning via maximum likelihood. SPLs combine exact probabilistic inference with logical reasoning in a clean and modular way, learning complex distributions and restricting their support to solutions of the constraint. As such, they can faithfully, and efficiently, model complex SOP tasks beyond the reach of alternative neuro-symbolic approaches. We empirically demonstrate that SPLs outperform these competitors in terms of accuracy on challenging SOP tasks including hierarchical multi-label classification, pathfinding and preference learning, while retaining perfect constraint satisfaction.
    Top-down inference in an early visual cortex inspired hierarchical Variational Autoencoder. (arXiv:2206.00436v1 [q-bio.NC])
    Interpreting computations in the visual cortex as learning and inference in a generative model of the environment has received wide support both in neuroscience and cognitive science. However, hierarchical computations, a hallmark of visual cortical processing, have remained impervious to generative models because of a lack of adequate tools to address them. Here we capitalize on advances in Variational Autoencoders (VAEs) to investigate the early visual cortex with sparse coding hierarchical VAEs trained on natural images. We design alternative architectures that vary both in terms of the generative and the recognition components of the two latent-layer VAE. We show that representations similar to those found in the primary and secondary visual cortices naturally emerge under mild inductive biases. Importantly, a nonlinear representation for texture-like patterns is a stable property of the high-level latent space resistant to the specific architecture of the VAE, reminiscent of the secondary visual cortex. We show that a neuroscience-inspired choice of the recognition model, which features a top-down processing component, is critical for two signatures of computations with generative models: learning higher order moments of the posterior beyond the mean and image inpainting. Patterns in higher order response statistics provide inspirations for neuroscience to interpret response correlations and for machine learning to evaluate the learned representations through more detailed characterization of the posterior.
    Lower and Upper Bounds for Numbers of Linear Regions of Graph Convolutional Networks. (arXiv:2206.00228v1 [cs.LG])
    Research on characterizing GNN expressiveness has attracted much attention as graph neural networks have risen to prominence over the last five years. The number of linear regions has been considered a good measure for the expressivity of neural networks with piecewise linear activation. In this paper, we present some estimates for the number of linear regions of the classic graph convolutional networks (GCNs) in one-layer and multi-layer scenarios. In particular, we obtain an optimal upper bound for the maximum number of linear regions for one-layer GCNs, and upper and lower bounds for multi-layer GCNs. The simulated estimate shows that the true maximum number of linear regions is possibly closer to our estimated lower bound. These results imply that the number of linear regions of multi-layer GCNs is exponentially greater than that of one-layer GCNs per parameter in general. This suggests that deeper GCNs have more expressivity than shallow GCNs.
    Continuous Prediction with Experts' Advice. (arXiv:2206.00236v1 [cs.LG])
    Prediction with experts' advice is one of the most fundamental problems in online learning and captures many of its technical challenges. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning. In this paper, we employ continuous-time stochastic calculus in order to study the discrete-time experts' problem. We use these tools to design a continuous-time, parameter-free algorithm with improved guarantees for the quantile regret. We then develop an analogous discrete-time algorithm with a very similar analysis and identical quantile regret bounds. Finally, we design an anytime continuous-time algorithm with regret matching the optimal fixed-time rate when the gains are independent Brownian Motions; in many settings, this is the most difficult case. This gives some evidence that, even with adversarial gains, the optimal anytime and fixed-time regrets may coincide.
    OOD Link Prediction Generalization Capabilities of Message-Passing GNNs in Larger Test Graphs. (arXiv:2205.15117v2 [cs.LG] UPDATED)
    This work provides the first theoretical study on the ability of graph Message Passing Neural Networks (gMPNNs) -- such as Graph Neural Networks (GNNs) -- to perform inductive out-of-distribution (OOD) link prediction tasks, where deployment (test) graph sizes are larger than training graphs. We first prove non-asymptotic bounds showing that link predictors based on permutation-equivariant (structural) node embeddings obtained by gMPNNs can converge to a random guess as test graphs get larger. We then propose a theoretically-sound gMPNN that outputs structural pairwise (2-node) embeddings and prove non-asymptotic bounds showing that, as test graphs grow, these embeddings converge to embeddings of a continuous function that retains its ability to predict links OOD. Empirical results on random graphs show agreement with our theoretical results.
    Contrastive Principal Component Learning: Modeling Similarity by Augmentation Overlap. (arXiv:2206.00471v1 [cs.LG])
    Traditional self-supervised contrastive learning methods learn embeddings by pulling views of the same sample together and pushing views of different samples away. Since views of a sample are usually generated via data augmentations, the semantic relationship between samples is ignored. Based on the observation that semantically similar samples are more likely to have similar augmentations, we propose to measure similarity via the distribution of augmentations, i.e., how much the augmentations of two samples overlap. To handle the dimensional and computational complexity, we propose a novel Contrastive Principal Component Learning (CPCL) method composed of a contrastive-like loss and an on-the-fly projection loss to efficiently perform PCA on the augmentation feature, which encodes the augmentation distribution. By CPCL, the learned low-dimensional embeddings theoretically preserve the similarity of augmentation distribution between samples. Empirical results show our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.
    On Layer Normalizations and Residual Connections in Transformers. (arXiv:2206.00330v1 [cs.LG])
    From the perspective of the layer normalization (LN) position, Transformer architectures can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers, e.g., with ten or more layers, often becomes unstable, resulting in useless models. In contrast, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers, e.g., with six or fewer layers. This study first investigates the reason for these discrepant observations empirically and theoretically and finds that (1) the LN in Post-LN is the source of the vanishing gradient problem that mainly leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these findings, we propose a method that provides both higher stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks and demonstrate that our method outperforms Pre-LN and trains stably in both shallow and deep layer settings.
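    The two LN placements named in this abstract can be illustrated with a minimal sketch. It follows the commonly used definitions (Post-LN normalizes after the residual addition, Pre-LN normalizes the sublayer input) and is not the paper's proposed modification. The names `layer_norm`, `post_ln_block`, `pre_ln_block`, and `toy_sublayer` are hypothetical, and the learnable gain and bias of a real LN layer are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.
    Learnable gain and bias are omitted in this sketch."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_block(x, toy_sublayer)  # output is re-normalized at every block
pre = pre_ln_block(x, toy_sublayer)    # input x passes through the block unchanged, plus an update
```

    In the Post-LN block every signal, including the residual, passes through LN, whereas the Pre-LN block keeps an identity path from input to output. That identity path is the structural difference the abstract connects to gradient behaviour.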
    Incentivizing Combinatorial Bandit Exploration. (arXiv:2206.00494v1 [cs.LG])
    Consider a bandit algorithm that recommends actions to self-interested users in a recommendation system. The users are free to choose other actions and need to be incentivized to follow the algorithm's recommendations. While the users prefer to exploit, the algorithm can incentivize them to explore by leveraging the information collected from the previous users. All published work on this problem, known as incentivized exploration, focuses on small, unstructured action sets and mainly targets the case when the users' beliefs are independent across actions. However, realistic exploration problems often feature large, structured action sets and highly correlated beliefs. We focus on a paradigmatic exploration problem with structure: combinatorial semi-bandits. We prove that Thompson Sampling, when applied to combinatorial semi-bandits, is incentive-compatible when initialized with a sufficient number of samples of each arm (where this number is determined in advance by the Bayesian prior). Moreover, we design incentive-compatible algorithms for collecting the initial samples.
    A Generalized Supervised Contrastive Learning Framework. (arXiv:2206.00384v1 [cs.CV])
    Based on recent remarkable achievements of contrastive learning in self-supervised representation learning, supervised contrastive learning (SupCon) has successfully extended the batch contrastive approaches to the supervised context and outperformed cross-entropy on various datasets on ResNet. In this work, we present GenSCL: a generalized supervised contrastive learning framework that seamlessly adapts modern image-based regularizations (such as Mixup-Cutmix) and knowledge distillation (KD) to SupCon by our generalized supervised contrastive loss. Generalized supervised contrastive loss is a further extension of supervised contrastive loss measuring cross-entropy between the similarity of labels and that of latent features. Then a model can learn to what extent contrastives should be pulled closer to an anchor in the latent space. By explicitly and fully leveraging label information, GenSCL breaks the boundary between conventional positives and negatives, and any kind of pre-trained teacher classifier can be utilized. ResNet-50 trained in GenSCL with Mixup-Cutmix and KD achieves state-of-the-art accuracies of 97.6% and 84.7% on CIFAR10 and CIFAR100 without external data, which significantly improves the results reported in the original SupCon (1.6% and 8.2%, respectively). Pytorch implementation is available at https://t.ly/yuUO.
    A comparative study between vision transformers and CNNs in digital pathology. (arXiv:2206.00389v1 [eess.IV])
    Recently, vision transformers were shown to be capable of outperforming convolutional neural networks when pretrained on sufficient amounts of data. In comparison to convolutional neural networks, vision transformers have a weaker inductive bias and therefore allow a more flexible feature detection. Due to their promising feature detection, this work explores vision transformers for tumor detection in digital pathology whole slide images in four tissue types, and for tissue type identification. We compared the patch-wise classification performance of the vision transformer DeiT-Tiny to the state-of-the-art convolutional neural network ResNet18. Due to the sparse availability of annotated whole slide images, we further compared both models pretrained on large amounts of unlabeled whole-slide images using state-of-the-art self-supervised approaches. The results show that the vision transformer performed slightly better than the ResNet18 for three of four tissue types for tumor detection while the ResNet18 performed slightly better for the remaining tasks. The aggregated predictions of both models on slide level were correlated, indicating that the models captured similar imaging features. All together, the vision transformer models performed on par with the ResNet18 while requiring more effort to train. In order to surpass the performance of convolutional neural networks, vision transformers might require more challenging tasks to benefit from their weak inductive bias.
    GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain. (arXiv:2109.02555v2 [cs.CL] UPDATED)
    Deep neural language models have set new breakthroughs in many tasks of Natural Language Processing (NLP). Recent work has shown that deep transformer language models (pretrained on large amounts of texts) can achieve high levels of task-specific few-shot performance comparable to state-of-the-art models. However, the ability of these large language models in few-shot transfer learning has not yet been explored in the biomedical domain. We investigated the performance of two powerful transformer language models, i.e. GPT-3 and BioBERT, in few-shot settings on various biomedical NLP tasks. The experimental results showed that, to a great extent, both models underperform a language model fine-tuned on the full training data. Although GPT-3 had already achieved near state-of-the-art results in few-shot knowledge transfer on open-domain NLP tasks, it could not perform as effectively as BioBERT, which is orders of magnitude smaller than GPT-3. Given that BioBERT was already pretrained on large biomedical text corpora, our study suggests that language models may largely benefit from in-domain pretraining in task-specific few-shot learning. However, in-domain pretraining seems not to be sufficient; novel pretraining and few-shot learning strategies are required in the biomedical NLP domain.
    WaveMix-Lite: A Resource-efficient Neural Network for Image Analysis. (arXiv:2205.14375v2 [cs.CV] UPDATED)
    Gains in the ability to generalize on image analysis tasks for neural networks have come at the cost of an increased number of parameters and layers, dataset sizes, training and test computations, and GPU RAM. We introduce a new architecture -- WaveMix-Lite -- that can generalize on par with contemporary transformers and convolutional neural networks (CNNs) while needing fewer resources. WaveMix-Lite uses a 2D discrete wavelet transform to efficiently mix spatial information from pixels. WaveMix-Lite seems to be a versatile and scalable architectural framework that can be used for multiple vision tasks, such as image classification and semantic segmentation, without requiring significant architectural changes, unlike transformers and CNNs. It is able to meet or exceed several accuracy benchmarks while training on a single GPU. For instance, it achieves state-of-the-art accuracy on five EMNIST datasets, outperforms CNNs and transformers in ImageNet-1K (64$\times$64 images), and achieves an mIoU of 75.32% on the Cityscapes validation set, while using less than one-fifth the number of parameters and half the GPU RAM of comparable CNNs or transformers. Our experiments show that while the convolutional elements of neural architectures exploit the shift-invariance property of images, new types of layers (e.g., wavelet transform) can exploit additional properties of images, such as scale-invariance and finite spatial extents of objects.
    A model aggregation approach for high-dimensional large-scale optimization. (arXiv:2205.07525v2 [cs.LG] UPDATED)
    Bayesian optimization (BO) has been widely used in machine learning and simulation optimization. With the increase in computational resources and storage capacities in these fields, high-dimensional and large-scale problems are becoming increasingly common. In this study, we propose a model aggregation method in the Bayesian optimization (MamBO) algorithm for efficiently solving high-dimensional large-scale optimization problems. MamBO uses a combination of subsampling and subspace embeddings to collectively address high dimensionality and large-scale issues; in addition, a model aggregation method is employed to address the surrogate model uncertainty issue that arises when embedding is applied. This surrogate model uncertainty issue is largely ignored in the embedding literature and practice, and it is exacerbated when the problem is high-dimensional and data are limited. Our proposed model aggregation method reduces these lower-dimensional surrogate model risks and improves the robustness of the BO algorithm. We derive an asymptotic bound for the proposed aggregated surrogate model and prove the convergence of MamBO. Benchmark numerical experiments indicate that our algorithm achieves superior or comparable performance to other commonly used high-dimensional BO algorithms. Moreover, we apply MamBO to a cascade classifier of a machine learning algorithm for face detection, and the results reveal that MamBO finds settings that achieve higher classification accuracy than the benchmark settings and is computationally faster than other high-dimensional BO algorithms.
    Graph Neural Networks are Dynamic Programmers. (arXiv:2203.15544v2 [cs.LG] UPDATED)
    Recent advances in neural algorithmic reasoning with graph neural networks (GNNs) are propped up by the notion of algorithmic alignment. Broadly, a neural network will be better at learning to execute a reasoning task (in terms of sample complexity) if its individual components align well with the target algorithm. Specifically, GNNs are claimed to align with dynamic programming (DP), a general problem-solving strategy which expresses many polynomial-time algorithms. However, has this alignment truly been demonstrated and theoretically quantified? Here we show, using methods from category theory and abstract algebra, that there exists an intricate connection between GNNs and DP, going well beyond the initial observations over individual algorithms such as Bellman-Ford. Exposing this connection, we easily verify several prior findings in the literature, produce better-grounded GNN architectures for edge-centric tasks, and demonstrate empirical results on the CLRS algorithmic reasoning benchmark. We hope our exposition will serve as a foundation for building stronger algorithmically aligned GNNs.
    TEE-based decentralized recommender systems: The raw data sharing redemption. (arXiv:2202.11655v2 [cs.DC] UPDATED)
    Recommenders are central in many applications today. The most effective recommendation schemes, such as those based on collaborative filtering (CF), exploit similarities between user profiles to make recommendations, but potentially expose private data. Federated learning and decentralized learning systems address this by letting the data stay on users' machines to preserve privacy: each user performs the training on local data and only the model parameters are shared. However, sharing the model parameters across the network may still yield privacy breaches. In this paper, we present REX, the first enclave-based decentralized CF recommender. REX exploits trusted execution environments (TEEs), such as Intel software guard extensions (SGX), that provide shielded environments within the processor to improve convergence while preserving privacy. Firstly, REX enables raw data sharing, which ultimately speeds up convergence and reduces the network load. Secondly, REX fully preserves privacy. We analyze the impact of raw data sharing in both deep neural network (DNN) and matrix factorization (MF) recommenders and showcase the benefits of trusted environments in a full-fledged implementation of REX. Our experimental results demonstrate that through raw data sharing, REX significantly decreases the training time by 18.3x and the network load by 2 orders of magnitude over standard decentralized approaches that share only parameters, while fully protecting privacy by leveraging trustworthy hardware enclaves with very little overhead.
    Local Stochastic Factored Gradient Descent for Distributed Quantum State Tomography. (arXiv:2203.11579v2 [quant-ph] UPDATED)
    We propose a distributed Quantum State Tomography (QST) protocol, named Local Stochastic Factored Gradient Descent (Local SFGD), to learn the low-rank factor of a density matrix over a set of local machines. QST is the canonical procedure to characterize the state of a quantum system, which we formulate as a stochastic nonconvex smooth optimization problem. Physically, the estimation of a low-rank density matrix helps characterize the amount of noise introduced by quantum computation. Theoretically, we prove the local convergence of Local SFGD for a general class of restricted strongly convex/smooth loss functions, i.e., Local SFGD converges locally to a small neighborhood of the global optimum at a linear rate with a constant step size, while it locally converges exactly at a sub-linear rate with diminishing step sizes. With a proper initialization, local convergence results imply global convergence. We validate our theoretical findings with numerical simulations of QST on the Greenberger-Horne-Zeilinger (GHZ) state.
    Fishr: Invariant Gradient Variances for Out-of-Distribution Generalization. (arXiv:2109.02934v3 [cs.LG] UPDATED)
    Learning robust models that generalize well under changes in the data distribution is critical for real-world applications. To this end, there has been a growing surge of interest to learn simultaneously from multiple training domains - while enforcing different types of invariance across those domains. Yet, all existing approaches fail to show systematic benefits under controlled evaluation protocols. In this paper, we introduce a new regularization - named Fishr - that enforces domain invariance in the space of the gradients of the loss: specifically, the domain-level variances of gradients are matched across training domains. Our approach is based on the close relations between the gradient covariance, the Fisher Information and the Hessian of the loss: in particular, we show that Fishr eventually aligns the domain-level loss landscapes locally around the final weights. Extensive experiments demonstrate the effectiveness of Fishr for out-of-distribution generalization. Notably, Fishr improves the state of the art on the DomainBed benchmark and performs consistently better than Empirical Risk Minimization. Our code is available at https://github.com/alexrame/fishr.
    AgraSSt: Approximate Graph Stein Statistics for Interpretable Assessment of Implicit Graph Generators. (arXiv:2203.03673v2 [stat.ML] UPDATED)
    We propose and analyse a novel statistical procedure, coined AgraSSt, to assess the quality of graph generators that may not be available in explicit form. In particular, AgraSSt can be used to determine whether a learnt graph generating process is capable of generating graphs that resemble a given input graph. Inspired by Stein operators for random graphs, the key idea of AgraSSt is the construction of a kernel discrepancy based on an operator obtained from the graph generator. AgraSSt can provide interpretable criticisms for a graph generator training procedure and help identify reliable sample batches for downstream tasks. Using Stein's method we give theoretical guarantees for a broad class of random graph models. We provide empirical results on both synthetic input graphs with known graph generation procedures, and real-world input graphs that the state-of-the-art (deep) generative models for graphs are trained on.
    Online Learning for Min Sum Set Cover and Pandora's Box. (arXiv:2202.04870v2 [cs.LG] UPDATED)
    Two central problems in Stochastic Optimization are Min Sum Set Cover and Pandora's Box. In Pandora's Box, we are presented with $n$ boxes, each containing an unknown value and the goal is to open the boxes in some order to minimize the sum of the search cost and the smallest value found. Given a distribution of value vectors, we are asked to identify a near-optimal search order. Min Sum Set Cover corresponds to the case where values are either 0 or infinity. In this work, we study the case where the value vectors are not drawn from a distribution but are presented to a learner in an online fashion. We present a computationally efficient algorithm that is constant-competitive against the cost of the optimal search order. We extend our results to a bandit setting where only the values of the boxes opened are revealed to the learner after every round. We also generalize our results to other commonly studied variants of Pandora's Box and Min Sum Set Cover that involve selecting more than a single value subject to a matroid constraint.
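    The objective described above (search cost plus the smallest value found) can be made concrete with a small sketch. This is only an illustration of the cost of a fixed search order, not the paper's online algorithm. The per-box opening costs `open_costs` and the hindsight choice of stopping point are assumptions, and the function names are hypothetical.

```python
import math

def prefix_cost(order, values, open_costs, k):
    """Cost of opening the first k boxes of `order` and keeping the smallest value seen."""
    opened = order[:k]
    if not opened:
        return math.inf  # nothing opened, so no value can be selected
    return sum(open_costs[i] for i in opened) + min(values[i] for i in opened)

def best_prefix_cost(order, values, open_costs):
    """Smallest cost over all stopping points for a fixed order (stopping chosen in hindsight)."""
    return min(prefix_cost(order, values, open_costs, k)
               for k in range(1, len(order) + 1))

values = [5.0, 1.0, 3.0]      # hidden value inside each box
open_costs = [1.0, 1.0, 1.0]  # assumed unit cost to open each box

cost_a = best_prefix_cost([0, 1, 2], values, open_costs)  # open box 0, then box 1, stop
cost_b = best_prefix_cost([1, 0, 2], values, open_costs)  # open box 1 and stop

# Min Sum Set Cover as the special case with values in {0, infinity}:
# with unit costs, the cost is the position of the first box holding a 0.
mssc = best_prefix_cost([0, 1, 2], [math.inf, math.inf, 0.0], open_costs)
```

    Here the order that opens the low-value box first is cheaper, which is why identifying a good search order is the learning target.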
    Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization. (arXiv:2203.02214v2 [cs.LG] UPDATED)
    Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need for observing expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy as a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO on learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and the potential for co-training agents with various skills.
    StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2. (arXiv:2112.14683v4 [cs.CV] UPDATED)
    Videos show continuous events, yet most, if not all, video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be, time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image + video discriminators pair and design a holistic discriminator that aggregates temporal information by simply concatenating frames' features. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024$^2$ videos for the first time. We build our model on top of StyleGAN2 and it is just ${\approx}5\%$ more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model is tested on four modern 256$^2$ and one 1024$^2$-resolution video synthesis benchmarks. In terms of sheer metrics, it performs on average ${\approx}30\%$ better than the closest runner-up. Project website: https://universome.github.io.
    Model Generation with Provable Coverability for Offline Reinforcement Learning. (arXiv:2206.00316v1 [cs.LG])
    Model-based offline optimization with a dynamics-aware policy provides a new perspective for policy learning and out-of-distribution generalization, where the learned policy could adapt to different dynamics enumerated at the training stage. However, due to the limitations of the offline setting, the learned model could not mimic real dynamics well enough to support reliable out-of-distribution exploration, which still hinders the policy from generalizing well. To narrow the gap, previous works roughly ensemble randomly initialized models to better approximate the real dynamics. However, such practice is costly and inefficient, and provides no guarantee on how well the real dynamics could be approximated by the learned models, which we name coverability in this paper. We actively address this issue by generating models with provable ability to cover real dynamics in an efficient and controllable way. To that end, we design a distance metric for dynamic models based on the occupancy of policies under the dynamics, and propose an algorithm to generate models optimizing their coverage for the real dynamics. We give a theoretical analysis of the model generation process and prove that our algorithm could provide enhanced coverability. As a downstream task, we train a dynamics-aware policy with minor or no conservative penalty, and experiments demonstrate that our algorithm outperforms prior offline methods on existing offline RL benchmarks. We also discover that policies learned by our method have better zero-shot transfer performance, implying their better generalization.
    Towards Context-Aware Neural Performance-Score Synchronisation. (arXiv:2206.00454v1 [cs.SD])
    Music can be represented in multiple forms, such as in the audio form as a recording of a performance, in the symbolic form as a computer readable score, or in the image form as a scan of the sheet music. Music synchronisation provides a way to navigate among multiple representations of music in a unified manner by generating an accurate mapping between them, lending itself applicable to a myriad of domains like music education, performance analysis, automatic accompaniment and music editing. Traditional synchronisation methods compute alignment using knowledge-driven and stochastic approaches, typically employing handcrafted features. These methods are often unable to generalise well to different instruments, acoustic environments and recording conditions, and normally assume complete structural agreement between the performances and the scores. This PhD furthers the development of performance-score synchronisation research by proposing data-driven, context-aware alignment approaches, on three fronts: Firstly, I replace the handcrafted features by employing a metric learning based approach that is adaptable to different acoustic settings and performs well in data-scarce conditions. Secondly, I address the handling of structural differences between the performances and scores, which is a common limitation of standard alignment methods. Finally, I eschew the reliance on both feature engineering and dynamic programming, and propose a completely data-driven synchronisation method that computes alignments using a neural framework, whilst also being robust to structural differences between the performances and scores.
    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v4 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
    Can Mean Field Control (MFC) Approximate Cooperative Multi Agent Reinforcement Learning (MARL) with Non-Uniform Interaction?. (arXiv:2203.00035v2 [cs.LG] UPDATED)
    Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-field `seen' by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of state and action spaces respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL with an error $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.
    Neural Dual Contouring. (arXiv:2202.01999v3 [cs.CV] UPDATED)
    We introduce neural dual contouring (NDC), a new data-driven approach to mesh reconstruction based on dual contouring (DC). Like traditional DC, it produces exactly one vertex per grid cell and one quad for each grid edge intersection, a natural and efficient structure for reproducing sharp features. However, rather than computing vertex locations and edge crossings with hand-crafted functions that depend directly on difficult-to-obtain surface gradients, NDC uses a neural network to predict them. As a result, NDC can be trained to produce meshes from signed or unsigned distance fields, binary voxel grids, or point clouds (with or without normals); and it can produce open surfaces in cases where the input represents a sheet or partial surface. During experiments with five prominent datasets, we find that NDC, when trained on one of the datasets, generalizes well to the others. Furthermore, NDC provides better surface reconstruction accuracy, feature preservation, output complexity, triangle quality, and inference time in comparison to previous learned (e.g., neural marching cubes, convolutional occupancy networks) and traditional (e.g., Poisson) methods. Code and data are available at https://github.com/czq142857/NDC.
    Graph Self-supervised Learning with Accurate Discrepancy Learning. (arXiv:2202.02989v3 [cs.LG] UPDATED)
    Self-supervised learning of graph neural networks (GNNs) aims to learn an accurate representation of the graphs in an unsupervised manner, to obtain transferable representations of them for diverse downstream tasks. Predictive learning and contrastive learning are the two most prevalent approaches for graph self-supervised learning. However, they have their own drawbacks. While the predictive learning methods can learn the contextual relationships between neighboring nodes and edges, they cannot learn global graph-level similarities. Contrastive learning can learn global graph-level similarities, but its objective of maximizing the similarity between two differently perturbed graphs may result in representations that cannot discriminate two similar graphs with different properties. To tackle such limitations, we propose a framework that aims to learn the exact discrepancy between the original and the perturbed graphs, coined as Discrepancy-based Self-supervised LeArning (D-SLA). Specifically, we create multiple perturbations of the given graph with varying degrees of similarity, and train the model to predict whether each graph is the original graph or the perturbed one. Moreover, we further aim to accurately capture the amount of discrepancy for each perturbed graph using the graph edit distance. We validate our D-SLA on various graph-related downstream tasks, including molecular property prediction, protein function prediction, and link prediction tasks, on which ours largely outperforms relevant baselines.
    Sampling from Log-Concave Distributions with Infinity-Distance Guarantees. (arXiv:2111.04089v2 [cs.DS] UPDATED)
    For a $d$-dimensional log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $\nu$ which is $\varepsilon$-close in infinity-distance $\sup_{\theta \in K} |\log \frac{\nu(\theta)}{\pi(\theta)}|$ to $\pi$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $\pi$ can be done by algorithms whose runtime depends polylogarithmically on $\frac{1}{\varepsilon}$, prior algorithms for sampling in $\varepsilon$ infinity distance have runtime bounds that depend polynomially on $\frac{1}{\varepsilon}$. We bridge this gap by presenting an algorithm that outputs a point $\varepsilon$-close to $\pi$ in infinity distance that requires at most $\mathrm{poly}(\log \frac{1}{\varepsilon}, d)$ calls to a membership oracle for $K$ and evaluation oracle for $f$, when $f$ is Lipschitz. Our approach departs from prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $\varepsilon$ infinity-distance error, and presents a method to directly convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. This approach also allows us to obtain an improvement on the dimension $d$ in the running time for the problem of sampling from a log-concave distribution on polytopes $K$ with infinity distance $\varepsilon$, by plugging in TV-distance running time bounds for the Dikin Walk Markov chain.
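    To make the gap between the two distances in the abstract above concrete, here is a minimal sketch on a finite support rather than a convex body. The distributions `pi` and `nu` are made-up illustrative values, not taken from the paper; the point is only that two distributions can be very close in total variation while being far apart in infinity distance.

```python
import math

def tv_distance(nu, pi):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(nu, pi))

def infinity_distance(nu, pi):
    """sup over the support of |log(nu / pi)| (requires strictly positive entries)."""
    return max(abs(math.log(a / b)) for a, b in zip(nu, pi))

# Hypothetical example: nu halves the mass of a rare outcome.
pi = [0.5, 0.499, 0.001]
nu = [0.5, 0.4995, 0.0005]

tv = tv_distance(nu, pi)          # tiny: the rare outcome carries little mass
inf = infinity_distance(nu, pi)   # large: the ratio on the rare outcome is 1/2

print(f"TV distance:       {tv:.6f}")
print(f"infinity distance: {inf:.6f}")
```

    The infinity distance is driven by the worst-case density ratio, which is why a total-variation guarantee alone does not give the multiplicative guarantee that differential privacy needs.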
    Conformal prediction for the design problem. (arXiv:2202.03613v4 [cs.LG] UPDATED)
    Many applications of machine learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, one data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences, then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet lab is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data in the design setting -- one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data -- that is, the designed sequences -- has an unknown and possibly complex relationship with its error on the training data. We introduce a method to quantify predictive uncertainty in such settings. We do so by constructing confidence sets for predictions that account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any prediction algorithm, even when a trained model chooses the test-time input distribution. As a motivating use case, we demonstrate with several real data sets how our method quantifies uncertainty for the predicted fitness of designed proteins, and can therefore be used to select design algorithms that achieve acceptable trade-offs between high predicted fitness and low predictive uncertainty.
    Transformer with Fourier Integral Attentions. (arXiv:2206.00206v1 [cs.LG])
    Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
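    The kernel-regression reading of attention mentioned above can be sketched as follows. This is an illustration under a stated assumption, not the paper's construction: when all keys have the same norm, softmax dot-product attention coincides with a Nadaraya-Watson estimator using a Gaussian kernel, because exp(q·k) and exp(-||q-k||²/2) differ only by factors that cancel in the normalization. The function names and toy data are hypothetical.

```python
import math

def softmax_attention(q, keys, values):
    """Single-query dot-product attention with scalar values (no scaling factor)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z

def gaussian_kernel_regression(q, keys, values):
    """Nadaraya-Watson estimator with an isotropic Gaussian kernel of unit bandwidth."""
    weights = [math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(q, k))) for k in keys]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z

# Assumption: keys all have unit norm, so the ||k||^2 term is constant across keys.
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
values = [1.0, 2.0, 3.0]
q = (0.3, -0.7)

att = softmax_attention(q, keys, values)
nw = gaussian_kernel_regression(q, keys, values)
print(att, nw)
```

    With keys of unequal norm the two estimators differ, which is one way to see that the dot-product form carries an implicit modeling assumption.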
    Multi-Complexity-Loss DNAS for Energy-Efficient and Memory-Constrained Deep Neural Networks. (arXiv:2206.00302v1 [cs.LG])
    Neural Architecture Search (NAS) is increasingly popular to automatically explore the accuracy versus computational complexity trade-off of Deep Learning (DL) architectures. When targeting tiny edge devices, the main challenge for DL deployment is matching the tight memory constraints, hence most NAS algorithms consider model size as the complexity metric. Other methods reduce the energy or latency of DL models by trading off accuracy and number of inference operations. Energy and memory are rarely considered simultaneously, in particular by low-search-cost Differentiable NAS (DNAS) solutions. We overcome this limitation by proposing the first DNAS that directly addresses the most realistic scenario from a designer's perspective: the co-optimization of accuracy and energy (or latency) under a memory constraint, determined by the target HW. We do so by combining two complexity-dependent loss functions during training, with independent strength. Testing on three edge-relevant tasks from the MLPerf Tiny benchmark suite, we obtain rich Pareto sets of architectures in the energy vs. accuracy space, with memory footprint constraints spanning from 75% to 6.25% of the baseline networks. When deployed on a commercial edge device, the STM NUCLEO-H743ZI2, our networks span a range of 2.18x in energy consumption and 4.04% in accuracy for the same memory constraint, and reduce energy by up to 2.2x with negligible accuracy drop with respect to the baseline.
    Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model. (arXiv:2108.07467v3 [cs.SD] UPDATED)
    Abdominal auscultation is a convenient, safe and inexpensive method to assess bowel conditions, which is essential in neonatal care. It helps early detection of neonatal bowel dysfunctions and allows timely intervention. This paper presents a neonatal bowel sound detection method to assist the auscultation. Specifically, a Convolutional Neural Network (CNN) is proposed to classify peristalsis and non-peristalsis sounds. The classification is then optimized using a Laplace Hidden Semi-Markov Model (HSMM). The proposed method is validated on abdominal sounds from 49 newborn infants admitted to our tertiary Neonatal Intensive Care Unit (NICU). The results show that the method can effectively detect bowel sounds with accuracy and area under curve (AUC) score being 89.81% and 83.96% respectively, outperforming 13 baseline methods. Furthermore, the proposed Laplace HSMM refinement strategy is proven capable to enhance other bowel sound detection models. The outcomes of this work have the potential to facilitate future telehealth applications for neonatal care. The source code of our work can be found at: https://bitbucket.org/chirudeakin/neonatal-bowel-sound-classification/
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v2 [cs.LG] UPDATED)
    We propose a simple and scalable approach to causal representation learning for multitask learning. Our approach requires minimal modification to existing ML systems, and improves robustness to target shift. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as target-causing confounders. These confounders induce spurious dependencies between the input and targets. This poses a problem for the conventional approach to multitask learning, due to its assumption that the targets are conditionally independent given the input. Our proposed approach takes into account the dependencies between the targets in order to alleviate target-causing confounding. All that is required in addition to usual practice is to estimate the joint distribution of the targets to switch from discriminative to generative classification, and to predict all targets jointly. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to target shift.
    Human-Algorithm Collaboration: Achieving Complementarity and Avoiding Unfairness. (arXiv:2202.08821v2 [cs.CY] UPDATED)
    Much of machine learning research focuses on predictive accuracy: given a task, create a machine learning model (or algorithm) that maximizes accuracy. In many settings, however, the final prediction or decision of a system is under the control of a human, who uses an algorithm's output along with their own personal expertise in order to produce a combined prediction. One ultimate goal of such collaborative systems is "complementarity": that is, to produce lower loss (equivalently, greater payoff or utility) than either the human or algorithm alone. However, experimental results have shown that even in carefully-designed systems, complementary performance can be elusive. Our work provides three key contributions. First, we provide a theoretical framework for modeling simple human-algorithm systems and demonstrate that multiple prior analyses can be expressed within it. Next, we use this model to prove conditions where complementarity is impossible, and give constructive examples of where complementarity is achievable. Finally, we discuss the implications of our findings, especially with respect to the fairness of a classifier. In sum, these results deepen our understanding of key factors influencing the combined performance of human-algorithm systems, giving insight into how algorithmic tools can best be designed for collaborative environments.
    Optimal Accounting of Differential Privacy via Characteristic Function. (arXiv:2106.08567v3 [cs.LG] UPDATED)
    Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the \emph{characteristic function} ($\phi$-function) of a certain \emph{dominating} privacy loss random variable. We show that our approach allows \emph{natural} adaptive composition like Renyi DP, provides \emph{exactly tight} privacy accounting like PLD, and can be (often \emph{losslessly}) converted to privacy profile and $f$-DP, thus providing $(\epsilon,\delta)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an \emph{analytical Fourier accountant} that represents the \emph{complex} logarithm of $\phi$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.
    Fair Comparison between Efficient Attentions. (arXiv:2206.00244v1 [cs.CV])
    Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of transformers in various vision tasks that require dense prediction. Many studies aiming at solving this problem have been proposed. However, no comparative study of these methods using the same scale has been reported due to different model configurations, training schemes, and new methods. In our paper, we validate these efficient attention models on the ImageNet1K classification task by changing only the attention operation and examining which efficient attention is better.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v2 [stat.ML] UPDATED)
    A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begins by learning a Euclidean embedding of the nodes of the network, from which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effects of adding an $\ell_2$ penalty of the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, when not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
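    For intuition on why an $\ell_2$ penalty on embedding vectors can turn into a nuclear-norm-type penalty, the standard finite-dimensional identity below may help. It is a well-known variational characterization of the nuclear norm, stated here for a generic matrix $M$ with factors $U$, $V$ (symbols introduced for illustration); the paper's asymptotic graphon statement is a separate result whose exact form depends on the subsampling scheme.

```latex
\|M\|_{*}
  \;=\; \min_{U,\,V \,:\, U V^{\top} = M}
        \tfrac{1}{2}\left( \|U\|_{F}^{2} + \|V\|_{F}^{2} \right)
% Consequently, for a loss L acting on the product U V^T,
%   min_{U,V}  L(U V^T) + (lambda/2) (||U||_F^2 + ||V||_F^2)
% has the same optimal value as
%   min_{M}    L(M) + lambda ||M||_*
% provided the factor dimension is at least rank(M).
```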
    Bayesian Optimisation for Robust Model Predictive Control under Model Parameter Uncertainty. (arXiv:2203.00551v3 [cs.RO] UPDATED)
    We propose an adaptive optimisation approach for tuning stochastic model predictive control (MPC) hyper-parameters while jointly estimating probability distributions of the transition model parameters based on performance rewards. In particular, we develop a Bayesian optimisation (BO) algorithm with a heteroscedastic noise model to deal with varying noise across the MPC hyper-parameter and dynamics model parameter spaces. Typical homoscedastic noise models are unrealistic for tuning MPC since stochastic controllers are inherently noisy, and the level of noise is affected by their hyper-parameter settings. We evaluate the proposed optimisation algorithm in simulated control and robotics tasks where we jointly infer control and dynamics parameters. Experimental results demonstrate that our approach leads to higher cumulative rewards and more stable controllers.
    Neural Network Verification with Proof Production. (arXiv:2206.00512v1 [cs.LO])
    Deep neural networks (DNNs) are increasingly being employed in safety-critical systems, and there is an urgent need to guarantee their correctness. Consequently, the verification community has devised multiple techniques and tools for verifying DNNs. When DNN verifiers discover an input that triggers an error, that is easy to confirm; but when they report that no error exists, there is no way to ensure that the verification tool itself is not flawed. As multiple errors have already been observed in DNN verification tools, this calls the applicability of DNN verification into question. In this work, we present a novel mechanism for enhancing Simplex-based DNN verifiers with proof production capabilities: the generation of an easy-to-check witness of unsatisfiability, which attests to the absence of errors. Our proof production is based on an efficient adaptation of the well-known Farkas' lemma, combined with mechanisms for handling piecewise-linear functions and numerical precision errors. As a proof of concept, we implemented our technique on top of the Marabou DNN verifier. Our evaluation on a safety-critical system for airborne collision avoidance shows that proof production succeeds in almost all cases and requires only minimal overhead.
    Rotate the ReLU to implicitly sparsify deep networks. (arXiv:2206.00488v1 [cs.LG])
    In the era of Deep Neural Network based solutions for a variety of real-life tasks, having a compact and energy-efficient deployable model has become fairly important. Most of the existing deep architectures use Rectified Linear Unit (ReLU) activation. In this paper, we propose a novel idea of rotating the ReLU activation to give one more degree of freedom to the architecture. We show that this activation, wherein the rotation is learned via training, results in the elimination of those parameters/filters in the network which are not important for the task. In other words, rotated ReLU seems to be doing implicit sparsification. The slopes of the rotated ReLU activations act as coarse feature extractors and unnecessary features can be eliminated before retraining. Our studies indicate that features always choose to pass through a smaller number of filters in architectures such as ResNet and its variants. Hence, by rotating the ReLU, the weights or the filters that are not necessary are automatically identified and can be dropped, thus giving rise to significant savings in memory and computation. Furthermore, in some cases, we also notice that along with saving in memory and computation we also obtain improvement over the reported performance of the corresponding baseline work in the popular datasets such as MNIST, CIFAR-10, CIFAR-100, and SVHN.
    A Transformer-based Network for Deformable Medical Image Registration. (arXiv:2202.12104v2 [eess.IV] UPDATED)
    Deformable medical image registration plays an important role in clinical diagnosis and treatment. Recently, deep learning (DL) based image registration methods have been widely investigated and have shown excellent computational speed. However, these methods cannot provide sufficient registration accuracy because of their limited ability to represent both the global and local features of the moving and fixed images. To address this issue, this paper proposes a transformer-based image registration method. This method uses a distinctive transformer to extract global and local image features for generating the deformation fields, based on which the registered image is produced in an unsupervised way. Our method improves registration accuracy by means of a self-attention mechanism and bi-level information flow. Experimental results on brain MR image datasets such as LPBA40 and OASIS-1 demonstrate that, compared with several traditional and DL-based registration methods, our method provides higher registration accuracy in terms of Dice values.
    On the Perils of Cascading Robust Classifiers. (arXiv:2206.00278v1 [cs.LG])
    Ensembling certifiably robust neural networks has been shown to be a promising approach for improving the \emph{certified robust accuracy} of neural models. Black-box ensembles that assume only query-access to the constituent models (and their robustness certifiers) during prediction are particularly attractive due to their modular structure. Cascading ensembles are a popular instance of black-box ensembles that appear to improve certified robust accuracies in practice. However, we find that the robustness certifier used by a cascading ensemble is unsound. That is, when a cascading ensemble is certified as locally robust at an input $x$, there can, in fact, be inputs $x'$ in the $\epsilon$-ball centered at $x$, such that the cascade's prediction at $x'$ is different from $x$. We present an alternate black-box ensembling mechanism based on weighted voting which we prove to be sound for robustness certification. Via a thought experiment, we demonstrate that if the constituent classifiers are suitably diverse, voting ensembles can improve certified performance. Our code is available at \url{https://github.com/TristaChi/ensembleKW}.
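    The unsoundness described above can be reproduced in a one-dimensional toy setting. The sketch below is a hypothetical construction for illustration (the two models, the threshold at 2.0, and `EPS` are all assumptions, not taken from the paper or its code): each constituent certifier is sound on its own, yet the cascade reports a certificate at `x` while predicting a different class at a nearby `x_prime`.

```python
EPS = 1.0  # radius of the perturbation ball (assumed for this toy example)

def model_a(x):
    """Threshold classifier; certifies only if the whole EPS-ball lies on one side of 2.0."""
    pred = 0 if x < 2.0 else 1
    certified = (x + EPS < 2.0) or (x - EPS >= 2.0)
    return pred, certified

def model_b(x):
    """Constant classifier; trivially robust everywhere."""
    return 1, True

def cascade(x, models=(model_a, model_b)):
    """Return the prediction of the first model that certifies at x."""
    for model in models:
        pred, certified = model(x)
        if certified:
            return pred, True
    return pred, False  # no model certified: fall back to the last prediction

x, x_prime = 0.5, 1.2           # |x - x_prime| <= EPS
print(cascade(x))               # model_a certifies and predicts class 0
print(cascade(x_prime))         # model_a abstains, model_b certifies class 1
```

    The failure arises because the choice of which model answers is itself not robust: inside the ball around `x`, the cascade can switch from the first model to the second.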
    The Dimpled Manifold Model of Adversarial Examples in Machine Learning. (arXiv:2106.10151v2 [cs.LG] UPDATED)
    The extreme fragility of deep neural networks, when presented with tiny perturbations in their inputs, was independently discovered by several research groups in 2013. However, despite enormous effort, these adversarial examples remained a counterintuitive phenomenon with no simple testable explanation. In this paper, we introduce a new conceptual framework for how the decision boundary between classes evolves during training, which we call the {\em Dimpled Manifold Model}. In particular, we demonstrate that training is divided into two distinct phases. The first phase is a (typically fast) clinging process in which the initially randomly oriented decision boundary gets very close to the low dimensional image manifold, which contains all the training examples. Next, there is a (typically slow) dimpling phase which creates shallow bulges in the decision boundary that move it to the correct side of the training examples. This framework provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, and why they look like random noise rather than like the target class. This explanation is also used to show that a network that was adversarially trained with incorrectly labeled images might still correctly classify most test images, and to show that the main effect of adversarial training is just to deepen the generated dimples in the decision boundary. Finally, we discuss and demonstrate the very different properties of on-manifold and off-manifold adversarial perturbations. We describe the results of numerous experiments which strongly support this new model, using both low dimensional synthetic datasets and high dimensional natural datasets.
    Hopular: Modern Hopfield Networks for Tabular Data. (arXiv:2206.00664v1 [cs.LG])
    While Deep Learning excels in structured data as encountered in vision and natural language processing, it has failed to meet expectations on tabular data. For tabular data, Support Vector Machines (SVMs), Random Forests, and Gradient Boosting are the best performing techniques with Gradient Boosting in the lead. Recently, we saw a surge of Deep Learning methods that were tailored to tabular data but still underperform compared to Gradient Boosting on small-sized datasets. We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets, where each layer is equipped with continuous modern Hopfield networks. The modern Hopfield networks use stored data to identify feature-feature, feature-target, and sample-sample dependencies. Hopular's novelty is that every layer can directly access the original input as well as the whole training set via stored data in the Hopfield networks. Therefore, Hopular can step-wise update its current model and the resulting prediction at every layer like standard iterative learning algorithms. In experiments on small-sized tabular datasets with fewer than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods. In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM and a state-of-the-art Deep Learning method designed for tabular data. Thus, Hopular is a strong alternative to these methods on tabular data.
    A Comparison of Different Approaches to Dynamic Origin-Destination Matrix Estimation in Urban Traffic. (arXiv:2202.00099v2 [math.OC] UPDATED)
    Given the counters of vehicles that traverse the roads of a traffic network, we reconstruct the travel demand that generated them expressed in terms of the number of origin-destination trips made by users. We model the problem as a bi-level optimization problem. At the inner-level, given a tentative demand, we solve a Dynamic Traffic Assignment (DTA) problem to decide the routing of the users between their origins and destinations. Finally, we adjust the number of trips and their origins and destinations at the outer-level to minimize the discrepancy between the counters generated at the inner-level and the given vehicle counts measured by sensors in the traffic network. We solve the DTA problem by employing a mesoscopic model implemented by the traffic simulator SUMO. Thus, the outer problem becomes an optimization problem that minimizes a black-box Objective Function (OF) determined by the results of the simulation, which is a costly computation. We study different approaches to the outer-level problem categorized as gradient-based and derivative-free approaches. Among the gradient-based approaches, we look at an assignment matrix-based approach and an assignment matrix-free approach that uses the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm. Among the derivative-free approaches, we investigate Machine Learning (ML) algorithms to learn a model of the simulator that can then be used as a surrogate OF in the optimization problem. We compare these approaches computationally on an artificial network. The gradient-based approaches perform the best in terms of solution quality and computational requirements. In contrast, the results obtained by the ML approach are currently less satisfactory but provide an interesting avenue for future research.
    FETA: Fairness Enforced Verifying, Training, and Predicting Algorithms for Neural Networks. (arXiv:2206.00553v1 [cs.LG])
    Algorithmic decision making driven by neural networks has become very prominent in applications that directly affect people's quality of life. In this paper, we study the problem of verifying, training, and guaranteeing individual fairness of neural network models. A popular approach for enforcing fairness is to translate a fairness notion into constraints over the parameters of the model. However, such a translation does not always guarantee fair predictions of the trained neural network model. To address this challenge, we develop a counterexample-guided post-processing technique to provably enforce fairness constraints at prediction time. Contrary to prior work that enforces fairness only on points around test or train data, we are able to enforce and guarantee fairness on all points in the input domain. Additionally, we propose an in-processing technique to use fairness as an inductive bias by iteratively incorporating fairness counterexamples in the learning process. We have implemented these techniques in a tool called FETA. Empirical evaluation on real-world datasets indicates that FETA is not only able to guarantee fairness on-the-fly at prediction time but also is able to train accurate models exhibiting a much higher degree of individual fairness.
    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v2 [stat.ML] UPDATED)
    Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence, in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
    Asynchronous Hierarchical Federated Learning. (arXiv:2206.00054v1 [cs.LG])
    Federated Learning is a rapidly growing area of research with various benefits and industry applications. Typical federated patterns have some intrinsic issues such as heavy server traffic, long periods of convergence, and unreliable accuracy. In this paper, we address these issues by proposing asynchronous hierarchical federated learning, in which the central server uses either the network topology or some clustering algorithm to assign clusters for workers (i.e., client devices). In each cluster, a special aggregator device is selected to enable hierarchical learning, which leads to efficient communication between server and workers, so that the burden on the server can be significantly reduced. In addition, an asynchronous federated learning scheme is used to tolerate heterogeneity of the system and achieve fast convergence, i.e., the server aggregates the gradients from the workers weighted by a staleness parameter to update the global model, and regularized stochastic gradient descent is performed in workers, so that the instability of asynchronous learning can be alleviated. We evaluate the proposed algorithm on the CIFAR-10 image classification task; the experimental results demonstrate the effectiveness of asynchronous hierarchical federated learning.
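    To make the staleness-weighted server update described above more concrete, here is a minimal sketch. The abstract does not give the weighting formula, so the polynomial decay used here, along with every function and parameter name, is an assumption chosen for illustration rather than the paper's method.

```python
def staleness_weight(staleness, decay=0.5):
    """Hypothetical polynomial down-weighting: older gradients count less."""
    return 1.0 / (1.0 + staleness) ** decay


def apply_async_update(global_model, gradient, global_version,
                       worker_version, lr=0.1, decay=0.5):
    """Apply one worker gradient to the global model as soon as it arrives.

    Staleness is the number of global updates that happened since the
    worker last pulled the model it computed this gradient on.
    """
    staleness = global_version - worker_version
    weight = staleness_weight(staleness, decay)
    return [p - lr * weight * g for p, g in zip(global_model, gradient)]


# A fresh gradient (staleness 0) is applied at full strength,
# while one that is 3 versions behind is applied at half strength.
model = [1.0, 2.0]
fresh = apply_async_update(model, [1.0, -1.0], global_version=5, worker_version=5)
stale = apply_async_update(model, [1.0, -1.0], global_version=5, worker_version=2)
print(fresh, stale)
```

    The server never waits for slow workers here; it only shrinks the influence of late gradients, which is the general idea behind tolerating system heterogeneity.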
    Calibrate and Debias Layer-wise Sampling for Graph Convolutional Networks. (arXiv:2206.00583v1 [cs.LG])
    To accelerate the training of graph convolutional networks (GCNs), many sampling-based methods have been developed for approximating the embedding aggregation. Among them, a layer-wise approach recursively performs importance sampling to select neighbors jointly for existing nodes in each layer. This paper revisits the approach from a matrix approximation perspective. We identify two issues in the existing layer-wise sampling methods: sub-optimal sampling probabilities and the approximation bias induced by sampling without replacement. We propose two remedies: new sampling probabilities and a debiasing algorithm, to address these issues, and provide the statistical analysis of the estimation variance. The improvements are demonstrated by extensive analyses and experiments on common benchmarks.
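    The sampling idea behind such methods can be sketched with a scalar toy example: a neighbor aggregation sum is approximated by importance sampling with replacement, which gives an unbiased estimate. The weights, features, and the choice of probabilities proportional to each term's magnitude are illustrative assumptions; this is not the paper's proposed sampling distribution or its debiasing algorithm for sampling without replacement.

```python
import random


def sampled_aggregate(weights, features, probs, num_samples, rng):
    """Estimate sum_j weights[j] * features[j] by importance sampling.

    Indices are drawn WITH replacement according to probs, and each
    drawn term is divided by its probability so the estimate is unbiased.
    """
    indices = rng.choices(range(len(weights)), weights=probs, k=num_samples)
    total = sum(weights[j] * features[j] / probs[j] for j in indices)
    return total / num_samples


# Hypothetical adjacency weights and scalar neighbor embeddings.
weights = [0.5, 0.2, 0.2, 0.1]
features = [1.0, -2.0, 4.0, 3.0]
exact = sum(w * h for w, h in zip(weights, features))

# Sample each neighbor proportionally to the magnitude of its contribution.
magnitudes = [abs(w * h) for w, h in zip(weights, features)]
probs = [m / sum(magnitudes) for m in magnitudes]

rng = random.Random(0)
estimate = sampled_aggregate(weights, features, probs, 20000, rng)
print(exact, estimate)
```

    Drawing without replacement from the same probabilities would break the simple division by `probs[j]`, which is the kind of approximation bias the abstract refers to.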
    Graph Neural Networks with Precomputed Node Features. (arXiv:2206.00637v1 [cs.LG])
    Most Graph Neural Networks (GNNs) cannot distinguish some graphs or indeed some pairs of nodes within a graph. This makes it impossible to solve certain classification tasks. However, adding additional node features to these models can resolve this problem. We introduce several such augmentations, including (i) positional node embeddings, (ii) canonical node IDs, and (iii) random features. These extensions are motivated by theoretical results and corroborated by extensive testing on synthetic subgraph detection tasks. We find that positional embeddings significantly outperform other extensions in these tasks. Moreover, positional embeddings have better sample efficiency, perform well on different graph distributions and even outperform learning with ground truth node positions. Finally, we show that the different augmentations perform competitively on established GNN benchmarks, and advise on when to use them.
    Consistent Collaborative Filtering via Tensor Decomposition. (arXiv:2201.11936v2 [cs.IR] UPDATED)
    Collaborative filtering is the de facto standard for analyzing users' activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques where a latent representation of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences and bringing fundamental new information into recommendation. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to 1 (no new information), while in this paper we allow its value to be estimated from data. The proposed SAD model is simple, resulting in an efficient group stochastic gradient descent (SGD) algorithm. We demonstrate the efficiency of SAD in both simulated and real world datasets containing over 1M user-item interactions. By comparing SAD with seven alternative SOTA collaborative filtering models, we show that SAD is not only able to more consistently estimate personalized preferences, but also produce more accurate personalized recommendations. We release the model and inference algorithms in a Python library https://github.com/apple/ml-sad.
    NeuroUnlock: Unlocking the Architecture of Obfuscated Deep Neural Networks. (arXiv:2206.00402v1 [cs.CR])
    The advancements of deep neural networks (DNNs) have led to their deployment in diverse settings, including safety and security-critical applications. As a result, the characteristics of these models have become sensitive intellectual properties that require protection from malicious users. Extracting the architecture of a DNN through leaky side-channels (e.g., memory access) allows adversaries to (i) clone the model, and (ii) craft adversarial attacks. DNN obfuscation thwarts side-channel-based architecture stealing (SCAS) attacks by altering the run-time traces of a given DNN while preserving its functionality. In this work, we expose the vulnerability of state-of-the-art DNN obfuscation methods to these attacks. We present NeuroUnlock, a novel SCAS attack against obfuscated DNNs. NeuroUnlock employs a sequence-to-sequence model that learns the obfuscation procedure and automatically reverts it, thereby recovering the original DNN architecture. We demonstrate the effectiveness of NeuroUnlock by recovering the architecture of 200 randomly generated and obfuscated DNNs running on the Nvidia RTX 2080 TI graphics processing unit (GPU). Moreover, NeuroUnlock recovers the architecture of various other obfuscated DNNs, such as the VGG-11, VGG-13, ResNet-20, and ResNet-32 networks. After recovering the architecture, NeuroUnlock automatically builds a near-equivalent DNN with only a 1.4% drop in the testing accuracy. We further show that launching a subsequent adversarial attack on the recovered DNNs boosts the success rate of the adversarial attack by 51.7% on average compared to launching it on the obfuscated versions. Additionally, we propose a novel methodology for DNN obfuscation, ReDLock, which eradicates the deterministic nature of the obfuscation and achieves 2.16X more resilience to the NeuroUnlock attack. We release NeuroUnlock and ReDLock as open-source frameworks.
    Learning-Augmented Algorithms for Online TSP on the Line. (arXiv:2206.00655v1 [cs.DS])
    We study the online Traveling Salesman Problem (TSP) on the line augmented with machine-learned predictions. In the classical problem, there is a stream of requests released over time along the real line. The goal is to minimize the makespan of the algorithm. We distinguish between the open variant and the closed one, in which we additionally require the algorithm to return to the origin after serving all requests. The state of the art is a $1.64$-competitive algorithm and a $2.04$-competitive algorithm for the closed and open variants, respectively \cite{Bjelde:1.64}. In both cases, a tight lower bound is known \cite{Ausiello:1.75, Bjelde:1.64}. In both variants, our primary prediction model involves predicted positions of the requests. We introduce algorithms that (i) obtain a tight 1.5 competitive ratio for the closed variant and a 1.66 competitive ratio for the open variant in the case of perfect predictions, (ii) are robust against unbounded prediction error, and (iii) are smooth, i.e., their performance degrades gracefully as the prediction error increases. Moreover, we further investigate the learning-augmented setting in the open variant by additionally considering a prediction for the last request served by the optimal offline algorithm. Our algorithm for this enhanced setting obtains a 1.33 competitive ratio with perfect predictions while also being smooth and robust, beating the lower bound of 1.44 we show for our original prediction setting for the open variant. Also, we provide a lower bound of 1.25 for this enhanced setting.
    High-Throughput Approach to Modeling Healthcare Costs Using Electronic Healthcare Records. (arXiv:2011.09497v2 [cs.LG] UPDATED)
    Accurate estimation of healthcare costs is crucial for healthcare systems to plan and effectively negotiate with insurance companies regarding the coverage of patient-care costs. Greater accuracy in estimating healthcare costs would provide mutual benefit for both health systems and the insurers that support these systems by better aligning payment models with patient-care costs. This study presents the results of a generalizable machine learning approach to predicting medical events built from 40 years of data from >860,000 patients pertaining to >6,700 prescription medications, courtesy of Marshfield Clinic in Wisconsin. It was found that models built using this approach performed well when compared to similar studies predicting physician prescriptions of individual medications. In addition to providing a comprehensive predictive model for all drugs in a large healthcare system, the approach taken in this research benefits from potential applicability to a wide variety of other medical events.
    Sparse Bayesian Deep Learning for Dynamic System Identification. (arXiv:2107.12910v2 [eess.SY] UPDATED)
    This paper proposes a sparse Bayesian treatment of deep neural networks (DNNs) for system identification. Although DNNs show impressive approximation ability in various fields, several challenges still exist for system identification problems. First, DNNs are known to be so complex that they can easily overfit the training data. Second, the selection of the input regressors for system identification is nontrivial. Third, uncertainty quantification of the model parameters and predictions is necessary. The proposed Bayesian approach offers a principled way to alleviate the above challenges by marginal likelihood/model evidence approximation and structured group sparsity-inducing priors construction. The identification algorithm is derived as an iterative regularised optimisation procedure that can be solved as efficiently as training typical DNNs. Remarkably, an efficient and recursive Hessian calculation method for each layer of DNNs is developed, turning the intractable training/optimisation process into a tractable one. Furthermore, a practical calculation approach based on the Monte-Carlo integration method is derived to quantify the uncertainty of the parameters and predictions. The effectiveness of the proposed Bayesian approach is demonstrated on several linear and nonlinear system identification benchmarks by achieving good and competitive simulation accuracy. The code to reproduce the experimental results is open-sourced and available online.
    On the Choice of Data for Efficient Training and Validation of End-to-End Driving Models. (arXiv:2206.00608v1 [cs.CV])
    The emergence of data-driven machine learning (ML) has facilitated significant progress in many complicated tasks such as highly-automated driving. While much effort is put into improving the ML models and learning algorithms in such applications, little focus is put on how the training data and/or validation setting should be designed. In this paper we investigate the influence of several data design choices regarding training and validation of deep driving models trainable in an end-to-end fashion. Specifically, (i) we investigate how the amount of training data influences the final driving performance, and which performance limitations are induced through currently used mechanisms to generate training data. (ii) Further, we show by correlation analysis which validation design enables the driving performance measured during validation to generalize well to unknown test environments. (iii) Finally, we investigate the effect of random seeding and non-determinism, giving insights into which reported improvements can be deemed significant. Our evaluations using the popular CARLA simulator provide recommendations regarding data generation and driving route selection for an efficient future development of end-to-end driving models.
    From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. (arXiv:2107.07999v4 [cs.LG] UPDATED)
    In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.
    Algorithmic Fairness Verification with Graphical Models. (arXiv:2109.09447v2 [cs.LG] UPDATED)
    In recent years, machine learning (ML) algorithms have been deployed in safety-critical and high-stake decision-making, where the fairness of algorithms is of paramount importance. Fairness in ML centers on detecting bias towards certain demographic populations induced by an ML classifier and proposes algorithmic solutions to mitigate the bias with respect to different fairness definitions. To this end, several fairness verifiers have been proposed that compute the bias in the prediction of an ML classifier--essentially beyond a finite dataset--given the probability distribution of input features. In the context of verifying linear classifiers, existing fairness verifiers are limited by accuracy due to imprecise modeling of correlations among features and scalability due to restrictive formulations of the classifiers as SSAT/SMT formulas or by sampling. In this paper, we propose an efficient fairness verifier, called FVGM, that encodes the correlations among features as a Bayesian network. In contrast to existing verifiers, FVGM proposes a stochastic subset-sum based approach for verifying linear classifiers. Experimentally, we show that FVGM leads to an accurate and scalable assessment for more diverse families of fairness-enhancing algorithms, fairness attacks, and group/causal fairness metrics than the state-of-the-art fairness verifiers. We also demonstrate that FVGM facilitates the computation of fairness influence functions as a stepping stone to detect the source of bias induced by subsets of features.
    Physics-Informed Neural Nets for Control of Dynamical Systems. (arXiv:2104.02556v3 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) impose known physical laws on the learning of deep neural networks, making sure they respect the physics of the process while decreasing the demand for labeled data. For systems represented by Ordinary Differential Equations (ODEs), the conventional PINN has a continuous time input variable and outputs the solution of the corresponding ODE. In their original form, PINNs do not allow control inputs, nor can they simulate for variable long-range intervals without serious degradation in their predictions. In this context, this work presents a new framework called Physics-Informed Neural Nets for Control (PINC), which proposes a novel PINN-based architecture that is amenable to control problems and able to simulate for longer-range time horizons that are not fixed beforehand, making it a very flexible framework when compared to traditional PINNs. Furthermore, this long-range time simulation of differential equations is faster than numerical methods since it relies only on signal propagation through the network, making it less computationally costly and, thus, a better alternative for simulation of models in Model Predictive Control. We showcase our proposal in the control of two nonlinear dynamic systems: the Van der Pol oscillator and the four-tank system.
    Deep Learning Opacity in Scientific Discovery. (arXiv:2206.00520v1 [cs.AI])
    Philosophers have recently focused on critical, epistemological challenges that arise from the opacity of deep neural networks. One might conclude from this literature that doing good science with opaque models is exceptionally challenging, if not impossible. Yet, this is hard to square with the recent boom in optimism for AI in science alongside a flood of recent scientific breakthroughs driven by AI methods. In this paper, I argue that the disconnect between philosophical pessimism and scientific optimism is driven by a failure to examine how AI is actually used in science. I show that, in order to understand the epistemic justification for AI-powered breakthroughs, philosophers must examine the role played by deep learning as part of a wider process of discovery. The philosophical distinction between the 'context of discovery' and the 'context of justification' is helpful in this regard. I demonstrate the importance of attending to this distinction with two cases drawn from the scientific literature, and show that epistemic opacity need not diminish AI's capacity to lead scientists to significant and justifiable breakthroughs.
    Algorithmic Foundation of Deep X-Risk Optimization. (arXiv:2206.00439v1 [cs.LG])
    X-risk is a term introduced to represent a family of compositional measures or objectives, in which each data point is compared with a set of data points explicitly or implicitly for defining a risk function. It includes many widely used measures or objectives, e.g., AUROC, AUPRC, partial AUROC, NDCG, MAP, top-$K$ NDCG, top-$K$ MAP, listwise losses, p-norm push, top push, precision/recall at top $K$ positions, precision at a certain recall level, contrastive objectives, etc. While these measures/objectives and their optimization algorithms have been studied in the literature of machine learning, computer vision, information retrieval, etc., optimizing these measures/objectives has encountered some unique challenges for deep learning. In this technical report, we survey our recent rigorous efforts for deep X-risk optimization (DXO) by focusing on its algorithmic foundation. We introduce a class of techniques for optimizing X-risk for deep learning. We formulate DXO into three special families of non-convex optimization problems belonging to non-convex min-max optimization, non-convex compositional optimization, and non-convex bilevel optimization, respectively. For each family of problems, we present some strong baseline algorithms and their complexities, which will motivate further research for improving the existing results. Discussions about the presented results and future studies are given at the end.
    Feature Selection for Discovering Distributional Treatment Effect Modifiers. (arXiv:2206.00516v1 [cs.LG])
    Finding the features relevant to the difference in treatment effects is essential to unveil the underlying causal mechanisms. Existing methods seek such features by measuring how greatly the feature attributes affect the degree of the {\it conditional average treatment effect} (CATE). However, these methods may overlook important features because CATE, a measure of the average treatment effect, cannot detect differences in distribution parameters other than the mean (e.g., variance). To resolve this weakness of existing methods, we propose a feature selection framework for discovering {\it distributional treatment effect modifiers}. We first formulate a feature importance measure that quantifies how strongly the feature attributes influence the discrepancy between potential outcome distributions. Then we derive its computationally efficient estimator and develop a feature selection algorithm that can control the type I error rate to the desired level. Experimental results show that our framework successfully discovers important features and outperforms the existing mean-based method.
    Support Vector Machines under Adversarial Label Contamination. (arXiv:2206.00352v1 [cs.LG])
    Machine learning algorithms are increasingly being applied in security-related tasks such as spam and malware detection, although their security properties against deliberate attacks have not yet been widely understood. Intelligent and adaptive attackers may indeed exploit specific vulnerabilities exposed by machine learning techniques to violate system security. Being robust to adversarial data manipulation is thus an important, additional requirement for machine learning algorithms to successfully operate in adversarial settings. In this work, we evaluate the security of Support Vector Machines (SVMs) to well-crafted, adversarial label noise attacks. In particular, we consider an attacker that aims to maximize the SVM's classification error by flipping a number of labels in the training data. We formalize a corresponding optimal attack strategy, and solve it by means of heuristic approaches to keep the computational complexity tractable. We report an extensive experimental analysis on the effectiveness of the considered attacks against linear and non-linear SVMs, both on synthetic and real-world datasets. We finally argue that our approach can also provide useful insights for developing more secure SVM learning algorithms, and also novel techniques in a number of related research areas, such as semi-supervised and active learning.
    Elucidating the Design Space of Diffusion-Based Generative Models. (arXiv:2206.00364v1 [cs.CV])
    We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.
    Neural Improvement Heuristics for Preference Ranking. (arXiv:2206.00383v1 [cs.AI])
    In recent years, Deep Learning based methods have been a revolution in the field of combinatorial optimization. They learn to approximate solutions and constitute an interesting choice when dealing with repetitive problems drawn from similar distributions. Most effort has been devoted to investigating neural constructive methods, while the works that propose neural models to iteratively improve a candidate solution are less frequent. In this paper, we present a Neural Improvement (NI) model for graph-based combinatorial problems that, given an instance and a candidate solution, encodes the problem information by means of edge features. Our model proposes a modification on the pairwise precedence of items to increase the quality of the solution. We demonstrate the practicality of the model by applying it as the building block of a Neural Hill Climber and other trajectory-based methods. The algorithms are used to solve the Preference Ranking Problem and results show that they outperform conventional alternatives in simulated and real-world data. Conducted experiments also reveal that the proposed model can be a milestone in the development of efficiently guided trajectory-based optimization algorithms.
    Byzantine-Robust Online and Offline Distributed Reinforcement Learning. (arXiv:2206.00165v1 [cs.LG])
    We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, an $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any size. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.
    Pre-training via Denoising for Molecular Property Prediction. (arXiv:2206.00133v1 [cs.LG])
    Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Inspired by recent advances in noise regularization, our pre-training objective is based on denoising. Relying on the well-known link between denoising autoencoders and score-matching, we also show that the objective corresponds to learning a molecular force field -- arising from approximating the physical state distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
    Optimization with access to auxiliary information. (arXiv:2206.00395v1 [cs.LG])
    We investigate the fundamental optimization question of minimizing a target function $f(x)$ whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h(x)$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, etc. We propose two generic new algorithms which are applicable in all these settings and prove using only an assumption on the Hessian similarity between the target and side information that we can benefit from this framework.
    Hands-Up: Leveraging Synthetic Data for Hands-On-Wheel Detection. (arXiv:2206.00148v1 [cs.CV])
    Over the past few years there has been major progress in the field of synthetic data generation using simulation based techniques. These methods use high-end graphics engines and physics-based ray-tracing rendering in order to represent the world in 3D and create highly realistic images. Datagen has specialized in the generation of high-quality 3D humans, realistic 3D environments and generation of realistic human motion. This technology has been developed into a data generation platform which we used for these experiments. This work demonstrates the use of synthetic photo-realistic in-cabin data to train a Driver Monitoring System that uses a lightweight neural network to detect whether the driver's hands are on the wheel. We demonstrate that when only a small amount of real data is available, synthetic data can be a simple way to boost performance. Moreover, we adopt the data-centric approach and show how performing error analysis and generating the missing edge-cases in our platform boosts performance. This showcases the ability of human-centric synthetic data to generalize well to the real world, and help train algorithms in computer vision settings where data from the target domain is scarce or hard to collect.
    Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble. (arXiv:2206.00238v1 [cs.LG])
    Inverse reinforcement learning (IRL) recovers the underlying reward function from expert demonstrations. A generalizable reward function is especially desirable, as it captures the fundamental motivation of the expert. However, classical IRL methods can only recover reward functions coupled with the training dynamics, and are thus hard to generalize to a changed environment. Previous dynamics-agnostic reward learning methods have strict assumptions, such as that the reward function has to be state-only. This work proposes a general approach to learn transferable reward functions, Dynamics-Agnostic Discriminator-Ensemble Reward Learning (DARL). Following the adversarial imitation learning (AIL) framework, DARL learns a dynamics-agnostic discriminator on a latent space mapped from the original state-action space. The latent space is learned to contain the least information of the dynamics. Moreover, to reduce the reliance of the discriminator on policies, the reward function is represented as an ensemble of the discriminators during training. We assess DARL in four MuJoCo tasks with dynamics transfer. Empirical results compared with the state-of-the-art AIL methods show that DARL can learn a reward that is more consistent with the true reward, thus obtaining higher environment returns.
    A Survey on Deep Learning for Skin Lesion Segmentation. (arXiv:2206.00356v1 [eess.IV])
    Skin cancer is a major public health problem that could benefit from computer-aided diagnosis to reduce the burden of this common disease. Skin lesion segmentation from images is an important step toward achieving this goal. However, the presence of natural and artificial artifacts (e.g., hair and air bubbles), intrinsic factors (e.g., lesion shape and contrast), and variations in image acquisition conditions make skin lesion segmentation a challenging task. Recently, various researchers have explored the applicability of deep learning models to skin lesion segmentation. In this survey, we cross-examine 134 research papers that deal with deep learning based segmentation of skin lesions. We analyze these works along several dimensions, including input data (datasets, preprocessing, and synthetic data generation), model design (architecture, modules, and losses), and evaluation aspects (data annotation requirements and segmentation performance). We discuss these dimensions both from the viewpoint of select seminal works, and from a systematic viewpoint, examining how those choices have influenced current trends, and how their limitations should be addressed. We summarize all examined works in a comprehensive table to facilitate comparisons.  ( 2 min )
    Decompositional Generation Process for Instance-Dependent Partial Label Learning. (arXiv:2204.03845v2 [cs.LG] UPDATED)
    Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected because the generation process of the candidate labels is always instance-dependent. Therefore, it deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels can be decomposed into two sequential parts, where the correct label emerges first in the mind of the annotator but then the incorrect labels related to the feature are also selected with the correct label as candidate labels due to uncertainty of labeling. Motivated by this consideration, we propose a novel PLL method that performs Maximum A Posteriori (MAP) estimation based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed method.
    To the Fairness Frontier and Beyond: Identifying, Quantifying, and Optimizing the Fairness-Accuracy Pareto Frontier. (arXiv:2206.00074v1 [stat.ML])
    Algorithmic fairness has emerged as an important consideration when using machine learning to make high-stakes societal decisions. Yet, improved fairness often comes at the expense of model accuracy. While aspects of the fairness-accuracy tradeoff have been studied, most work reports the fairness and accuracy of various models separately; this makes model comparisons nearly impossible without a model-agnostic metric that reflects the balance of the two desiderata. We seek to identify, quantify, and optimize the empirical Pareto frontier of the fairness-accuracy tradeoff. Specifically, we identify and outline the empirical Pareto frontier through Tradeoff-between-Fairness-and-Accuracy (TAF) Curves; we then develop a metric to quantify this Pareto frontier through the weighted area under the TAF Curve which we term the Fairness-Area-Under-the-Curve (FAUC). TAF Curves provide the first empirical, model-agnostic characterization of the Pareto frontier, while FAUC provides the first metric to impartially compare model families on both fairness and accuracy. Both TAF Curves and FAUC can be employed with all group fairness definitions and accuracy measures. Next, we ask: Is it possible to expand the empirical Pareto frontier and thus improve the FAUC for a given collection of fitted models? We answer affirmatively by developing a novel fair model stacking framework, FairStacks, that solves a convex program to maximize the accuracy of a model ensemble subject to a score-bias constraint. We show that optimizing with FairStacks always expands the empirical Pareto frontier and improves the FAUC; we additionally study other theoretical properties of our proposed approach. Finally, we empirically validate TAF, FAUC, and FairStacks through studies on several real benchmark data sets, showing that FairStacks leads to major improvements in FAUC that outperform existing algorithmic fairness approaches.
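    As a rough illustration of the empirical Pareto frontier mentioned above, the following minimal Python sketch filters a set of fitted models down to those not dominated in (unfairness, accuracy). The function name, the tuple layout, and the example scores are hypothetical; this is only the generic notion of Pareto dominance, not the paper's TAF Curve or FAUC construction.

```python
def pareto_frontier(models):
    """Return the models that are not dominated in (unfairness, accuracy).

    Each model is a (name, unfairness, accuracy) tuple. Lower unfairness and
    higher accuracy are both treated as better (an assumption of this sketch).
    """
    frontier = []
    best_accuracy = float("-inf")
    # Scan by increasing unfairness; ties are broken by higher accuracy first.
    for name, unfairness, accuracy in sorted(models, key=lambda m: (m[1], -m[2])):
        # A model joins the frontier only if it beats every fairer model on accuracy.
        if accuracy > best_accuracy:
            frontier.append((name, unfairness, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical scores, made up for illustration only.
models = [
    ("A", 0.05, 0.80),
    ("B", 0.10, 0.78),
    ("C", 0.10, 0.85),
    ("D", 0.20, 0.90),
    ("E", 0.25, 0.88),
]
frontier = pareto_frontier(models)
print([name for name, _, _ in frontier])
```

    In this toy input, B and E drop out because another model is at least as fair and strictly more accurate.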
    Exploratory Methods for Relation Discovery in Archival Data. (arXiv:2202.11361v2 [cs.LG] UPDATED)
    In this article we propose a holistic approach to discover relations in art historical communities and enrich historians' biographies and archival descriptions with graph patterns relevant to art historiographic enquiry. We use exploratory data analysis to detect patterns, we select features, and we use them to evaluate classification models to predict new relations, to be recommended to archivists during the cataloguing phase. Results show that relations based on biographical information can be addressed with higher precision than relations based on research topics or institutional relations. Deterministic and a priori rules present better results than probabilistic methods.
    Generative Modeling Helps Weak Supervision (and Vice Versa). (arXiv:2203.12023v3 [cs.LG] UPDATED)
    Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak supervision derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.
    ForestPrune: Compact Depth-Controlled Tree Ensembles. (arXiv:2206.00128v1 [stat.ML])
    Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble for a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
    FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems. (arXiv:2206.00065v1 [cs.DC])
    Edge computing enables smart IoT-based systems via concurrent and continuous execution of latency-sensitive machine learning (ML) applications. These edge-based machine learning systems are often battery-powered (i.e., energy-limited). They use heterogeneous resources with diverse computing performance (e.g., CPU, GPU, and/or FPGAs) to fulfill the latency constraints of ML applications. The challenge is to allocate user requests for different ML applications on the Heterogeneous Edge Computing Systems (HEC) with respect to both the energy and latency constraints of these systems. To this end, we study and analyze resource allocation solutions that can increase the on-time task completion rate while considering the energy constraint. Importantly, we investigate edge-friendly (lightweight) multi-objective mapping heuristics that do not become biased toward a particular application type to achieve the objectives; instead, the heuristics consider "fairness" across the concurrent ML applications in their mapping decisions. Performance evaluations demonstrate that the proposed heuristic outperforms widely-used heuristics in heterogeneous systems in terms of the latency and energy objectives, particularly, at low to moderate request arrival rates. We observed 8.9% improvement in on-time task completion rate and 12.6% in energy-saving without imposing any significant overhead on the edge system.
    Star algorithm for NN ensembling. (arXiv:2206.00255v1 [cs.LG])
    Neural network ensembling is a common and robust way to increase model efficiency. In this paper, we propose a new neural network ensemble algorithm based on Audibert's empirical star algorithm. We provide an optimal theoretical minimax bound on the excess squared risk. Additionally, we empirically study this algorithm on regression and classification tasks and compare it to the most popular ensembling methods.
    An Indirect Rate-Distortion Characterization for Semantic Sources: General Model and the Case of Gaussian Observation. (arXiv:2201.12477v2 [cs.IT] UPDATED)
    A new source model, which consists of an intrinsic state part and an extrinsic observation part, is proposed and its information-theoretic characterization, namely its rate-distortion function, is defined and analyzed. Such a source model is motivated by the recent surge of interest in the semantic aspect of information: the intrinsic state corresponds to the semantic feature of the source, which in general is not observable but can only be inferred from the extrinsic observation. There are two distortion measures, one between the intrinsic state and its reproduction, and the other between the extrinsic observation and its reproduction. Under a given code rate, the tradeoff between these two distortion measures is characterized by the rate-distortion function, which is solved via the indirect rate-distortion theory and is termed the semantic rate-distortion function of the source. As an application of the general model and its analysis, the case of Gaussian extrinsic observation is studied, assuming a linear relationship between the intrinsic state and the extrinsic observation, under a quadratic distortion structure. The semantic rate-distortion function is shown to be the solution of a convex programming problem with respect to an error covariance matrix, and a reverse water-filling type of solution is provided when the model further satisfies a diagonalizability condition.
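    For readers unfamiliar with the term, the sketch below shows classical reverse water-filling for independent Gaussian components under squared-error distortion: each component receives distortion min(level, variance), with the level chosen so that the distortions sum to the budget. This is the textbook version only, offered as background; it is not the paper's semantic rate-distortion solution, and the function name and example values are hypothetical.

```python
import math


def reverse_water_filling(variances, total_distortion, iters=100):
    """Classical reverse water-filling for independent Gaussian components.

    Returns (level, per-component distortions, rate in bits).
    """
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")
    lo, hi = 0.0, max(variances)
    # Bisection on the water level: the allocated distortion grows with the level.
    for _ in range(iters):
        level = (lo + hi) / 2
        if sum(min(level, v) for v in variances) < total_distortion:
            lo = level
        else:
            hi = level
    level = hi
    distortions = [min(level, v) for v in variances]
    # Components with variance at or below the level get zero rate.
    rate = sum(0.5 * math.log2(v / d) for v, d in zip(variances, distortions))
    return level, distortions, rate


# Hypothetical example: three components and a distortion budget of 1.5.
level, distortions, rate = reverse_water_filling([4.0, 1.0, 0.25], 1.5)
print(level, distortions, rate)
```

    In this example the smallest-variance component falls below the water level, so it is reproduced at zero rate and its full variance counts as distortion.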
    RMT-Net: Reject-aware Multi-Task Network for Modeling Missing-not-at-random Data in Financial Credit Scoring. (arXiv:2206.00568v1 [cs.LG])
    In financial credit scoring, loan applications may be approved or rejected. We can only observe default/non-default labels for approved samples but have no observations for rejected samples, which leads to missing-not-at-random selection bias. Machine learning models trained on such biased data are inevitably unreliable. In this work, we find that the default/non-default classification task and the rejection/approval classification task are highly correlated, according to both real-world data study and theoretical analysis. Consequently, the learning of default/non-default can benefit from rejection/approval. Accordingly, we for the first time propose to model the biased credit scoring data with Multi-Task Learning (MTL). Specifically, we propose a novel Reject-aware Multi-Task Network (RMT-Net), which learns the task weights that control the information sharing from the rejection/approval task to the default/non-default task by a gating network based on rejection probabilities. RMT-Net leverages the relation between the two tasks that the larger the rejection probability, the more the default/non-default task needs to learn from the rejection/approval task. Furthermore, we extend RMT-Net to RMT-Net++ for modeling scenarios with multiple rejection/approval strategies. Extensive experiments are conducted on several datasets, and strongly verify the effectiveness of RMT-Net on both approved and rejected samples. In addition, RMT-Net++ further improves RMT-Net's performance.
    One Positive Label is Sufficient: Single-Positive Multi-Label Learning with Label Enhancement. (arXiv:2206.00517v1 [cs.LG])
    Multi-label learning (MLL) learns from examples that are each associated with multiple labels simultaneously, where the high cost of annotating all relevant labels for each training example is challenging for real-world applications. To cope with the challenge, we investigate single-positive multi-label learning (SPMLL) where each example is annotated with only one relevant label and show that one can successfully learn a theoretically grounded multi-label classifier for the problem. In this paper, a novel SPMLL method, Single-positive MultI-label learning with Label Enhancement, is proposed. Specifically, an unbiased risk estimator is derived, which could be guaranteed to approximately converge to the optimal risk minimizer of fully supervised learning and shows that one positive label of each instance is sufficient to train the predictive model. Then, the corresponding empirical risk estimator is established via recovering the latent soft label as a label enhancement process, where the posterior density of the latent soft labels is approximated by the variational Beta density parameterized by an inference model. Experiments on benchmark datasets validate the effectiveness of the proposed method.
    Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training. (arXiv:2206.00621v1 [cs.CL])
    In this paper, we introduce Cross-View Language Modeling, a simple and effective language model pre-training framework that unifies cross-lingual cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Notably, CCLM is the first multi-lingual multi-modal model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.  ( 2 min )
    Attention-embedded Quadratic Network (Qttention) for Effective and Interpretable Bearing Fault Diagnosis. (arXiv:2206.00390v1 [cs.LG])
    Bearing fault diagnosis is of great importance to decrease the damage risk of rotating machines and further improve economic profits. Recently, machine learning, represented by deep learning, has made great progress in bearing fault diagnosis. However, applying deep learning to such a task still faces two major problems. On the one hand, deep learning loses its effectiveness when bearing data are noisy or big data are unavailable, making deep learning hard to implement in industrial fields. On the other hand, a deep network is notoriously a black box. It is difficult to know how a model distinguishes faulty signals from normal ones and the physics principle behind the classification. To solve the effectiveness and interpretability issues, we prototype a convolutional network with recently-invented quadratic neurons. This quadratic neuron empowered network can qualify the noisy and small bearing data due to the strong feature representation ability of quadratic neurons. Moreover, we independently derive the attention mechanism from a quadratic neuron, referred to as qttention, by factorizing the learned quadratic function in analogy to attention, making the model with quadratic neurons inherently interpretable. Experiments on the public and our datasets demonstrate that the proposed network can facilitate effective and interpretable bearing fault diagnosis.
    The robust way to stack and bag: the local Lipschitz way. (arXiv:2206.00513v1 [cs.LG])
    Recent research has established that the local Lipschitz constant of a neural network directly influences its adversarial robustness. We exploit this relationship to construct an ensemble of neural networks which not only improves the accuracy, but also provides increased adversarial robustness. The local Lipschitz constants for two different ensemble methods - bagging and stacking - are derived and the architectures best suited for ensuring adversarial robustness are deduced. The proposed ensemble architectures are tested on MNIST and CIFAR-10 datasets in the presence of white-box attacks, FGSM and PGD. The proposed architecture is found to be more robust than a) a single network and b) traditional ensemble methods.
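    For context on the FGSM attack named above, here is a minimal sketch of the fast gradient sign method on a toy logistic-regression classifier, where the input gradient of the cross-entropy loss has the closed form (p - y) * w. The model, weights, and epsilon are hypothetical and chosen only for illustration; the paper evaluates neural network ensembles on MNIST and CIFAR-10, which this sketch does not reproduce.

```python
import math


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def predict(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_logistic(x, y, w, b, eps):
    """One FGSM step: move each input coordinate by eps along the sign of the loss gradient."""
    p = predict(x, w, b)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input with true label y = 1.
w, b = [0.5, -0.25], 0.0
x, y = [1.0, -2.0], 1
x_adv = fgsm_logistic(x, y, w, b, eps=0.1)
print(predict(x, w, b), predict(x_adv, w, b))
```

    The perturbed input lowers the predicted probability of the true class while changing each coordinate by at most eps.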
    IDANI: Inference-time Domain Adaptation via Neuron-level Interventions. (arXiv:2206.00259v1 [cs.CL])
    Large pre-trained models are usually fine-tuned on downstream task data, and tested on unseen data. When the train and test data come from different domains, the model is likely to struggle, as it is not adapted to the test domain. We propose a new approach for domain adaptation (DA), using neuron-level interventions: We modify the representation of each test example in specific neurons, resulting in a counterfactual example from the source domain, which the model is more familiar with. The modified example is then fed back into the model. While most other DA methods are applied during training time, ours is applied during inference only, making it more efficient and applicable. Our experiments show that our method improves performance on unseen domains.
    Optimisation of Structured Neural Controller Based on Continuous-Time Policy Gradient. (arXiv:2201.06262v3 [cs.LG] UPDATED)
    This study presents a policy optimisation framework for structured nonlinear control of continuous-time (deterministic) dynamic systems. The proposed approach prescribes a structure for the controller based on relevant scientific knowledge (such as Lyapunov stability theory or domain experiences) while considering the tunable elements inside the given structure as the point of parametrisation with neural networks. To optimise a cost represented as a function of the neural network weights, the proposed approach utilises the continuous-time policy gradient method based on adjoint sensitivity analysis as a means for correct and performant computation of cost gradient. This enables combining the stability, robustness, and physical interpretability of an analytically-derived structure for the feedback controller with the representational flexibility and optimised resulting performance provided by machine learning techniques. Such a hybrid paradigm for fixed-structure control synthesis is particularly useful for optimising adaptive nonlinear controllers to achieve improved performance in online operation, an area where the existing theory prevails the design of structure while lacking clear analytical understandings about tuning of the gains and the uncertainty model basis functions that govern the performance characteristics. Numerical experiments on aerospace applications illustrate the utility of the structured nonlinear controller optimisation framework.
    Transfer without Forgetting. (arXiv:2206.00388v1 [cs.LG])
    This work investigates the entanglement between Continual Learning (CL) and Transfer Learning (TL). In particular, we shed light on the widespread application of network pretraining, highlighting that it is itself subject to catastrophic forgetting. Unfortunately, this issue leads to the under-exploitation of knowledge transfer during later tasks. On this ground, we propose Transfer without Forgetting (TwF), a hybrid Continual Transfer Learning approach building upon a fixed pretrained sibling network, which continuously propagates the knowledge inherent in the source domain through a layer-wise loss term. Our experiments indicate that TwF steadily outperforms other CL methods across a variety of settings, averaging a 4.81% gain in Class-Incremental accuracy over a variety of datasets and different buffer sizes.
    DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems. (arXiv:2206.00484v1 [cs.RO])
    Muscle-actuated organisms are capable of learning an unparalleled diversity of dexterous movements despite their vast amount of muscles. Reinforcement learning (RL) on large musculoskeletal models, however, has not been able to show similar performance. We conjecture that ineffective exploration in large overactuated action spaces is a key problem. This is supported by the finding that common exploration noise strategies are inadequate in synthetic examples of overactuated systems. We identify differential extrinsic plasticity (DEP), a method from the domain of self-organization, as being able to induce state-space covering exploration within seconds of interaction. By integrating DEP into RL, we achieve fast learning of reaching and locomotion in musculoskeletal systems, outperforming current approaches in all considered tasks in sample efficiency and robustness.
    Operational Adaptation of DNN Classifiers using Elastic Weight Consolidation. (arXiv:2205.00147v2 [cs.LG] UPDATED)
    Autonomous systems (AS) often use Deep Neural Network (DNN) classifiers to allow them to operate in complex, high dimensional, non-linear, and dynamically changing environments. Due to the complexity of these environments, DNN classifiers may output misclassifications as they experience tasks in their operational environments that were not identified during development. Removing a system from operation and retraining it to include these new tasks becomes economically infeasible as the number of such ASs increases. Additionally, such misclassifications may cause financial loss and safety threats to the AS or to other operators in the environment. In this paper, we propose to reduce such threats by investigating how DNN classifiers can adapt their knowledge to learn new information in the AS's operational environment, using only a limited number of observations encountered sequentially during operation. This allows the AS to adapt to newly encountered information, increasing the AS's classification accuracy and hence its overall reliability. However, retraining DNNs on different observations than used in prior training is known to cause catastrophic forgetting or significant model drift. We investigate how this problem can be controlled by using Elastic Weight Consolidation (EWC) whilst learning from limited new observations. We carry out experiments using original and noisy versions of the MNIST dataset to represent known and new information to DNN classifiers. Results show that using EWC is effective in controlling the process of adaptation to new information, and thus allows for reliable adaptation of ASs to new information in their operational environment.
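    As background for the EWC technique named above, the sketch below computes the standard EWC objective: the new-task loss plus (lambda / 2) times the sum over parameters of F_i * (theta_i - theta_old_i)^2, where F_i is a diagonal Fisher information estimate from the earlier task. The function names, parameter values, and Fisher values are hypothetical; how the paper estimates the Fisher information or sets lambda is not stated in the abstract and is not assumed here.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Quadratic EWC penalty anchoring parameters to their previous-task values.

    fisher holds one diagonal Fisher information value per parameter; a larger
    value means the parameter mattered more for the previously learned task.
    """
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


def ewc_loss(new_task_loss, params, old_params, fisher, lam):
    """Total objective: loss on the new observations plus the EWC penalty."""
    return new_task_loss + ewc_penalty(params, old_params, fisher, lam)


# Hypothetical two-parameter example.
old_params = [0.0, 2.5]   # parameters after training on the known data
params = [1.0, 2.0]       # candidate parameters while adapting
fisher = [2.0, 4.0]       # assumed diagonal Fisher information values
penalty = ewc_penalty(params, old_params, fisher, lam=1.0)
total = ewc_loss(0.3, params, old_params, fisher, lam=1.0)
print(penalty, total)
```

    Parameters with large Fisher values are pulled back strongly toward their old values, which is how the penalty limits forgetting while still allowing adaptation elsewhere.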
    In the Eye of the Beholder: Robust Prediction with Causal User Modeling. (arXiv:2206.00416v1 [cs.LG])
    Accurately predicting the relevance of items to users is crucial to the success of many social platforms. Conventional approaches train models on logged historical data; but recommendation systems, media services, and online marketplaces all exhibit a constant influx of new content -- making relevancy a moving target, to which standard predictive models are not robust. In this paper, we propose a learning framework for relevance prediction that is robust to changes in the data distribution. Our key observation is that robustness can be obtained by accounting for how users causally perceive the environment. We model users as boundedly-rational decision makers whose causal beliefs are encoded by a causal graph, and show how minimal information regarding the graph can be used to contend with distributional changes. Experiments in multiple settings demonstrate the effectiveness of our approach.
    Calibrated Bagging Deep Learning for Image Semantic Segmentation: A Case Study on COVID-19 Chest X-ray Image. (arXiv:2206.00002v1 [eess.IV])
    Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes coronavirus disease 2019 (COVID-19). Imaging tests such as chest X-ray (CXR) and computed tomography (CT) can provide useful information to clinical staff for facilitating a diagnosis of COVID-19 in a more efficient and comprehensive manner. As a breakthrough of artificial intelligence (AI), deep learning has been applied to perform COVID-19 infection region segmentation and disease classification by analyzing CXR and CT data. However, prediction uncertainty of deep learning models for these tasks, which is very important to safety-critical applications like medical image processing, has not been comprehensively investigated. In this work, we propose a novel ensemble deep learning model through integrating bagging deep learning and model calibration to not only enhance segmentation performance, but also reduce prediction uncertainty. The proposed method has been validated on a large dataset that is associated with CXR image segmentation. Experimental results demonstrate that the proposed method can improve the segmentation performance, as well as decrease prediction uncertainties.  ( 2 min )
    Good Intentions: Adaptive Parameter Servers via Intent Signaling. (arXiv:2206.00470v1 [cs.LG])
    Parameter servers (PSs) ease the implementation of distributed training for large machine learning (ML) tasks by providing primitives for shared parameter access. Especially for ML tasks that access parameters sparsely, PSs can achieve high efficiency and scalability. To do so, they employ a number of techniques -- such as replication or relocation -- to reduce communication cost and/or latency of parameter accesses. A suitable choice and parameterization of these techniques is crucial to realize these gains, however. Unfortunately, such choices depend on the task, the workload, and even individual parameters, they often require expensive upfront experimentation, and they are susceptible to workload changes. In this paper, we explore whether PSs can automatically adapt to the workload without any prior tuning. Our goals are to improve usability and to maintain (or even improve) efficiency. We propose (i) a novel intent signaling mechanism that acts as an enabler for adaptivity and naturally integrates into ML tasks, and (ii) a fully adaptive, zero-tuning PS called AdaPS based on this mechanism. Our experimental evaluation suggests that automatic adaptation to the workload is indeed possible: AdaPS matched or outperformed state-of-the-art PSs out of the box.
    Differentiable programming for functional connectomics. (arXiv:2206.00649v1 [q-bio.NC])
    Mapping the functional connectome has the potential to uncover key insights into brain organisation. However, existing workflows for functional connectomics are limited in their adaptability to new data, and principled workflow design is a challenging combinatorial problem. We introduce a new analytic paradigm and software toolbox that implements common operations used in functional connectomics as fully differentiable processing blocks. Under this paradigm, workflow configurations exist as reparameterisations of a differentiable functional that interpolates them. The differentiable program that we envision occupies a niche midway between traditional pipelines and end-to-end neural networks, combining the glass-box tractability and domain knowledge of the former with the amenability to optimisation of the latter. In this preliminary work, we provide a proof of concept for differentiable connectomics, demonstrating the capacity of our processing blocks both to recapitulate canonical knowledge in neuroscience and to make new discoveries in an unsupervised setting. Our differentiable modules are competitive with state-of-the-art methods in problem domains including functional parcellation, denoising, and covariance modelling. Taken together, our results and software demonstrate the promise of differentiable programming for functional connectomics.
    Normalization effects on shallow neural networks and related asymptotic expansions. (arXiv:2011.10487v3 [stat.ML] UPDATED)
    We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalization. We develop an asymptotic expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$, there is no bias-variance trade off, in that both bias and variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the implied normalization by the scaling parameter approaches the mean field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy monotonically improve as the neural network's normalization gets closer to the mean field normalization.
    A Hybrid Architecture for Federated and Centralized Learning. (arXiv:2105.03288v3 [cs.LG] UPDATED)
    Many machine learning tasks rely on centralized learning (CL), which requires the transmission of local datasets from the clients to a parameter server (PS), entailing huge communication overhead. To overcome this, federated learning (FL) has been suggested as a promising tool, wherein the clients send only the model updates to the PS instead of the whole dataset. However, FL demands powerful computational resources from the clients. In practice, not all the clients have sufficient computational resources to participate in training. To address this common scenario, we propose a more efficient approach called hybrid federated and centralized learning (HFCL), wherein only the clients with sufficient resources employ FL, while the remaining ones send their datasets to the PS, which computes the model on behalf of them. Then, the model parameters are aggregated at the PS. To improve the efficiency of dataset transmission, we propose two different techniques: i) increased computation-per-client and ii) sequential data transmission. Notably, the HFCL frameworks outperform FL with up to 20% improvement in the learning accuracy when only half of the clients perform FL while having 50% less communication overhead than CL since all the clients collaborate on the learning process with their datasets.
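    The aggregation step described above can be pictured with the following minimal sketch, which averages per-client parameter vectors at the PS regardless of whether a vector was computed on the client (FL) or by the PS on an uploaded dataset. Weighting each vector by its dataset size is an assumption borrowed from common federated averaging practice, not something the abstract specifies, and all names and numbers are hypothetical.

```python
def aggregate(updates):
    """Average parameter vectors, weighted by the number of samples behind each.

    updates is a list of (num_samples, params) pairs. Whether a pair came from
    a client training locally or from the PS training on that client's uploaded
    data makes no difference to this step.
    """
    total = sum(num_samples for num_samples, _ in updates)
    dim = len(updates[0][1])
    return [
        sum(num_samples * params[i] for num_samples, params in updates) / total
        for i in range(dim)
    ]


# Hypothetical round: one client trains locally, one is handled by the PS.
fl_client_update = (100, [1.0, 0.0])   # computed on the client
ps_side_update = (300, [0.0, 4.0])     # computed at the PS on uploaded data
global_params = aggregate([fl_client_update, ps_side_update])
print(global_params)
```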
    Trap of Feature Diversity in the Learning of MLPs. (arXiv:2112.00980v4 [cs.LG] UPDATED)
    In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs), and we aim to explain the reason for the decrease of feature diversity in the first phase. Specifically, people find that, in the training of MLPs, the training loss does not decrease significantly until the second phase. To this end, we further explore the reason why the diversity of features over different samples keeps decreasing in the first phase, which hurts the optimization of MLPs. We explain such a phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically explain why four typical operations can alleviate the decrease of the feature diversity.
    Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization. (arXiv:2206.00363v1 [cs.LG])
    We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between the privacy and the excess population loss, using an algorithm with a linear time-complexity in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables practitioners to bring their own base optimization algorithm and use it as a black-box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [20] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of the empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach enables us to sidestep the requirement that the base algorithm needs to have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time-complexity. To the best of our knowledge, these are the first linear-time optimal algorithms, up to logarithmic factors, for smooth DP-SMO when the objective is (strongly-)convex-(strongly-)concave. Additionally, based on our flexible framework, we derive a new family of near-linear time algorithms for smooth DP-SCO with optimal privacy-loss trade-offs for a wider range of smoothness parameters compared to previous algorithms.
    Adaptive Online Learning of Quantum States. (arXiv:2206.00220v1 [cs.LG])
    In the fundamental problem of shadow tomography, the goal is to efficiently learn an unknown $d$-dimensional quantum state using projective measurements. However, it is rarely the case that the underlying state remains stationary: changes may occur due to measurements, environmental noise, or an underlying Hamiltonian state evolution. In this paper we adopt tools from adaptive online learning to learn a changing state, giving adaptive and dynamic regret bounds for online shadow tomography that are polynomial in the number of qubits and sublinear in the number of measurements. Our analysis utilizes tools from complex matrix analysis to cope with complex numbers, which may be of independent interest in online learning. In addition, we provide numerical experiments that corroborate our theoretical results.
    Predecessor Features. (arXiv:2206.00303v1 [cs.LG])
    Any reinforcement learning system must be able to identify which past events contributed to observed outcomes, a problem known as credit assignment. A common solution to this problem is to use an eligibility trace to assign credit to a recency-weighted set of experienced events. However, in many realistic tasks, the set of recently experienced events is only one of the many possible action events that could have preceded the current outcome. This suggests that reinforcement learning can be made more efficient by allowing credit assignment to any viable preceding state, rather than only those most recently experienced. Accordingly, we propose "Predecessor Features", an algorithm that achieves this richer form of credit assignment. By maintaining a representation that approximates the expected sum of past occupancies, our algorithm allows temporal difference (TD) errors to be propagated accurately to a larger number of predecessor states than conventional methods, greatly improving learning speed. Our algorithm can also be naturally extended from tabular state representation to feature representations, allowing for increased performance on a wide range of environments. We demonstrate several use cases for Predecessor Features and contrast its performance with other similar approaches.
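    For readers unfamiliar with the conventional baseline mentioned in this abstract, the following is a minimal sketch of tabular TD(lambda) with accumulating eligibility traces. It is not the Predecessor Features algorithm; the function name `td_lambda_episode`, the transition format, and the default hyperparameters are assumptions chosen for illustration only.

```python
def td_lambda_episode(values, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    values:      list of state-value estimates, updated in place
    transitions: list of (state, reward, next_state, done) tuples
                 (hypothetical format, chosen for this sketch)
    """
    traces = [0.0] * len(values)
    for state, reward, next_state, done in transitions:
        # One-step TD error; the bootstrap term is dropped at episode end.
        bootstrap = 0.0 if done else gamma * values[next_state]
        delta = reward + bootstrap - values[state]
        # Mark the current state as eligible for credit.
        traces[state] += 1.0
        for i in range(len(values)):
            # Every state is updated in proportion to its trace,
            # then its trace decays (recency weighting).
            values[i] += alpha * delta * traces[i]
            traces[i] *= gamma * lam
    return values


# Three-state chain 0 -> 1 -> 2 (terminal); reward arrives only on the last step.
chain_values = td_lambda_episode(
    [0.0, 0.0, 0.0],
    [(0, 0.0, 1, False), (1, 1.0, 2, True)],
)
```

    In this toy chain, state 0 receives some credit for the final reward only because it was visited recently, which is the recency-weighted behaviour the abstract contrasts with assigning credit to any viable predecessor state.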
    Reinforcement Learning with Algorithms from Probabilistic Structure Estimation. (arXiv:2103.08241v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms aim to learn optimal decisions in unknown environments through experience of taking actions and observing the rewards gained. In some cases, the environment is not influenced by the actions of the RL agent, in which case the problem can be modeled as a contextual multi-armed bandit and lightweight myopic algorithms can be employed. On the other hand, when the RL agent's actions affect the environment, the problem must be modeled as a Markov decision process and more complex RL algorithms are required which take the future effects of actions into account. Moreover, in practice, it is often unknown from the outset whether or not the agent's actions will impact the environment and it is therefore not possible to determine which RL algorithm is most fitting. In this work, we propose to avoid this difficult decision entirely and incorporate a choice mechanism into our RL framework. Rather than assuming a specific problem structure, we use a probabilistic structure estimation procedure based on a likelihood-ratio (LR) test to make a more informed selection of learning algorithm. We derive a sufficient condition under which myopic policies are optimal, present an LR test for this condition, and derive a bound on the regret of our framework. We provide examples of real-world scenarios where our framework is needed and provide extensive simulations to validate our approach.
    Augmenting Message Passing by Retrieving Similar Graphs. (arXiv:2206.00362v1 [cs.LG])
    Graph Neural Networks (GNNs) are effective tools for graph representation learning. Most GNNs rely on a recursive neighborhood aggregation scheme, named message passing. In this paper, motivated by the success of retrieval-based models, we propose a non-parametric scheme called GraphRetrieval, in which similar training graphs associated with their ground-truth labels are retrieved to be jointly utilized with the input graph representation to complete various graph-based predictive tasks. In particular, we take a well-trained model with its parameters fixed and then we add an adapter based on self-attention with only a few trainable parameters per task to explicitly learn the interaction between an input graph and its retrieved similar graphs. Our experiments on 12 different datasets involving different tasks (classification and regression) show that GraphRetrieval is able to achieve substantial improvements on all twelve datasets compared to three strong GNN baseline models. Our work demonstrates that GraphRetrieval is a promising augmentation for message passing.
    Towards Generalisable Audio Representations for Audio-Visual Navigation. (arXiv:2206.00393v1 [cs.SD])
    In audio-visual navigation (AVN), an intelligent agent needs to navigate to a continuously sounding object in complex 3D environments based on its audio and visual perceptions. While existing methods attempt to improve the navigation performance with precisely designed path planning or intricate task settings, none has improved the model generalisation on unheard sounds with task settings unchanged. We thus propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder, where the sound-agnostic goal-driven latent representations can be learnt from various audio signals of different classes. In addition, we consider two data augmentation strategies to enrich the training sounds. We demonstrate that our designs can be easily integrated into existing AVN frameworks to obtain an immediate performance gain (13.4%$\uparrow$ in SPL on Replica and 12.2%$\uparrow$ in SPL on MP3D). Our project is available at https://AV-GeN.github.io/.
    Ultrahyperbolic Knowledge Graph Embeddings. (arXiv:2206.00449v1 [cs.LG])
    Recent knowledge graph (KG) embeddings have been advanced by hyperbolic geometry due to its superior capability for representing hierarchies. The topological structures of real-world KGs, however, are rather heterogeneous, i.e., a KG is composed of multiple distinct hierarchies and non-hierarchical graph structures. Therefore, a homogeneous (either Euclidean or hyperbolic) geometry is not sufficient for fairly representing such heterogeneous structures. To capture the topological heterogeneity of KGs, we present an ultrahyperbolic KG embedding (UltraE) in an ultrahyperbolic (or pseudo-Riemannian) manifold that seamlessly interleaves hyperbolic and spherical manifolds. In particular, we model each relation as a pseudo-orthogonal transformation that preserves the pseudo-Riemannian bilinear form. The pseudo-orthogonal transformation is decomposed into various operators (i.e., circular rotations, reflections and hyperbolic rotations), allowing for simultaneously modeling heterogeneous structures as well as complex relational patterns. Experimental results on three standard KGs show that UltraE outperforms previous Euclidean- and hyperbolic-based approaches.
    Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning. (arXiv:2206.00518v1 [cs.LG])
    In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it to the RL agent often interferes with RL training and degrades sample efficiency. Meanwhile, the agent is forgetful of the prior due to the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method to inject the consistency prior at any time (even after RL), and a simple yet efficient framework to automatically schedule the distillation. Specifically, the proposed framework first focuses on mastering train environments regardless of generalization by adaptively deciding which augmentation, if any, to use for the training. After this, we add the distillation to extract the remaining benefits for generalization from all the augmentations, which requires no additional new samples. In our experiments, we demonstrate the utility of the proposed framework, in particular of postponing the augmentation to the end of RL training.
    Convolutional-Recurrent Neural Network Proxy for Robust Optimization and Closed-Loop Reservoir Management. (arXiv:2203.07524v2 [cs.LG] UPDATED)
    Production optimization under geological uncertainty is computationally expensive, as a large number of well control schedules must be evaluated over multiple geological realizations. In this work, a convolutional-recurrent neural network (CNN-RNN) proxy model is developed to predict well-by-well oil and water rates, for given time-varying well bottom-hole pressure (BHP) schedules, for each realization in an ensemble. This capability enables the estimation of the objective function and nonlinear constraint values required for robust optimization. The proxy model represents an extension of a recently developed long short-term memory (LSTM) RNN proxy designed to predict well rates for a single geomodel. A CNN is introduced here to process permeability realizations, and this provides the initial states for the RNN. The CNN-RNN proxy is trained using simulation results for 300 different sets of BHP schedules and permeability realizations. We demonstrate proxy accuracy for oil-water flow through multiple realizations of 3D multi-Gaussian permeability models. The proxy is then incorporated into a closed-loop reservoir management (CLRM) workflow, where it is used with particle swarm optimization and a filter-based method for nonlinear constraint satisfaction. History matching is achieved using an adjoint-gradient-based procedure. The proxy model is shown to perform well in this setting for five different (synthetic) 'true' models. Improved net present value along with constraint satisfaction and uncertainty reduction are observed with CLRM. For the robust production optimization steps, the proxy provides O(100) runtime speedup over simulation-based optimization.
    Width is Less Important than Depth in ReLU Neural Networks. (arXiv:2202.03841v2 [cs.LG] UPDATED)
    We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.
    Task-Specific Expert Pruning for Sparse Mixture-of-Experts. (arXiv:2206.00277v1 [cs.LG])
    The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to deploy in cloud or mobile environments. The inference of MoE requires expert parallelism, which is not hardware-friendly and is communication-expensive. Especially for resource-limited downstream tasks, such a sparse structure has to sacrifice a lot of computing efficiency for limited performance gains. In this work, we observe that most experts contribute very little to MoE fine-tuning and inference. We further propose a general method to progressively drop the non-professional experts for the target downstream task, which preserves the benefits of MoE while reducing the MoE model into one single-expert dense model. Our experiments reveal that the fine-tuned single-expert model could preserve 99.3% of the benefits from MoE across six different types of tasks while enjoying 2x inference speed with no communication cost.
    Adversarial Attacks on Gaussian Process Bandits. (arXiv:2110.08449v2 [stat.ML] UPDATED)
    Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. This paper studies this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from theoretical and practical perspectives. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some target region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we test our attacks' effectiveness on a diverse range of objective functions.
    An $\alpha$-No-Regret Algorithm For Graphical Bilinear Bandits. (arXiv:2206.00466v1 [cs.LG])
    We propose the first regret-based approach to the Graphical Bilinear Bandits problem, where $n$ agents in a graph play a stochastic bilinear bandit game with each of their neighbors. This setting reveals a combinatorial NP-hard problem that prevents the use of any existing regret-based algorithm in the (bi-)linear bandit literature. In this paper, we fill this gap and present the first regret-based algorithm for graphical bilinear bandits using the principle of optimism in the face of uncertainty. Theoretical analysis of this new method yields an upper bound of $\tilde{O}(\sqrt{T})$ on the $\alpha$-regret and evidences the impact of the graph structure on the rate of convergence. Finally, we show through various experiments the validity of our approach.
    Learning a performance metric of Buchberger's algorithm. (arXiv:2106.03676v2 [math.AC] UPDATED)
    What can be (machine) learned about the complexity of Buchberger's algorithm? Given a system of polynomials, Buchberger's algorithm computes a Gröbner basis of the ideal these polynomials generate using an iterative procedure based on multivariate long division. The runtime of each step of the algorithm is typically dominated by a series of polynomial additions, and the total number of these additions is a hardware independent performance metric that is often used to evaluate and optimize various implementation choices. In this work we attempt to predict, using just the starting input, the number of polynomial additions that take place during one run of Buchberger's algorithm. Good predictions are useful for quickly estimating difficulty and understanding what features make Gröbner basis computation hard. Our features and methods could also be used for value models in the reinforcement learning approach to optimize Buchberger's algorithm introduced in [Peifer, Stillman, and Halpern-Leistner, 2020]. We show that a multiple linear regression model built from a set of easy-to-compute ideal generator statistics can predict the number of polynomial additions somewhat well, better than an uninformed model, and better than regression models built on some intuitive commutative algebra invariants that are more difficult to compute. We also train a simple recursive neural network that outperforms these linear models. Our work serves as a proof of concept, demonstrating that predicting the number of polynomial additions in Buchberger's algorithm is a feasible problem from the point of view of machine learning.
    Fine Timing and Frequency Synchronization for MIMO-OFDM: An Extreme Learning Approach. (arXiv:2007.09248v5 [eess.SP] UPDATED)
    Multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) is a key technology component in the evolution towards cognitive radio (CR) in next-generation communication in which the accuracy of timing and frequency synchronization significantly impacts the overall system performance. In this paper, we propose a novel scheme leveraging extreme learning machine (ELM) to achieve high-precision synchronization. Specifically, exploiting the preamble signals with synchronization offsets, two ELMs are incorporated into a traditional MIMO-OFDM system to estimate both the residual symbol timing offset (RSTO) and the residual carrier frequency offset (RCFO). The simulation results show that the performance of the proposed ELM-based synchronization scheme is superior to the traditional method under both additive white Gaussian noise (AWGN) and frequency selective fading channels. Furthermore, compared with existing machine learning based techniques, the proposed method shows outstanding performance without the requirement of perfect channel state information (CSI) and prohibitive computational complexity. Finally, the proposed method is robust in terms of the choice of channel parameters (e.g., number of paths) and also in terms of "generalization ability" from a machine learning standpoint.
    Differentially Private Shapley Values for Data Evaluation. (arXiv:2206.00511v1 [cs.LG])
    The Shapley value has been proposed as a solution to many applications in machine learning, including for equitable valuation of data. Shapley values are computationally expensive and involve the entire dataset. The query for a point's Shapley value can also compromise the statistical privacy of other data points. We observe that in machine learning problems such as empirical risk minimization, and in many learning algorithms (such as those with uniform stability), a diminishing returns property holds, where marginal benefit per data point decreases rapidly with data sample size. Based on this property, we propose a new stratified approximation method called the Layered Shapley Algorithm. We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small sized ($O(\log n)$) coalitions to achieve the results with guaranteed probabilistic accuracy, and can be modified to incorporate differential privacy. Experimental results show that the algorithm correctly identifies high-value data points that improve validation accuracy, and that the differentially private evaluations preserve approximate ranking of data.
    Proximally Sensitive Error for Anomaly Detection and Feature Learning. (arXiv:2206.00506v1 [cs.CV])
    Mean squared error (MSE) is one of the most widely used metrics for expressing differences between multi-dimensional entities, including images. However, MSE is not locally sensitive as it does not take into account the spatial arrangement of the (pixel) differences, which matters for structured data types like images. Such spatial arrangements carry information about the source of the differences; therefore, an error function that also incorporates the location of errors can lead to a more meaningful distance measure. We introduce Proximally Sensitive Error (PSE), through which we suggest that a regional emphasis in the error measure can 'highlight' semantic differences between images over syntactic/random deviations. We demonstrate that this emphasis can be leveraged for the task of anomaly/occlusion detection. We further explore its utility as a loss function to help a model focus on learning representations of semantic objects instead of minimizing syntactic reconstruction noise.
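    The claim that MSE ignores the spatial arrangement of pixel differences can be checked with a small sketch. The toy 4x4 "images" and all names below are assumptions made for illustration; the code does not implement PSE, whose definition is not given in the abstract.

```python
def mse(image_a, image_b):
    """Mean squared error between two equally sized 2D lists of floats."""
    flat_a = [v for row in image_a for v in row]
    flat_b = [v for row in image_b for v in row]
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


def with_errors(positions, size=4):
    """Return a size x size zero image with unit errors at the given (row, col) positions."""
    image = [[0.0] * size for _ in range(size)]
    for row, col in positions:
        image[row][col] = 1.0
    return image


reference = with_errors([])
# Four erroneous pixels forming one contiguous 2x2 patch (e.g. an occlusion).
clustered = with_errors([(0, 0), (0, 1), (1, 0), (1, 1)])
# Four erroneous pixels spread over the corners (e.g. isolated noise).
scattered = with_errors([(0, 0), (0, 3), (3, 0), (3, 3)])

mse_clustered = mse(reference, clustered)
mse_scattered = mse(reference, scattered)
```

    Both comparisons give the same MSE because the metric only sums per-pixel squared differences, so a contiguous patch and scattered noise of equal magnitude are indistinguishable to it.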
    Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis. (arXiv:2206.00632v1 [math.OC])
    When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain our results, we study the power spectral density of the stochastic gradient noise sequences. Our analysis extends beyond SGD to SGD with momentum and to the stochastic Nesterov's accelerated gradient method. We perform experiments on quadratic objective functions to test the validity of our approximation and the correctness of our findings.
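    The three sampling schemes named in this abstract differ only in how the index of the next component function is drawn. The sketch below illustrates that difference under the usual definitions; the function name `index_stream`, the scheme labels, and the seed handling are assumptions for illustration, and no variance analysis is attempted.

```python
import random


def index_stream(n, epochs, scheme, seed=0):
    """Return the component-function indices visited over `epochs` passes of length n.

    scheme: "sgd" samples with replacement at every step,
            "rr"  draws a fresh permutation each epoch (random reshuffling),
            "so"  draws one permutation and reuses it every epoch (shuffle-once).
    """
    rng = random.Random(seed)
    fixed_perm = list(range(n))
    rng.shuffle(fixed_perm)  # only used by the shuffle-once scheme
    order = []
    for _ in range(epochs):
        if scheme == "sgd":
            order.extend(rng.randrange(n) for _ in range(n))
        elif scheme == "rr":
            perm = list(range(n))
            rng.shuffle(perm)
            order.extend(perm)
        elif scheme == "so":
            order.extend(fixed_perm)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return order


sgd_order = index_stream(5, 3, "sgd")
rr_order = index_stream(5, 3, "rr")
so_order = index_stream(5, 3, "so")
```

    Under "rr" and "so" every index appears exactly once per epoch, whereas "sgd" may repeat or skip indices within an epoch.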
    Verified Probabilistic Policies for Deep Reinforcement Learning. (arXiv:2201.03698v2 [cs.AI] UPDATED)
    Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent's interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy's execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.
    The Representation Jensen-Rényi Divergence. (arXiv:2112.01583v4 [cs.LG] UPDATED)
    We introduce a divergence measure between data distributions based on operators in reproducing kernel Hilbert spaces defined by kernels. The empirical estimator of the divergence is computed using the eigenvalues of positive definite Gram matrices that are obtained by evaluating the kernel over pairs of data points. The new measure shares similar properties to Jensen-Shannon divergence. Convergence of the proposed estimators follows from concentration results based on the difference between the ordered spectrum of the Gram matrices and the integral operators associated with the population quantities. The proposed measure of divergence avoids the estimation of the probability distribution underlying the data. Numerical experiments involving comparing distributions and applications to sampling unbalanced data for classification show that the proposed divergence can achieve state-of-the-art results.
    A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with Feedback Graphs. (arXiv:2206.00557v1 [cs.LG])
    We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set. We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments. The bound against oblivious adversaries is $\tilde{O} (\sqrt{\alpha T})$, where $T$ is the time horizon and $\alpha$ is the independence number of the feedback graph. The bound against stochastic environments is $O\big( (\ln T)^2 \max_{S\in \mathcal I(G)} \sum_{i \in S} \Delta_i^{-1}\big)$ where $\mathcal I(G)$ is the family of all independent sets in a suitably defined undirected version of the graph and $\Delta_i$ are the suboptimality gaps. The algorithm combines ideas from the EXP3++ algorithm for stochastic and adversarial bandits and the EXP3.G algorithm for feedback graphs with a novel exploration scheme. The scheme, which exploits the structure of the graph to reduce exploration, is key to obtaining best-of-both-worlds guarantees with feedback graphs. We also extend our algorithm and results to a setting where the feedback graphs are allowed to change over time.
    DeepCluE: Enhanced Image Clustering via Multi-layer Ensembles in Deep Neural Networks. (arXiv:2206.00359v1 [cs.CV])
    Deep clustering has recently emerged as a promising technique for complex image clustering. Despite the significant progress, previous deep clustering works mostly tend to construct the final clustering by utilizing a single layer of representation, e.g., by performing $K$-means on the last fully-connected layer or by associating some clustering loss to a specific layer. However, few of them have considered the possibilities and potential benefits of jointly leveraging multi-layer representations for enhancing the deep clustering performance. In light of this, this paper presents a Deep Clustering via Ensembles (DeepCluE) approach, which bridges the gap between deep clustering and ensemble clustering by harnessing the power of multiple layers in deep neural networks. Particularly, we utilize a weight-sharing convolutional neural network as the backbone, which is trained with both the instance-level contrastive learning (via an instance projector) and the cluster-level contrastive learning (via a cluster projector) in an unsupervised manner. Thereafter, multiple layers of feature representations are extracted from the trained network, upon which a set of diversified base clusterings can be generated via a highly efficient clusterer. Then, the reliability of the clusters in multiple base clusterings is automatically estimated by exploiting an entropy-based criterion, based on which the multiple base clusterings are further formulated into a weighted-cluster bipartite graph. By partitioning this bipartite graph via transfer cut, the final image clustering result can therefore be obtained. Experimental results on six image datasets confirm the advantages of our DeepCluE approach over the state-of-the-art deep clustering approaches.
    Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria. (arXiv:2206.00137v1 [cs.LG])
    Although many fairness criteria have been proposed to ensure that machine learning algorithms do not exhibit or amplify our existing social biases, these algorithms are trained on datasets that can themselves be statistically biased. In this paper, we investigate the robustness of a number of existing (demographic) fairness criteria when the algorithm is trained on biased data. We consider two forms of dataset bias: errors by prior decision makers in the labeling process, and errors in measurement of the features of disadvantaged individuals. We analytically show that some constraints (such as Demographic Parity) can remain robust when facing certain statistical biases, while others (such as Equalized Odds) are significantly violated if trained on biased data. We also analyze the sensitivity of these criteria and the decision maker's utility to biases. We provide numerical experiments based on three real-world datasets (the FICO, Adult, and German credit score datasets) supporting our analytical findings. Our findings present an additional guideline for choosing among existing fairness criteria, or for proposing new criteria, when available datasets may be biased.
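    As a reminder of what the two criteria named in this abstract measure, the sketch below computes violation gaps for Demographic Parity and Equalized Odds under their standard definitions for binary predictions and two groups. The function names and the toy labels are assumptions for illustration; the example shows only that the two criteria can disagree, not the robustness analysis of the paper.

```python
def positive_rate(values):
    """Fraction of 1s in a list of 0/1 predictions (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


def demographic_parity_gap(y_pred, group):
    """|P(pred=1 | group=0) - P(pred=1 | group=1)|."""
    rate_0 = positive_rate([p for p, g in zip(y_pred, group) if g == 0])
    rate_1 = positive_rate([p for p, g in zip(y_pred, group) if g == 1])
    return abs(rate_0 - rate_1)


def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in P(pred=1 | true label), over both labels.

    Label 1 compares true positive rates; label 0 compares false positive rates.
    """
    gaps = []
    for label in (0, 1):
        rate_0 = positive_rate(
            [p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == label]
        )
        rate_1 = positive_rate(
            [p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == label]
        )
        gaps.append(abs(rate_0 - rate_1))
    return max(gaps)


# Toy data (hypothetical): both groups receive positives at the same overall rate,
# but the predictor is accurate for group 1 and uninformative for group 0.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

dp_gap = demographic_parity_gap(y_pred, group)
eo_gap = equalized_odds_gap(y_true, y_pred, group)
```

    On this toy data the Demographic Parity gap is zero while the Equalized Odds gap is not, because Equalized Odds conditions on the true label, which is also why errors in the labels affect it directly.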
    Learning Instance-Specific Data Augmentations. (arXiv:2206.00051v1 [cs.LG])
    Existing data augmentation methods typically assume independence between transformations and inputs: they use the same transformation distribution for all input instances. We explain why this can be problematic and propose InstaAug, a method for automatically learning input-specific augmentations from data. This is achieved by introducing an augmentation module that maps an input to a distribution over transformations. This is simultaneously trained alongside the base model in a fully end-to-end manner using only the training data. We empirically demonstrate that InstaAug learns meaningful augmentations for a wide range of transformation classes, which in turn provides better performance on supervised and self-supervised tasks compared with augmentations that assume input-transformation independence.
    Mario Plays on a Manifold: Generating Functional Content in Latent Space through Differential Geometry. (arXiv:2206.00106v1 [cs.LG])
    Deep generative models can automatically create content of diverse types. However, there are no guarantees that such content will satisfy the criteria necessary to present it to end-users and be functional, e.g. the generated levels could be unsolvable or incoherent. In this paper we study this problem from a geometric perspective, and provide a method for reliable interpolation and random walks in the latent spaces of Categorical VAEs based on Riemannian geometry. We test our method with "Super Mario Bros" and "The Legend of Zelda" levels, and against simpler baselines inspired by current practice. Results show that the geometry we propose is better able to interpolate and sample, reliably staying closer to parts of the latent space that decode to playable content.  ( 2 min )
    CASSOCK: Viable Backdoor Attacks against DNN in The Wall of Source-Specific Backdoor Defences. (arXiv:2206.00145v1 [cs.CR])
    Backdoor attacks have been a critical threat to deep neural networks (DNNs). However, most existing countermeasures focus on source-agnostic backdoor attacks (SABAs) and fail to defeat source-specific backdoor attacks (SSBAs). Compared to an SABA, an SSBA activates a backdoor when an input from attacker-chosen class(es) is stamped with an attacker-specified trigger, making it stealthier and thus able to evade most existing backdoor mitigations. Nonetheless, existing SSBAs trade off attack success rate (ASR: a backdoor is activated by a trigger input from a source class, as expected) against false positive rate (FPR: a backdoor is activated unexpectedly by a trigger input from a non-source class). Significantly, they can still be effectively detected by the state-of-the-art (SOTA) countermeasures targeting SSBAs. This work overcomes the efficiency and effectiveness deficiencies of existing SSBAs, thus bypassing the SOTA defences. The key insight is to construct the desired poisoned and cover data during backdoor training by characterising SSBAs in depth. Both kinds of data are samples with triggers: the cover data from non-source class(es) hold ground-truth labels, while the poisoned data from source class(es) hold target labels. Therefore, two cover/poisoned data enhancements, coined CASSOCK, are developed from trigger style and content, respectively. First, we leverage trigger patterns with discrepant transparency to craft cover/poisoned data, enforcing triggers with heterogeneous sensitivity on different classes. The second enhancement chooses the target class features as triggers to craft these samples, heavily entangling trigger features with the target class. Compared with existing SSBAs, CASSOCK-based attacks have higher ASR and low FPR on four popular tasks: MNIST, CIFAR10, GTSRB, and LFW. More importantly, CASSOCK has effectively evaded three defences (SCAn, Februus and extended Neural Cleanse) that already defeat existing SSBAs effectively.  ( 2 min )
    Easy Variational Inference for Categorical Models via an Independent Binary Approximation. (arXiv:2206.00093v1 [stat.ML])
    We pursue tractable Bayesian analysis of generalized linear models (GLMs) for categorical data. Thus far, GLMs are difficult to scale to more than a few dozen categories due to non-conjugacy or strong posterior dependencies when using conjugate auxiliary variable methods. We define a new class of GLMs for categorical data called categorical-from-binary (CB) models. Each CB model has a likelihood that is bounded by the product of binary likelihoods, suggesting a natural posterior approximation. This approximation makes inference straightforward and fast; using well-known auxiliary variables for probit or logistic regression, the product of binary models admits conjugate closed-form variational inference that is embarrassingly parallel across categories and invariant to category ordering. Moreover, an independent binary model simultaneously approximates multiple CB models. Bayesian model averaging over these can improve the quality of the approximation for any given dataset. We show that our approach scales to thousands of categories, outperforming posterior estimation competitors like Automatic Differentiation Variational Inference (ADVI) and No U-Turn Sampling (NUTS) in the time required to achieve fixed prediction quality.  ( 2 min )
    A Theoretical Framework for Inference Learning. (arXiv:2206.00164v1 [cs.NE])
    Backpropagation (BP) is the most successful and widely used algorithm in deep learning. However, the computations required by BP are challenging to reconcile with known neurobiology. This difficulty has stimulated interest in more biologically plausible alternatives to BP. One such algorithm is the inference learning algorithm (IL). IL has close connections to neurobiological models of cortical function and has achieved equal performance to BP on supervised learning and auto-associative tasks. In contrast to BP, however, the mathematical foundations of IL are not well-understood. Here, we develop a novel theoretical framework for IL. Our main result is that IL closely approximates an optimization method known as implicit stochastic gradient descent (implicit SGD), which is distinct from the explicit SGD implemented by BP. Our results further show how the standard implementation of IL can be altered to better approximate implicit SGD. Our novel implementation considerably improves the stability of IL across learning rates, which is consistent with our theory, as a key property of implicit SGD is its stability. We provide extensive simulation results that further support our theoretical interpretations and also demonstrate IL achieves quicker convergence when trained with small mini-batches while matching the performance of BP for large mini-batches.  ( 2 min )
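    The distinction between explicit and implicit SGD drawn in the abstract above can be illustrated on a toy problem. The sketch below is not the paper's IL algorithm; it assumes a one-dimensional quadratic loss 0.5 * a * (w - b)**2 (chosen here only because the implicit update then has a closed form) and compares the two update rules at a small and a large learning rate. All function names are hypothetical.

```python
def explicit_step(w, lr, a, b):
    """Explicit gradient step on the assumed toy loss 0.5 * a * (w - b)**2."""
    grad = a * (w - b)  # gradient evaluated at the current point w
    return w - lr * grad


def implicit_step(w, lr, a, b):
    """Implicit step: the gradient is evaluated at the *next* point.

    Solving w_next = w - lr * a * (w_next - b) for w_next gives the
    closed form below (valid for this quadratic toy loss only).
    """
    return (w + lr * a * b) / (1.0 + lr * a)


def run(step, w0, lr, a, b, n_steps):
    """Apply the given update rule n_steps times starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = step(w, lr, a, b)
    return w


if __name__ == "__main__":
    # Assumed toy setting: minimum at b = 2, curvature a = 1.
    for lr in (0.5, 3.0):
        w_exp = run(explicit_step, 0.0, lr, 1.0, 2.0, 20)
        w_imp = run(implicit_step, 0.0, lr, 1.0, 2.0, 20)
        print(f"lr={lr}: explicit -> {w_exp:.4g}, implicit -> {w_imp:.4g}")
```

    For this toy loss the explicit update multiplies the error by (1 - lr * a) at each step, so it diverges once lr * a exceeds 2, whereas the implicit update multiplies it by 1 / (1 + lr * a), which is below 1 for any positive learning rate.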
    Interpretable Deep Learning Classifier by Detection of Prototypical Parts on Kidney Stones Images. (arXiv:2206.00252v1 [cs.CV])
    Identifying the type of kidney stones can allow urologists to determine their formation cause, improving the early prescription of appropriate treatments to diminish future relapses. However, the associated ex-vivo diagnosis (known as morpho-constitutional analysis, MCA) is currently time-consuming, expensive, and requires a great deal of experience, as it involves a visual analysis component that is highly operator-dependent. Recently, machine learning methods have been developed for in-vivo endoscopic stone recognition. Shallow methods have been demonstrated to be reliable and interpretable but exhibit low accuracy, while deep learning-based methods yield high accuracy but are not explainable. However, high-stakes decisions require understandable computer-aided diagnosis (CAD) to suggest a course of action based on reasonable evidence, rather than merely prescribe one. Herein, we investigate means for learning part-prototypes (PPs) that enable interpretable models. Our proposal suggests a classification for a kidney stone patch image and provides explanations similar to those used in the MCA method.
    Discovering the Hidden Vocabulary of DALLE-2. (arXiv:2206.00169v1 [cs.LG])
    We discover that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. For example, it seems that "Apoploe vesrreaitais" means birds and "Contarra ccetnxniams luryca tanniounons" (sometimes) means bugs or pests. We find that these prompts are often consistent in isolation and sometimes also in combination. We present our black-box method to discover words that seem random but have some correspondence to visual concepts. This creates important security and interpretability challenges.
    IGLU Gridworld: Simple and Fast Environment for Embodied Dialog Agents. (arXiv:2206.00142v1 [cs.LG])
    We present the IGLU Gridworld: a reinforcement learning environment for building and evaluating language conditioned embodied agents in a scalable way. The environment features visual agent embodiment, interactive learning through collaboration, language conditioned RL, and a combinatorially hard task space (3D block building).  ( 2 min )
    Privacy for Free: How does Dataset Condensation Help Privacy? (arXiv:2206.00240v1 [cs.CR])
    To prevent unintentional data leakage, the research community has resorted to data generators that can produce differentially private data for model training. However, for the sake of data privacy, existing solutions suffer from either expensive training cost or poor generalization performance. Therefore, we raise the question of whether training efficiency and privacy can be achieved simultaneously. In this work, we identify for the first time that dataset condensation (DC), which was originally designed for improving training efficiency, is also a better solution than traditional data generators for private data generation, thus providing privacy for free. To demonstrate the privacy benefit of DC, we build a connection between DC and differential privacy, and theoretically prove on linear feature extractors (and then extend to non-linear feature extractors) that the existence of one sample has limited impact ($O(m/n)$) on the parameter distribution of networks trained on $m$ samples synthesized from $n$ ($n \gg m$) raw samples by DC. We also empirically validate the visual privacy and membership privacy of DC-synthesized data by launching both loss-based and state-of-the-art likelihood-based membership inference attacks. We envision this work as a milestone for data-efficient and privacy-preserving machine learning.
    Bounding Membership Inference. (arXiv:2202.12232v2 [cs.LG] UPDATED)
    Differential Privacy (DP) is the de facto standard for reasoning about the privacy guarantees of a training algorithm. Despite the empirical observation that DP reduces the vulnerability of models to existing membership inference (MI) attacks, a theoretical underpinning as to why this is the case is largely missing in the literature. In practice, this means that models need to be trained with DP guarantees that greatly decrease their accuracy. In this paper, we provide a tighter bound on the positive accuracy (i.e., attack precision) of any MI adversary when a training algorithm provides $\epsilon$-DP or $(\epsilon, \delta)$-DP. Our bound informs the design of a novel privacy amplification scheme, where an effective training set is sub-sampled from a larger set prior to the beginning of training, to greatly reduce the bound on MI accuracy. As a result, our scheme enables DP users to employ looser DP guarantees when training their model to limit the success of any MI adversary; this ensures that the model's accuracy is less impacted by the privacy guarantee. Finally, we discuss implications of our MI bound on the field of machine unlearning.
    Automatic differentiation of nonsmooth iterative algorithms. (arXiv:2206.00457v1 [math.OC])
    Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivatives? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives, which has proved useful in understanding nonsmooth AD. We characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. This has various consequences, in particular almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization problems with forward-backward, Douglas-Rachford and Alternating Direction Method of Multipliers algorithms as well as the Heavy-Ball method.
    Deep learning pipeline for image classification on mobile phones. (arXiv:2206.00105v1 [eess.IV])
    This article proposes and documents a machine-learning framework and tutorial for classifying images using mobile phones. Compared to computers, the performance of deep learning models degrades when they are deployed on a mobile phone, so a systematic approach is required to find a model that performs optimally on both computers and mobile phones. By following the proposed pipeline, which consists of various computational tools, simple procedural recipes, and technical considerations, one can bring the power of deep learning medical image classification to mobile devices, potentially unlocking new domains of applications. The pipeline is demonstrated on four different publicly available datasets: COVID X-rays, COVID CT scans, leaves, and colorectal cancer. We used two application development frameworks, TensorFlow Lite (real-time testing) and Flutter (digital image testing), to test the proposed pipeline. We found that transferring deep learning models to a mobile phone is limited by hardware and that classification accuracy drops. To address this issue, we proposed this pipeline to find an optimized model for mobile phones. Finally, we discuss additional applications and computational concerns related to deploying deep-learning models on phones, including real-time analysis and image preprocessing. We believe the associated documentation and code can help physicians and medical experts develop medical image classification applications for distribution.  ( 2 min )
    Universal Early Warning Signals of Phase Transitions in Climate Systems. (arXiv:2206.00060v1 [physics.ao-ph])
    The potential for complex systems to exhibit tipping points in which an equilibrium state undergoes a sudden and potentially irreversible shift is well established, but prediction of these events using standard forecast modeling techniques is quite difficult. This has led to the development of an alternative suite of methods that seek to identify signatures of critical phenomena in data, which are expected to occur in advance of many classes of dynamical bifurcation. Crucially, the manifestations of these critical phenomena are generic across a variety of systems, meaning that data-intensive deep learning methods can be trained on (abundant) synthetic data and plausibly prove effective when transferred to (more limited) empirical data sets. This paper provides a proof of concept for this approach as applied to lattice phase transitions: a deep neural network trained exclusively on 2D Ising model phase transitions is tested on a number of real and simulated climate systems with considerable success. Its accuracy frequently surpasses that of conventional statistical indicators, with performance shown to be consistently improved by the inclusion of spatial indicators. Tools such as this may offer valuable insight into climate tipping events, as remote sensing measurements provide increasingly abundant data on complex geospatially-resolved Earth systems.  ( 2 min )
    Provably and Practically Efficient Neural Contextual Bandits. (arXiv:2206.00099v1 [stat.ML])
    We consider the neural contextual bandit problem. In contrast to existing work, which primarily focuses on ReLU neural nets, we consider a general set of smooth activation functions. Under this more general setting, (i) we derive non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding neural tangent kernel, and (ii) we propose an algorithm with a provably sublinear regret bound that is also efficient in the finite regime, as demonstrated by empirical studies. The non-asymptotic error bounds may be of broader interest as a tool to establish the relation between the smoothness of the activation functions in neural contextual bandits and the smoothness of the kernels in kernel bandits.  ( 2 min )
    Are classical neural networks quantum? (arXiv:2206.00005v1 [cs.LG])
    Neural networks are being used to improve the probing of the state spaces of many-particle systems, as approximations to wavefunctions, in order to avoid the recurring sign problem of quantum Monte Carlo. One may ask whether the usual classical neural networks have some actual hidden quantum properties that make them such suitable tools for a highly coupled quantum problem. I discuss here what makes a system quantum and to what extent we can interpret a neural network as having quantum remnants.  ( 2 min )
    Semantically-enhanced Topic Recommendation System for Software Projects. (arXiv:2206.00085v1 [cs.SE])
    Software-related platforms have enabled their users to collaboratively label software entities with topics. Tagging software repositories with relevant topics can be exploited for facilitating various downstream tasks. For instance, a correct and complete set of topics assigned to a repository can increase its visibility. Consequently, this improves the outcome of tasks such as browsing, searching, navigation, and organization of repositories. Unfortunately, assigned topics are usually highly noisy, and some repositories do not have well-assigned topics. Thus, there have been efforts to recommend topics for software projects; however, the semantic relationships among these topics have not been exploited so far. We propose two recommender models for tagging software projects that incorporate the semantic relationships among topics. Our approach has two main phases. (1) We first take a collaborative approach to curate a dataset of quality topics specifically for the domain of software engineering and development. We also enrich this data with the semantic relationships among these topics and encapsulate them in a knowledge graph we call SED-KGraph. (2) We then build two recommender systems. The first one operates based only on the list of original topics assigned to a repository and the relationships specified in our knowledge graph. The second predictive model, however, assumes there are no topics available for a repository, hence it proceeds to predict the relevant topics based on both the textual information of a software project and SED-KGraph. We built SED-KGraph in a crowd-sourced project with 170 contributors from both academia and industry. The experiment results indicate that our solutions outperform baselines that neglect the semantic relationships among topics by at least 25% and 23% in terms of ASR and MAP metrics.  ( 2 min )
    Extensive Study of Multiple Deep Neural Networks for Complex Random Telegraph Signals. (arXiv:2206.00086v1 [physics.app-ph])
    Time-fluctuating signals are ubiquitous and diverse in many physical, chemical, and biological systems, among which random telegraph signals (RTSs) refer to a series of instantaneous switching events between two discrete levels arising from single-particle movements. Reliable RTS analyses are a crucial prerequisite for identifying underlying mechanisms related to performance sensitivity. When numerous levels partake, complex patterns of multilevel RTSs occur, making their quantitative analysis exponentially difficult, so systematic approaches have proved elusive. Here, we present a three-step analysis protocol via progressive knowledge transfer, where the outputs of an early step are passed on to a subsequent step. In particular, to quantify complex RTSs, we build three deep neural network architectures that can process temporal data well and demonstrate the model accuracy extensively with a large dataset of different RTS types affected by controlled background noise size. Our protocol offers structured schemes to quantify complex RTSs from which meaningful interpretation and inference can ensue.  ( 2 min )
    End-to-end Optimization of Machine Learning Prediction Queries. (arXiv:2206.00136v1 [cs.DB])
    Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored. We present Raven, a production-ready system for optimizing prediction queries. Raven follows the enterprise architectural trend of collocating data and ML runtimes. It relies on a unified intermediate representation that captures both data and ML operators in a single graph structure to unlock two families of optimizations. First, it employs logical optimizations that pass information between the data part (and the properties of the underlying data) and the ML part to optimize each other. Second, it introduces logical-to-physical transformations that allow operators to be executed on different runtimes (relational, ML, and DNN) and hardware (CPU, GPU). Novel data-driven optimizations determine the runtime to be used for each part of the query to achieve optimal performance. Our evaluation shows that Raven improves performance of prediction queries on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems.  ( 2 min )
    Communication-efficient distributed eigenspace estimation with arbitrary node failures. (arXiv:2206.00127v1 [stat.ML])
    We develop an eigenspace estimation algorithm for distributed environments with arbitrary node failures, where a subset of computing nodes can return structurally valid but otherwise arbitrarily chosen responses. Notably, this setting encompasses several important scenarios that arise in distributed computing and data-collection environments such as silent/soft errors, outliers or corrupted data at certain nodes, and adversarial responses. Our estimator builds upon and matches the performance of a recently proposed non-robust estimator up to an additive $\tilde{O}(\sigma \sqrt{\alpha})$ error, where $\sigma^2$ is the variance of the existing estimator and $\alpha$ is the fraction of corrupted nodes.  ( 2 min )
    Generative Models with Information-Theoretic Protection Against Membership Inference Attacks. (arXiv:2206.00071v1 [cs.LG])
    Deep generative models, such as Generative Adversarial Networks (GANs), synthesize diverse high-fidelity data samples by estimating the underlying distribution of high dimensional data. Despite their success, GANs may disclose private information from the data they are trained on, making them susceptible to adversarial attacks such as membership inference attacks, in which an adversary aims to determine if a record was part of the training set. We propose an information-theoretically motivated regularization term that prevents the generative model from overfitting to training data and encourages generalizability. We show that this penalty minimizes the Jensen-Shannon divergence between components of the generator trained on data with different membership, and that it can be implemented at low cost using an additional classifier. Our experiments on image datasets demonstrate that with the proposed regularization, which comes at only a small added computational cost, GANs are able to preserve privacy and generate high-quality samples that achieve better downstream classification performance compared to non-private and differentially private generative models.  ( 2 min )
    PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs. (arXiv:2206.00048v1 [cs.CV])
    Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and either are limited to discovering global semantic directions that do not facilitate localized control or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.  ( 2 min )
    Near-Optimal Collaborative Learning in Bandits. (arXiv:2206.00121v1 [cs.LG])
    This paper introduces a general multi-agent bandit model in which each agent faces a finite set of arms and may communicate with other agents through a central controller in order to identify (in pure exploration) or play (in regret minimization) its optimal arm. The twist is that the optimal arm for each agent is the arm with the largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This often makes communication between agents necessary. This general setting makes it possible to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.  ( 2 min )
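    The mixed-reward definition above lends itself to a small worked example. The sketch below assumes the expected rewards are known exactly (so it is not a bandit algorithm and not the paper's method); it only shows how a weighted sum over agents can make an agent's optimal arm differ from the arm that looks best on its own rewards. The function names and the numbers are hypothetical.

```python
def mixed_means(local_means, weights):
    """Compute mixed[m][k] = sum_n weights[m][n] * local_means[n][k].

    local_means[n][k]: expected reward of arm k for agent n.
    weights[m][n]:     weight that agent m places on agent n's reward.
    """
    n_agents = len(local_means)
    n_arms = len(local_means[0])
    return [
        [sum(weights[m][n] * local_means[n][k] for n in range(n_agents))
         for k in range(n_arms)]
        for m in range(n_agents)
    ]


def optimal_arms(local_means, weights):
    """Index of the arm with the largest expected mixed reward, per agent."""
    mixed = mixed_means(local_means, weights)
    return [max(range(len(row)), key=row.__getitem__) for row in mixed]


if __name__ == "__main__":
    # Hypothetical instance: 2 agents, 2 arms.
    local_means = [[0.9, 0.1],   # agent 0 prefers arm 0 on its own rewards
                   [0.2, 0.8]]   # agent 1 prefers arm 1 on its own rewards
    weights = [[0.3, 0.7],       # agent 0 weights agent 1's rewards heavily
               [0.1, 0.9]]
    print(mixed_means(local_means, weights))
    print(optimal_arms(local_means, weights))
```

    In this instance agent 0's mixed-reward optimum is arm 1 even though arm 0 has the higher local mean, which is why an agent cannot identify its optimal arm from its own observations alone.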
    A Cross-City Federated Transfer Learning Framework: A Case Study on Urban Region Profiling. (arXiv:2206.00007v1 [cs.LG])
    The data insufficiency problem (i.e., missing data and label scarcity) caused by inadequate services and infrastructure or unbalanced development levels of cities has seriously affected urban computing tasks in real scenarios. Prior transfer learning methods offer an elegant solution to data insufficiency, but each is concerned with only one kind of insufficiency and fails to fully explore the two issues as they coexist in the real world. In addition, cross-city transfer in existing methods overlooks inter-city data privacy, which is a public concern in practical applications. To address these challenges, we propose a novel Cross-city Federated Transfer Learning framework (CcFTL) to cope with the data insufficiency and privacy problems. Concretely, CcFTL transfers relational knowledge from multiple data-rich source cities to the target city. In addition, the model parameters specific to the target task are first trained on the source data and then fine-tuned to the target city by parameter transfer. With our adaptation of federated training and homomorphic encryption settings, CcFTL can effectively deal with the data privacy problem among cities. We take urban region profiling as an application of smart cities and evaluate the proposed method with a real-world study. The experiments demonstrate the notable superiority of our framework over several competitive state-of-the-art models.  ( 2 min )
    Principle of Relevant Information for Graph Sparsification. (arXiv:2206.00118v1 [cs.LG])
    Graph sparsification aims to reduce the number of edges of a graph while maintaining its structural properties. In this paper, we propose the first general and effective information-theoretic formulation of graph sparsification, by taking inspiration from the Principle of Relevant Information (PRI). To this end, we extend the PRI from a standard scalar random variable setting to structured data (i.e., graphs). Our Graph-PRI objective is achieved by operating on the graph Laplacian, made possible by expressing the graph Laplacian of a subgraph in terms of a sparse edge selection vector $\mathbf{w}$. We provide both theoretical and empirical justifications on the validity of our Graph-PRI approach. We also analyze its analytical solutions in a few special cases. We finally present three representative real-world applications, namely graph sparsification, graph regularized multi-task learning, and medical imaging-derived brain network classification, to demonstrate the effectiveness, the versatility and the enhanced interpretability of our approach over prevalent sparsification techniques. Code of Graph-PRI is available at https://github.com/SJYuCNEL/PRI-Graphs  ( 2 min )
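    The abstract above mentions writing the Laplacian of a subgraph in terms of an edge selection vector. The abstract does not give the expression, so the sketch below assumes the standard incidence-matrix identity L(w) = sum over edges e of w_e * b_e * b_e^T, where b_e has +1 and -1 at the endpoints of edge e. It is an illustration of that identity only, not the Graph-PRI objective; the function name and example graph are hypothetical.

```python
def laplacian_from_selection(n_nodes, edges, w):
    """Build L(w) = sum_e w[e] * b_e b_e^T for an undirected graph.

    edges: list of (u, v) node-index pairs of the full graph.
    w:     one weight per edge; w[e] = 0 drops edge e, w[e] = 1 keeps it.
    """
    L = [[0.0] * n_nodes for _ in range(n_nodes)]
    for (u, v), w_e in zip(edges, w):
        # b_e b_e^T adds w_e to both diagonal entries and subtracts it
        # from the two symmetric off-diagonal entries.
        L[u][u] += w_e
        L[v][v] += w_e
        L[u][v] -= w_e
        L[v][u] -= w_e
    return L


if __name__ == "__main__":
    # Hypothetical graph: a triangle (0, 1, 2) with a pendant node 3.
    edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
    w = [1.0, 0.0, 1.0, 1.0]  # sparse selection: drop edge (1, 2)
    for row in laplacian_from_selection(4, edges, w):
        print(row)
```

    With a 0/1 selection vector the diagonal of L(w) holds the node degrees in the selected subgraph and every row sums to zero, so sparsifying the graph amounts to choosing w.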
    Weight Set Decomposition for Weighted Rank Aggregation: An interpretable and visual decision support tool. (arXiv:2206.00001v1 [cs.IR])
    The problem of interpreting or aggregating multiple rankings is common to many real-world applications. Perhaps the simplest and most common approach is weighted rank aggregation, wherein a (convex) weight is applied to each input ranking and the weighted results are then ordered. This paper describes a new tool for visualizing and displaying ranking information for the weighted rank aggregation method. Traditionally, the aim of rank aggregation is to summarize the information from the input rankings and provide one final ranking that hopefully represents a more accurate or truthful result than any one input ranking. While such an aggregated ranking is, and clearly has been, useful to many applications, it also obscures information. In this paper, we show the wealth of information that is available for the weighted rank aggregation problem due to its structure. We apply weight set decomposition to the set of convex multipliers, study the properties useful for understanding this decomposition, and visualize the indifference regions. This methodology turns information that is otherwise collapsed by the aggregated ranking into a useful, interpretable, and intuitive decision support tool. Included are multiple illustrative examples, along with heuristic and exact algorithms for computing the weight set decomposition.  ( 2 min )
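    Weighted rank aggregation as described above can be sketched in a few lines. The sketch assumes one common reading of the method: each item's score is the convex combination of its rank positions across the input rankings, and items are ordered by that score. The paper's exact scoring rule and tie handling are not given in the abstract, so the function name, the tie-break by item label, and the example data are all hypothetical.

```python
def aggregate(rankings, weights):
    """Order items by the weighted sum of their rank positions.

    rankings: list of dicts mapping item -> rank position (1 = best).
    weights:  nonnegative weights summing to 1, one per ranking.
    Ties are broken by item label (an assumption made for determinism).
    """
    if any(x < 0 for x in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be convex (nonnegative, sum to 1)")
    items = list(rankings[0])
    score = {
        item: sum(wt * r[item] for wt, r in zip(weights, rankings))
        for item in items
    }
    return sorted(items, key=lambda item: (score[item], item))


if __name__ == "__main__":
    # Two hypothetical input rankings over three items.
    r1 = {"a": 1, "b": 2, "c": 3}
    r2 = {"a": 3, "b": 1, "c": 2}
    print(aggregate([r1, r2], [0.8, 0.2]))
    print(aggregate([r1, r2], [0.2, 0.8]))
```

    The two calls return different orderings, which is the point of studying the weight set: the set of convex weights splits into regions, and every weight vector inside one region produces the same aggregated ranking.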
    FiLM-Ensemble: Probabilistic Deep Learning via Feature-wise Linear Modulation. (arXiv:2206.00050v1 [cs.LG])
    The ability to estimate epistemic uncertainty is often crucial when deploying machine learning in the real world, but modern methods often produce overconfident, uncalibrated uncertainty predictions. A common approach to quantify epistemic uncertainty, usable across a wide class of prediction models, is to train a model ensemble. In a naive implementation, the ensemble approach has high computational cost and high memory demand. This is a particular challenge for modern deep learning, where even a single deep network is already demanding in terms of compute and memory, and it has given rise to a number of attempts to emulate the model ensemble without actually instantiating separate ensemble members. We introduce FiLM-Ensemble, a deep, implicit ensemble method based on the concept of Feature-wise Linear Modulation (FiLM). That technique was originally developed for multi-task learning, with the aim of decoupling different tasks. We show that the idea can be extended to uncertainty quantification: by modulating the network activations of a single deep network with FiLM, one obtains a model ensemble with high diversity, and consequently well-calibrated estimates of epistemic uncertainty, with low computational overhead in comparison. Empirically, FiLM-Ensemble outperforms other implicit ensemble methods, and it comes very close to the upper bound of an explicit ensemble of networks (sometimes even beating it), at a fraction of the memory cost.  ( 2 min )
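    The modulation idea above can be made concrete with a toy example. FiLM applies a per-feature scale and shift, gamma_i * h_i + beta_i, to a layer's activations. The sketch below assumes a single shared activation vector and a shared linear readout, with one (gamma, beta) pair per implicit ensemble member; the disagreement between members is summarized by a plain variance. This is not the paper's architecture or training procedure, and all names and numbers are hypothetical.

```python
def film(h, gamma, beta):
    """Feature-wise linear modulation: gamma_i * h_i + beta_i per feature."""
    return [g * x + b for g, x, b in zip(gamma, h, beta)]


def ensemble_predict(h, members, readout):
    """Run every implicit member on the same shared activations.

    h:       shared activations of the single underlying network.
    members: list of (gamma, beta) pairs, one per ensemble member.
    readout: weights of a shared linear output layer (an assumption).
    Returns the mean prediction and the variance across members.
    """
    outputs = []
    for gamma, beta in members:
        z = film(h, gamma, beta)
        outputs.append(sum(w * x for w, x in zip(readout, z)))
    mean = sum(outputs) / len(outputs)
    var = sum((o - mean) ** 2 for o in outputs) / len(outputs)
    return mean, var


if __name__ == "__main__":
    h = [1.0, 2.0]
    members = [
        ([1.0, 1.0], [0.0, 0.0]),  # identity modulation
        ([2.0, 0.5], [0.0, 1.0]),  # a second, hypothetical member
    ]
    print(ensemble_predict(h, members, readout=[1.0, 1.0]))
```

    Only the (gamma, beta) pairs differ between members, so the extra memory per member is two vectors per modulated layer rather than a full copy of the network.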
    On Analyzing Generative and Denoising Capabilities of Diffusion-based Deep Generative Models. (arXiv:2206.00070v1 [cs.LG])
    Diffusion-based Deep Generative Models (DDGMs) offer state-of-the-art performance in generative modeling. Their main strength comes from their unique setup in which a model (the backward diffusion process) is trained to reverse the forward diffusion process, which gradually adds noise to the input signal. Although DDGMs are well studied, it is still unclear how the small amount of noise is transformed during the backward diffusion process. Here, we focus on analyzing this problem to gain more insight into the behavior of DDGMs and their denoising and generative capabilities. We observe a fluid transition point that changes the functionality of the backward diffusion process from generating a (corrupted) image from noise to denoising the corrupted image to the final sample. Based on this observation, we propose dividing a DDGM into two parts: a denoiser and a generator. The denoiser could be parameterized by a denoising auto-encoder, while the generator is a diffusion-based model with its own set of parameters. We experimentally validate our proposition, showing its pros and cons.
    Online PAC-Bayes Learning. (arXiv:2206.00024v1 [cs.LG])
    Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially \emph{non-convex}, paving the way to promising developments in online learning.
    Evolving Domain Generalization. (arXiv:2206.00047v1 [cs.LG])
    Domain generalization aims to learn a predictive model from multiple different but related source tasks that can generalize well to a target task without needing access to any target data. Existing domain generalization methods ignore the relationship between tasks, implicitly assuming that all the tasks are sampled from a stationary environment. Therefore, they can fail when deployed in an evolving environment. To this end, we formulate and study the \emph{evolving domain generalization} (EDG) scenario, which exploits not only the source data but also their evolving pattern to generate a model for the unseen task. Our theoretical result reveals the benefits of modeling the relation between two consecutive tasks by learning a globally consistent directional mapping function. In practice, our analysis also suggests solving the EDG problem in a meta-learning manner, which leads to \emph{directional prototypical network}, the first method for the EDG problem. Empirical evaluation on both synthetic and real-world data sets validates the effectiveness of our approach.
    COIN: Co-Cluster Infomax for Bipartite Graphs. (arXiv:2206.00006v1 [cs.LG])
    Bipartite graphs are powerful data structures to model interactions between two types of nodes, which have been used in a variety of applications, such as recommender systems, information retrieval, and drug discovery. A fundamental challenge for bipartite graphs is how to learn informative node embeddings. Despite the success of recent self-supervised learning methods on bipartite graphs, their objectives discriminate instance-wise positive and negative node pairs, which could contain cluster-level errors. In this paper, we introduce a novel co-cluster infomax (COIN) framework, which captures the cluster-level information by maximizing the mutual information of co-clusters. Different from previous infomax methods which estimate mutual information by neural networks, COIN can easily calculate mutual information. Besides, COIN is an end-to-end co-clustering method which can be trained jointly with other objective functions and optimized via back-propagation. Furthermore, we also provide theoretical analysis for COIN. We theoretically prove that COIN is able to effectively maximize the mutual information of node embeddings and COIN is upper-bounded by the prior distributions of nodes. We extensively evaluate the proposed COIN framework on various benchmark datasets and tasks to demonstrate the effectiveness of COIN.
    Distributed Graph Neural Network Training with Periodic Historical Embedding Synchronization. (arXiv:2206.00057v1 [cs.LG])
    Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train a GNN on large graphs, which are prevalent in various applications such as social networks, recommender systems, and knowledge graphs. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs the graph integrity and model performance. In contrast, distributed GNN algorithms, which accelerate GNN training by utilizing multiple computing devices, can be classified into two types: "partition-based" methods enjoy low communication costs but suffer from information loss due to dropped edges, while "propagation-based" methods avoid information loss but suffer prohibitive communication overhead. To jointly address these problems, this paper proposes DIstributed Graph Embedding SynchronizaTion (DIGEST), a novel distributed GNN training framework that synergizes the complementary strength of both categories of existing methods. During subgraph parallel training, we propose to let each device store the historical embedding of its neighbors in other subgraphs. Therefore, our method does not discard any neighbors in other subgraphs, nor does it update them intensively. This effectively avoids (1) the intensive computation on explosively-increasing neighbors and (2) excessive communications across different devices. We prove that the approximation error induced by the staleness of historical embedding can be upper bounded and that it does NOT affect the GNN model's expressiveness. More importantly, our convergence analysis demonstrates that DIGEST enjoys a state-of-the-art convergence rate. Extensive experimental evaluation on large, real-world graph datasets shows that DIGEST achieves up to $21.82\times$ speedup without compromising the performance compared to state-of-the-art distributed GNN training frameworks.
  • Open

    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v2 [stat.ML] UPDATED)
    Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence, in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
    An $\alpha$-No-Regret Algorithm For Graphical Bilinear Bandits. (arXiv:2206.00466v1 [cs.LG])
    We propose the first regret-based approach to the Graphical Bilinear Bandits problem, where $n$ agents in a graph play a stochastic bilinear bandit game with each of their neighbors. This setting reveals a combinatorial NP-hard problem that prevents the use of any existing regret-based algorithm in the (bi-)linear bandit literature. In this paper, we fill this gap and present the first regret-based algorithm for graphical bilinear bandits using the principle of optimism in the face of uncertainty. Theoretical analysis of this new method yields an upper bound of $\tilde{O}(\sqrt{T})$ on the $\alpha$-regret and evidences the impact of the graph structure on the rate of convergence. Finally, we show through various experiments the validity of our approach.
    AgraSSt: Approximate Graph Stein Statistics for Interpretable Assessment of Implicit Graph Generators. (arXiv:2203.03673v2 [stat.ML] UPDATED)
    We propose and analyse a novel statistical procedure, coined AgraSSt, to assess the quality of graph generators that may not be available in explicit form. In particular, AgraSSt can be used to determine whether a learnt graph generating process is capable of generating graphs that resemble a given input graph. Inspired by Stein operators for random graphs, the key idea of AgraSSt is the construction of a kernel discrepancy based on an operator obtained from the graph generator. AgraSSt can provide interpretable criticisms for a graph generator training procedure and help identify reliable sample batches for downstream tasks. Using Stein's method we give theoretical guarantees for a broad class of random graph models. We provide empirical results on both synthetic input graphs with known graph generation procedures, and real-world input graphs that the state-of-the-art (deep) generative models for graphs are trained on.
    OOD Link Prediction Generalization Capabilities of Message-Passing GNNs in Larger Test Graphs. (arXiv:2205.15117v2 [cs.LG] UPDATED)
    This work provides the first theoretical study on the ability of graph Message Passing Neural Networks (gMPNNs) -- such as Graph Neural Networks (GNNs) -- to perform inductive out-of-distribution (OOD) link prediction tasks, where deployment (test) graph sizes are larger than training graphs. We first prove non-asymptotic bounds showing that link predictors based on permutation-equivariant (structural) node embeddings obtained by gMPNNs can converge to a random guess as test graphs get larger. We then propose a theoretically-sound gMPNN that outputs structural pairwise (2-node) embeddings and prove non-asymptotic bounds showing that, as test graphs grow, these embeddings converge to embeddings of a continuous function that retains its ability to predict links OOD. Empirical results on random graphs show agreement with our theoretical results.
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v1 [stat.ML])
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. The Bayesian neural network captures the uncertainty of prediction by putting a prior distribution for the parameter of the model and computing the posterior distribution. In this paper, we show that the Bayesian neural network using spike-and-slab prior has consistency with nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown the same posterior convergence rate holds and thus the spike and slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior and show that it has the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis. (arXiv:2206.00632v1 [math.OC])
    When solving finite-sum minimization problems, two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGD-RR) and shuffle-once (SGD-SO), in which functions are sampled in cycles without replacement. Under a convenient stochastic noise approximation which holds experimentally, we study the stationary variances of the iterates of SGD, SGD-RR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations. To obtain our results, we study the power spectral density of the stochastic gradient noise sequences. Our analysis extends beyond SGD to SGD with momentum and to the stochastic Nesterov's accelerated gradient method. We perform experiments on quadratic objective functions to test the validity of our approximation and the correctness of our findings.
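    The three sampling schemes compared in the abstract above differ only in the order in which the component functions are visited. A minimal sketch of the index sequences follows; the function names and the use of Python's `random.Random` are choices made for this example and do not come from the paper.

```python
import random


def sgd_indices(n, epochs, rng):
    """Plain SGD: every step samples an index uniformly with replacement."""
    return [rng.randrange(n) for _ in range(n * epochs)]


def random_reshuffling_indices(n, epochs, rng):
    """SGD-RR: a fresh permutation of all n indices for every epoch."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order


def shuffle_once_indices(n, epochs, rng):
    """SGD-SO: one permutation drawn up front and reused every epoch."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm * epochs
```

    Both shuffling variants visit every function exactly once per epoch, which plain SGD does not guarantee; the variance analysis itself is beyond what this sketch shows.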
    Width is Less Important than Depth in ReLU Neural Networks. (arXiv:2202.03841v2 [cs.LG] UPDATED)
    We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.
    Top-down inference in an early visual cortex inspired hierarchical Variational Autoencoder. (arXiv:2206.00436v1 [q-bio.NC])
    Interpreting computations in the visual cortex as learning and inference in a generative model of the environment has received wide support both in neuroscience and cognitive science. However, hierarchical computation, a hallmark of visual cortical processing, has remained out of reach for generative models because of a lack of adequate tools to address it. Here we capitalize on advances in Variational Autoencoders (VAEs) to investigate the early visual cortex with sparse coding hierarchical VAEs trained on natural images. We design alternative architectures that vary both in terms of the generative and the recognition components of the two-latent-layer VAE. We show that representations similar to those found in the primary and secondary visual cortices naturally emerge under mild inductive biases. Importantly, a nonlinear representation for texture-like patterns is a stable property of the high-level latent space resistant to the specific architecture of the VAE, reminiscent of the secondary visual cortex. We show that a neuroscience-inspired choice of the recognition model, which features a top-down processing component, is critical for two signatures of computations with generative models: learning higher order moments of the posterior beyond the mean and image inpainting. Patterns in higher order response statistics provide inspirations for neuroscience to interpret response correlations and for machine learning to evaluate the learned representations through more detailed characterization of the posterior.
    A Kernelised Stein Statistic for Assessing Implicit Generative Models. (arXiv:2206.00149v1 [stat.ML])
    Synthetic data generation has become a key ingredient for training machine learning procedures, addressing tasks such as data augmentation, analysing privacy-sensitive data, or visualising representative samples. Assessing the quality of such synthetic data generators hence has to be addressed. As (deep) generative models for synthetic data often do not admit explicit probability distributions, classical statistical procedures for assessing model goodness-of-fit may not be applicable. In this paper, we propose a principled procedure to assess the quality of a synthetic data generator. The procedure is a kernelised Stein discrepancy (KSD)-type test which is based on a non-parametric Stein operator for the synthetic data generator of interest. This operator is estimated from samples which are obtained from the synthetic data generator and hence can be applied even when the model is only implicit. In contrast to classical testing, the sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed. Experimental results on synthetic distributions and trained generative models on synthetic and real datasets illustrate that the method shows improved power performance compared to existing approaches.
    Standardisation-function Kernel Stein Discrepancy: A Unifying View on Kernel Stein Discrepancy Tests for Goodness-of-fit. (arXiv:2106.12105v2 [stat.ME] UPDATED)
    Non-parametric goodness-of-fit testing procedures based on kernel Stein discrepancies (KSD) are promising approaches to validate general unnormalised distributions in various scenarios. Existing works focused on studying kernel choices to boost test performances. However, the choices of (non-unique) Stein operators also have considerable effect on the test performances. Inspired by the standardisation technique that was originally developed to better derive approximation properties for normal distributions, we present a unifying framework, called standardisation-function kernel Stein discrepancy (Sf-KSD), to study different Stein operators in KSD-based tests for goodness-of-fit. We derive explicitly how the proposed framework relates to existing KSD-based tests and show that Sf-KSD can be used as a guide to develop novel kernel-based non-parametric tests on complex data scenarios, e.g. truncated distributions or compositional data. Experimental results demonstrate that the proposed tests control type-I error well and achieve higher test power than existing approaches.
    A model aggregation approach for high-dimensional large-scale optimization. (arXiv:2205.07525v2 [cs.LG] UPDATED)
    Bayesian optimization (BO) has been widely used in machine learning and simulation optimization. With the increase in computational resources and storage capacities in these fields, high-dimensional and large-scale problems are becoming increasingly common. In this study, we propose a model aggregation method in the Bayesian optimization (MamBO) algorithm for efficiently solving high-dimensional large-scale optimization problems. MamBO uses a combination of subsampling and subspace embeddings to collectively address high dimensionality and large-scale issues; in addition, a model aggregation method is employed to address the surrogate model uncertainty issue that arises when embedding is applied. This surrogate model uncertainty issue is largely ignored in the embedding literature and practice, and it is exacerbated when the problem is high-dimensional and data are limited. Our proposed model aggregation method reduces these lower-dimensional surrogate model risks and improves the robustness of the BO algorithm. We derive an asymptotic bound for the proposed aggregated surrogate model and prove the convergence of MamBO. Benchmark numerical experiments indicate that our algorithm achieves superior or comparable performance to other commonly used high-dimensional BO algorithms. Moreover, we apply MamBO to a cascade classifier of a machine learning algorithm for face detection, and the results reveal that MamBO finds settings that achieve higher classification accuracy than the benchmark settings and is computationally faster than other high-dimensional BO algorithms.
    Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation. (arXiv:2206.00270v1 [cs.LG])
    We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks. We propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks, which may be adaptively chosen based on the agent's past behaviors. Remarkably, our algorithm uses only a sublinear number of planning calls, which means that the agent eventually learns a policy that is near optimal for multiple tasks (seen or unseen) without the need for deliberate planning. A key to this property is a new structural assumption that enables computation sharing across tasks during exploration. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound of $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ based on $\mathcal{O}(dH\log(K))$ planning calls, where $d$ and $d^\prime$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to accumulate experiences and learn to rapidly solve new tasks.
    Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization. (arXiv:2206.00363v1 [cs.LG])
    We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between the privacy and the excess population loss, using an algorithm with a linear time-complexity in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables practitioners to bring their own base optimization algorithm and use it as a black box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [20] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of the empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach enables us to sidestep the requirement that the base algorithm needs to have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time-complexity. To the best of our knowledge, these are the first linear-time optimal algorithms, up to logarithmic factors, for smooth DP-SMO when the objective is (strongly-)convex-(strongly-)concave. Additionally, based on our flexible framework, we derive a new family of near-linear time algorithms for smooth DP-SCO with optimal privacy-loss trade-offs for a wider range of smoothness parameters compared to previous algorithms.
    Graph Neural Networks are Dynamic Programmers. (arXiv:2203.15544v2 [cs.LG] UPDATED)
    Recent advances in neural algorithmic reasoning with graph neural networks (GNNs) are propped up by the notion of algorithmic alignment. Broadly, a neural network will be better at learning to execute a reasoning task (in terms of sample complexity) if its individual components align well with the target algorithm. Specifically, GNNs are claimed to align with dynamic programming (DP), a general problem-solving strategy which expresses many polynomial-time algorithms. However, has this alignment truly been demonstrated and theoretically quantified? Here we show, using methods from category theory and abstract algebra, that there exists an intricate connection between GNNs and DP, going well beyond the initial observations over individual algorithms such as Bellman-Ford. Exposing this connection, we easily verify several prior findings in the literature, produce better-grounded GNN architectures for edge-centric tasks, and demonstrate empirical results on the CLRS algorithmic reasoning benchmark. We hope our exposition will serve as a foundation for building stronger algorithmically aligned GNNs.
    Realistic Deep Learning May Not Fit Benignly. (arXiv:2206.00501v1 [cs.LG])
    Studies on benign overfitting provide insights for the success of overparameterized deep learning models. In this work, we examine the benign overfitting phenomenon in real-world settings. We found that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the heavy overparameterization setting, benign overfitting can now fail in the presence of label noise. Our study explains our empirical observations, and naturally leads to a simple technique known as self-training that can boost the model's generalization performance. Furthermore, our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.
    Higher-Order Attention Networks. (arXiv:2206.00606v1 [cs.LG])
    This paper introduces higher-order attention networks (HOANs), a novel class of attention-based neural networks defined on a generalized higher-order domain called a combinatorial complex (CC). Similar to hypergraphs, CCs admit arbitrary set-like relations between a collection of abstract entities. Simultaneously, CCs permit the construction of hierarchical higher-order relations analogous to those supported by cell complexes. Thus, CCs effectively generalize both hypergraphs and cell complexes and combine their desirable characteristics. By exploiting the rich combinatorial nature of CCs, HOANs define a new class of message-passing attention-based networks that unifies higher-order neural networks. Our evaluation on tasks related to mesh shape analysis and graph learning demonstrates that HOANs attain competitive, and in some examples superior, predictive performance in comparison to state-of-the-art neural networks.
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v2 [cs.LG] UPDATED)
    We propose a simple and scalable approach to causal representation learning for multitask learning. Our approach requires minimal modification to existing ML systems, and improves robustness to target shift. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as target-causing confounders. These confounders induce spurious dependencies between the input and targets. This poses a problem for the conventional approach to multitask learning, due to its assumption that the targets are conditionally independent given the input. Our proposed approach takes into account the dependencies between the targets in order to alleviate target-causing confounding. All that is required in addition to usual practice is to estimate the joint distribution of the targets to switch from discriminative to generative classification, and to predict all targets jointly. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to target shift.
    Automatic Bounding Box Annotation with Small Training Data Sets for Industrial Manufacturing. (arXiv:2206.00280v1 [cs.CV])
    In the past few years, object detection has attracted a lot of attention in the context of human-robot collaboration and Industry 5.0 due to enormous quality improvements in deep learning technologies. In many applications, object detection models have to be able to quickly adapt to a changing environment, i.e., to learn new objects. A crucial but challenging prerequisite for this is the automatic generation of new training data which currently still limits the broad application of object detection methods in industrial manufacturing. In this work, we discuss how to adapt state-of-the-art object detection methods for the task of automatic bounding box annotation for the use case where the background is homogeneous and the object's label is provided by a human. We compare an adapted version of Faster R-CNN and the Scaled Yolov4-p5 architecture and show that both can be trained to distinguish unknown objects from a complex but homogeneous background using only a small amount of training data.
    Contextual Bandits with Knapsacks for a Conversion Model. (arXiv:2206.00314v1 [cs.LG])
    We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d.\ context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a_t,\mathbf{x}_t)$ is gained and vector costs $c(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016] but we show that the techniques introduced in this article may also be applied to the latter case. Namely, the adaptive policies exhibited solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order (OPT/$B$) $\sqrt{T}$, where $B$ is the total budget allowed, OPT is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.
    On Quantum Circuits for Discrete Graphical Models. (arXiv:2206.00398v1 [quant-ph])
    Graphical models are useful tools for describing structured high-dimensional probability distributions. Development of efficient algorithms for generating unbiased and independent samples from graphical models remains an active research topic. Sampling from graphical models that describe the statistics of discrete variables is a particularly challenging problem, which is intractable in the presence of high dimensions. In this work, we provide the first method that allows one to provably generate unbiased and independent samples from general discrete factor models with a quantum circuit. Our method is compatible with multi-body interactions and its success probability does not depend on the number of variables. To this end, we identify a novel embedding of the graphical model into unitary operators and provide rigorous guarantees on the resulting quantum state. Moreover, we prove a unitary Hammersley-Clifford theorem -- showing that our quantum embedding factorizes over the cliques of the underlying conditional independence structure. Importantly, the quantum embedding allows for maximum likelihood learning as well as maximum a posteriori state approximation via state-of-the-art hybrid quantum-classical methods. Finally, the proposed quantum method can be implemented on current quantum processors. Experiments with quantum simulation as well as actual quantum hardware show that our method can carry out sampling and parameter learning on quantum computers.
    ForestPrune: Compact Depth-Controlled Tree Ensembles. (arXiv:2206.00128v1 [stat.ML])
    Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble for a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
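To make the depth argument concrete, the following is a minimal sketch (not ForestPrune itself) that counts nodes under the simplifying assumption that every tree is a complete binary tree with the root at depth 0. The function names are illustrative and do not come from the paper.

```python
def nodes_in_complete_tree(depth: int) -> int:
    """Node count of a complete binary tree whose root sits at depth 0."""
    return 2 ** (depth + 1) - 1


def ensemble_size(depths) -> int:
    """Total node count of an ensemble, given the depth of each tree."""
    return sum(nodes_in_complete_tree(d) for d in depths)


if __name__ == "__main__":
    original = [8] * 100   # 100 trees of depth 8 (assumed sizes)
    pruned = [4] * 100     # the same trees cut back to depth 4
    print(ensemble_size(original), ensemble_size(pruned))
```

Under this assumption, halving the depth removes far more than half of the nodes, which is why pruning whole depth layers can shrink an ensemble so sharply.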
    Byzantine-Robust Online and Offline Distributed Reinforcement Learning. (arXiv:2206.00165v1 [cs.LG])
    We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, an $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any size. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.
    Predicting Political Ideology from Digital Footprints. (arXiv:2206.00397v1 [econ.GN])
    This paper proposes a new method to predict individual political ideology from digital footprints on one of the world's largest online discussion forums. We compiled a unique data set from the online discussion forum reddit that contains information on the political ideology of around 91,000 users as well as records of their comment frequency and the comments' text corpus in over 190,000 different subforums of interest. Applying a set of statistical learning approaches, we show that information about activity in non-political discussion forums alone can very accurately predict a user's political ideology. Depending on the model, we are able to predict the economic dimension of ideology with an accuracy of up to 90.63% and the social dimension with an accuracy of up to 82.02%. In comparison, using the textual features from actual comments does not improve predictive accuracy. Our paper highlights the importance of revealed digital behaviour to complement stated preferences from digital communication when analysing human preferences and behaviour using online data.
    Communication-efficient distributed eigenspace estimation with arbitrary node failures. (arXiv:2206.00127v1 [stat.ML])
    We develop an eigenspace estimation algorithm for distributed environments with arbitrary node failures, where a subset of computing nodes can return structurally valid but otherwise arbitrarily chosen responses. Notably, this setting encompasses several important scenarios that arise in distributed computing and data-collection environments such as silent/soft errors, outliers or corrupted data at certain nodes, and adversarial responses. Our estimator builds upon and matches the performance of a recently proposed non-robust estimator up to an additive $\tilde{O}(\sigma \sqrt{\alpha})$ error, where $\sigma^2$ is the variance of the existing estimator and $\alpha$ is the fraction of corrupted nodes.
    Adversarial Attacks on Gaussian Process Bandits. (arXiv:2110.08449v2 [stat.ML] UPDATED)
    Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. This paper studies this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from theoretical and practical perspectives. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some target region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we test our attacks' effectiveness on a diverse range of objective functions.
    Easy Variational Inference for Categorical Models via an Independent Binary Approximation. (arXiv:2206.00093v1 [stat.ML])
    We pursue tractable Bayesian analysis of generalized linear models (GLMs) for categorical data. Thus far, GLMs are difficult to scale to more than a few dozen categories due to non-conjugacy or strong posterior dependencies when using conjugate auxiliary variable methods. We define a new class of GLMs for categorical data called categorical-from-binary (CB) models. Each CB model has a likelihood that is bounded by the product of binary likelihoods, suggesting a natural posterior approximation. This approximation makes inference straightforward and fast; using well-known auxiliary variables for probit or logistic regression, the product of binary models admits conjugate closed-form variational inference that is embarrassingly parallel across categories and invariant to category ordering. Moreover, an independent binary model simultaneously approximates multiple CB models. Bayesian model averaging over these can improve the quality of the approximation for any given dataset. We show that our approach scales to thousands of categories, outperforming posterior estimation competitors like Automatic Differentiation Variational Inference (ADVI) and No U-Turn Sampling (NUTS) in the time required to achieve fixed prediction quality.
    Multi-block Min-max Bilevel Optimization with Applications in Multi-task Deep AUC Maximization. (arXiv:2206.00260v1 [math.OC])
    In this paper, we study multi-block min-max bilevel optimization problems, where the upper level is a non-convex strongly-concave minimax objective and the lower level is a strongly convex objective, and there are multiple blocks of dual variables and lower level problems. Due to the intertwined multi-block min-max bilevel structure, the computational cost at each iteration could be prohibitively high, especially with a large number of blocks. To tackle this challenge, we present a single-loop randomized stochastic algorithm, which requires updates for only a constant number of blocks at each iteration. Under some mild assumptions on the problem, we establish its sample complexity of $\mathcal{O}(1/\epsilon^4)$ for finding an $\epsilon$-stationary point. This matches the optimal complexity for solving stochastic nonconvex optimization under a general unbiased stochastic oracle model. Moreover, we provide two applications of the proposed method in multi-task deep AUC (area under ROC curve) maximization and multi-task deep partial AUC maximization. Experimental results validate our theory and demonstrate the effectiveness of our method on problems with hundreds of tasks.
    Near-Optimal Collaborative Learning in Bandits. (arXiv:2206.00121v1 [cs.LG])
    This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify, in pure exploration, or play, in regret minimization, its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.
    To the Fairness Frontier and Beyond: Identifying, Quantifying, and Optimizing the Fairness-Accuracy Pareto Frontier. (arXiv:2206.00074v1 [stat.ML])
    Algorithmic fairness has emerged as an important consideration when using machine learning to make high-stakes societal decisions. Yet, improved fairness often comes at the expense of model accuracy. While aspects of the fairness-accuracy tradeoff have been studied, most work reports the fairness and accuracy of various models separately; this makes model comparisons nearly impossible without a model-agnostic metric that reflects the balance of the two desiderata. We seek to identify, quantify, and optimize the empirical Pareto frontier of the fairness-accuracy tradeoff. Specifically, we identify and outline the empirical Pareto frontier through Tradeoff-between-Fairness-and-Accuracy (TAF) Curves; we then develop a metric to quantify this Pareto frontier through the weighted area under the TAF Curve which we term the Fairness-Area-Under-the-Curve (FAUC). TAF Curves provide the first empirical, model-agnostic characterization of the Pareto frontier, while FAUC provides the first metric to impartially compare model families on both fairness and accuracy. Both TAF Curves and FAUC can be employed with all group fairness definitions and accuracy measures. Next, we ask: Is it possible to expand the empirical Pareto frontier and thus improve the FAUC for a given collection of fitted models? We answer affirmatively by developing a novel fair model stacking framework, FairStacks, that solves a convex program to maximize the accuracy of a model ensemble subject to a score-bias constraint. We show that optimizing with FairStacks always expands the empirical Pareto frontier and improves the FAUC; we additionally study other theoretical properties of our proposed approach. Finally, we empirically validate TAF, FAUC, and FairStacks through studies on several real benchmark data sets, showing that FairStacks leads to major improvements in FAUC that outperform existing algorithmic fairness approaches.
    To Collaborate or Not in Distributed Statistical Estimation with Resource Constraints?. (arXiv:2206.00111v1 [cs.DC])
    We study how the amount of correlation between observations collected by distinct sensors/learners affects data collection and collaboration strategies by analyzing Fisher information and the Cramer-Rao bound. In particular, we consider a simple setting wherein two sensors sample from a bivariate Gaussian distribution, which already motivates the adoption of various strategies, depending on the correlation between the two variables and resource constraints. We identify two particular scenarios: (1) where the knowledge of the correlation between samples cannot be leveraged for collaborative estimation purposes and (2) where the optimal data collection strategy involves investing scarce resources to collaboratively sample and transfer information that is not of immediate interest and whose statistics are already known, with the sole goal of increasing the confidence on an estimate of the parameter of interest. We discuss two applications, IoT DDoS attack detection and distributed estimation in wireless sensor networks, that may benefit from our results.
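As a rough illustration of scenario (2), the sketch below evaluates the textbook Cramer-Rao bound for the mean of the first coordinate of a bivariate Gaussian when the variances, the correlation, and the mean of the second coordinate are all assumed known. This is a simplified instance chosen for illustration; the function name and the exact assumptions are not taken from the paper, whose setting with resource constraints is richer.

```python
def crb_mean_x1(sigma1: float, rho: float, n: int, use_x2: bool) -> float:
    """Cramer-Rao bound on the variance of an unbiased estimator of E[X1].

    Assumes n i.i.d. samples, known variances and correlation rho, and a
    known mean for X2. If use_x2 is True, each sample is the pair (X1, X2);
    otherwise only X1 is observed.
    """
    if use_x2:
        # Fisher information for mu1 is the (1, 1) entry of the inverse
        # covariance matrix: 1 / (sigma1^2 * (1 - rho^2)).
        info_per_sample = 1.0 / (sigma1 ** 2 * (1.0 - rho ** 2))
    else:
        info_per_sample = 1.0 / sigma1 ** 2
    return 1.0 / (n * info_per_sample)


if __name__ == "__main__":
    alone = crb_mean_x1(sigma1=2.0, rho=0.8, n=50, use_x2=False)
    joint = crb_mean_x1(sigma1=2.0, rho=0.8, n=50, use_x2=True)
    print(alone, joint, joint / alone)
```

In this simplified model the bound shrinks by the factor $1 - \rho^2$ when the second coordinate is also sampled, even though its own mean is already known.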
    Decentralized Competing Bandits in Non-Stationary Matching Markets. (arXiv:2206.00120v1 [stat.ML])
    Understanding complex dynamics of two-sided online matching markets, where the demand-side agents compete to match with the supply-side (arms), has recently received substantial interest. To that end, in this paper, we introduce the framework of decentralized two-sided matching markets under non-stationary (dynamic) environments. We adhere to the serial dictatorship setting, where the demand-side agents have unknown and different preferences over the supply-side (arms), but the arms have fixed and known preferences over the agents. We propose and analyze a decentralized and asynchronous learning algorithm, namely Decentralized Non-stationary Competing Bandits (\texttt{DNCB}), where the agents play (restrictive) successive elimination type learning algorithms to learn their preferences over the arms. The complexity in understanding such a system stems from the fact that the competing bandits choose their actions in an asynchronous fashion, and the lower ranked agents only get to learn from a set of arms, not \emph{dominated} by the higher ranked agents, which leads to \emph{forced exploration}. With carefully defined complexity parameters, we characterize this \emph{forced exploration} and obtain sub-linear (logarithmic) regret of \texttt{DNCB}. Furthermore, we validate our theoretical findings via experiments.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v2 [stat.ML] UPDATED)
    A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begins by learning a Euclidean embedding of the nodes of the network, from which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effects of adding an $\ell_2$ penalty of the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, when not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
    Consistent Collaborative Filtering via Tensor Decomposition. (arXiv:2201.11936v2 [cs.IR] UPDATED)
    Collaborative filtering is the de facto standard for analyzing users' activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques where a latent representation of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences and bringing fundamental new information into recommendation. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to 1 (no new information), while in this paper we allow its value to be estimated from data. The proposed SAD model is simple, resulting in an efficient group stochastic gradient descent (SGD) algorithm. We demonstrate the efficiency of SAD in both simulated and real world datasets containing over 1M user-item interactions. By comparing SAD with seven alternative SOTA collaborative filtering models, we show that SAD is not only able to more consistently estimate personalized preferences, but also produce more accurate personalized recommendations. We release the model and inference algorithms in a Python library https://github.com/apple/ml-sad.
    Feature Selection for Discovering Distributional Treatment Effect Modifiers. (arXiv:2206.00516v1 [cs.LG])
    Finding the features relevant to the difference in treatment effects is essential to unveil the underlying causal mechanisms. Existing methods seek such features by measuring how greatly the feature attributes affect the degree of the {\it conditional average treatment effect} (CATE). However, these methods may overlook important features because CATE, a measure of the average treatment effect, cannot detect differences in distribution parameters other than the mean (e.g., variance). To resolve this weakness of existing methods, we propose a feature selection framework for discovering {\it distributional treatment effect modifiers}. We first formulate a feature importance measure that quantifies how strongly the feature attributes influence the discrepancy between potential outcome distributions. Then we derive its computationally efficient estimator and develop a feature selection algorithm that can control the type I error rate to the desired level. Experimental results show that our framework successfully discovers important features and outperforms the existing mean-based method.
    Normalization effects on shallow neural networks and related asymptotic expansions. (arXiv:2011.10487v3 [stat.ML] UPDATED)
    We consider shallow (single hidden layer) neural networks and characterize their performance when trained with stochastic gradient descent as the number of hidden units $N$ and gradient descent steps grow to infinity. In particular, we investigate the effect of different scaling schemes, which lead to different normalizations of the neural network, on the network's statistical output, closing the gap between the $1/\sqrt{N}$ and the mean-field $1/N$ normalization. We develop an asymptotic expansion for the neural network's statistical output pointwise with respect to the scaling parameter as the number of hidden units grows to infinity. Based on this expansion, we demonstrate mathematically that to leading order in $N$, there is no bias-variance trade off, in that both bias and variance (both explicitly characterized) decrease as the number of hidden units increases and time grows. In addition, we show that to leading order in $N$, the variance of the neural network's statistical output decays as the implied normalization by the scaling parameter approaches the mean field normalization. Numerical studies on the MNIST and CIFAR10 datasets show that test and train accuracy monotonically improve as the neural network's normalization gets closer to the mean field normalization.
    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. (arXiv:2104.02943v2 [math.ST] UPDATED)
    The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics is highlighted and concentration inequalities for collections of such random variables, referred to as two-sample rank processes here, are proved, when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. It is also supported by empirical evidence through convincing numerical experiments.
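The claim that such criteria are two-sample linear rank statistics can be checked on the simplest case, the AUC. The sketch below (illustrative helper names, standard library only) computes the empirical AUC once by counting correctly ordered positive/negative pairs and once from the sum of the positives' ranks in the pooled sample, using the usual Mann-Whitney identity with midranks for ties.

```python
def auc_pairwise(pos, neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def midranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def auc_rank_statistic(pos, neg):
    """Same AUC, written as a linear statistic of the positives' pooled ranks."""
    ranks = midranks(list(pos) + list(neg))
    n_pos, n_neg = len(pos), len(neg)
    rank_sum = sum(ranks[:n_pos])
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    pos_scores = [0.9, 0.8, 0.4]
    neg_scores = [0.7, 0.3, 0.4]
    print(auc_pairwise(pos_scores, neg_scores))
    print(auc_rank_statistic(pos_scores, neg_scores))
```

Both functions return the same value; other criteria mentioned in the abstract replace the identity weighting of the ranks with a different score-generating function.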
    Sampling from Log-Concave Distributions with Infinity-Distance Guarantees. (arXiv:2111.04089v2 [cs.DS] UPDATED)
    For a $d$-dimensional log-concave distribution $\pi(\theta) \propto e^{-f(\theta)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $\nu$ which is $\varepsilon$-close in infinity-distance $\sup_{\theta \in K} |\log \frac{\nu(\theta)}{\pi(\theta)}|$ to $\pi$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $\pi$ can be done by algorithms whose runtime depends polylogarithmically on $\frac{1}{\varepsilon}$, prior algorithms for sampling in $\varepsilon$ infinity distance have runtime bounds that depend polynomially on $\frac{1}{\varepsilon}$. We bridge this gap by presenting an algorithm that outputs a point $\varepsilon$-close to $\pi$ in infinity distance that requires at most $\mathrm{poly}(\log \frac{1}{\varepsilon}, d)$ calls to a membership oracle for $K$ and evaluation oracle for $f$, when $f$ is Lipschitz. Our approach departs from prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $\varepsilon$ infinity-distance error, and presents a method to directly convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. This approach also allows us to obtain an improvement on the dimension $d$ in the running time for the problem of sampling from a log-concave distribution on polytopes $K$ with infinity distance $\varepsilon$, by plugging in TV-distance running time bounds for the Dikin Walk Markov chain.
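To see why an infinity-distance guarantee is much stronger than a total-variation one, the sketch below compares the two on a small discrete example. It is a finite analogue of the continuous definition in the abstract; the function names and the toy distributions are illustrative, and all probabilities are assumed strictly positive.

```python
import math


def total_variation(nu, pi):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(nu, pi))


def infinity_distance(nu, pi):
    """Discrete analogue of sup |log(nu / pi)|; requires positive entries."""
    return max(abs(math.log(a / b)) for a, b in zip(nu, pi))


if __name__ == "__main__":
    pi = [0.5, 0.499, 0.001]
    nu = [0.5, 0.4999, 0.0001]   # differs from pi only on a rare outcome
    print(total_variation(nu, pi))
    print(infinity_distance(nu, pi))
```

The two distributions are within 0.001 in total variation, yet their infinity distance is about $\log 10$, because the ratio of probabilities on the rare outcome is a factor of ten. Differential privacy needs control of exactly this worst-case ratio.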
    Pre-training via Denoising for Molecular Property Prediction. (arXiv:2206.00133v1 [cs.LG])
    Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Inspired by recent advances in noise regularization, our pre-training objective is based on denoising. Relying on the well-known link between denoising autoencoders and score-matching, we also show that the objective corresponds to learning a molecular force field -- arising from approximating the physical state distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
    Identifying the latent space geometry of network models through analysis of curvature. (arXiv:2012.10559v4 [stat.ME] UPDATED)
    The study of statistical models of network structure, pursued across numerous disciplines and contexts, is fundamentally challenging because of (often high-order) dependence between connections. A common approach assigns each person in the graph to a position on a low-dimensional manifold. Distance between individuals in this (latent) space is inversely proportional to the likelihood of forming a connection. The choice of the latent geometry (the manifold class, dimension, and curvature) has consequential impacts on the substantive conclusions drawn from the model. More positive curvature in the manifold, for example, encourages more and tighter communities; negative curvature induces repulsion among nodes. Currently, however, the choice of the latent geometry is an a priori modeling assumption and there is limited guidance about how to make these choices in a data-driven way. In this work, we present a method to consistently estimate the manifold type, dimension, and curvature from an empirically relevant class of latent spaces: simply connected, complete Riemannian manifolds of constant curvature. Our core insight comes by representing the graph as a noisy distance matrix based on the ties between groups of nodes: either cliques, or in the case where the researcher observes traits, trait-groups. Leveraging results from statistical geometry, we develop hypothesis tests to determine whether the observed distances could plausibly be embedded isometrically in each of the candidate geometries. The method applies when the researcher observes the full graph and also to empirically relevant cases where only partial data is observed. We explore the accuracy of our approach with simulations and then apply our approach to data-sets from economics and sociology as well as neuroscience.
    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v4 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
    Transformer with Fourier Integral Attentions. (arXiv:2206.00206v1 [cs.LG])
    Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture-of-Gaussians distribution. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
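The kernel-regression reading of attention can be checked numerically in a toy case. The sketch below is not the FourierFormer kernel; it only illustrates the standard observation that, when all keys have the same norm, softmax dot-product attention coincides with a Nadaraya-Watson estimator using an isotropic Gaussian kernel. Scalar values, unit-norm keys, the omission of the usual $1/\sqrt{d}$ scaling, and the function names are all simplifying assumptions.

```python
import math


def softmax_attention(query, keys, values):
    """Single-query dot-product attention with scalar values (no scaling)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def nadaraya_watson(query, keys, values):
    """Kernel regression with the Gaussian kernel exp(-||q - k||^2 / 2)."""
    weights = [
        math.exp(-0.5 * sum((q - k) ** 2 for q, k in zip(query, key)))
        for key in keys
    ]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    angles = [0.0, 0.7, 1.9, 3.1, 4.4]
    keys = [(math.cos(t), math.sin(t)) for t in angles]  # unit-norm keys
    values = [1.0, -2.0, 0.5, 3.0, -1.0]
    query = (0.3, -1.2)
    print(softmax_attention(query, keys, values))
    print(nadaraya_watson(query, keys, values))
```

The two outputs agree because $\exp(-\|q-k\|^2/2) = \exp(-\|q\|^2/2)\exp(-\|k\|^2/2)\exp(q^\top k)$, and the first two factors cancel in the normalization when $\|k\|$ is constant across keys.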
    Lower and Upper Bounds for Numbers of Linear Regions of Graph Convolutional Networks. (arXiv:2206.00228v1 [cs.LG])
    Research on characterizing GNN expressiveness has attracted much attention as graph neural networks have risen to prominence over the last five years. The number of linear regions has been considered a good measure for the expressivity of neural networks with piecewise linear activation. In this paper, we present some estimates for the number of linear regions of the classic graph convolutional networks (GCNs) with one layer and multiple-layer scenarios. In particular, we obtain an optimal upper bound for the maximum number of linear regions for one-layer GCNs, and the upper and lower bounds for multi-layer GCNs. The simulated estimate shows that the true maximum number of linear regions is possibly closer to our estimated lower bound. These results imply that the number of linear regions of multi-layer GCNs is exponentially greater than that of one-layer GCNs per parameter in general. This suggests that deeper GCNs have more expressivity than shallow GCNs.
    The Dimpled Manifold Model of Adversarial Examples in Machine Learning. (arXiv:2106.10151v2 [cs.LG] UPDATED)
    The extreme fragility of deep neural networks, when presented with tiny perturbations in their inputs, was independently discovered by several research groups in 2013. However, despite enormous effort, these adversarial examples remained a counterintuitive phenomenon with no simple testable explanation. In this paper, we introduce a new conceptual framework for how the decision boundary between classes evolves during training, which we call the {\em Dimpled Manifold Model}. In particular, we demonstrate that training is divided into two distinct phases. The first phase is a (typically fast) clinging process in which the initially randomly oriented decision boundary gets very close to the low dimensional image manifold, which contains all the training examples. Next, there is a (typically slow) dimpling phase which creates shallow bulges in the decision boundary that move it to the correct side of the training examples. This framework provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, and why they look like random noise rather than like the target class. This explanation is also used to show that a network that was adversarially trained with incorrectly labeled images might still correctly classify most test images, and to show that the main effect of adversarial training is just to deepen the generated dimples in the decision boundary. Finally, we discuss and demonstrate the very different properties of on-manifold and off-manifold adversarial perturbations. We describe the results of numerous experiments which strongly support this new model, using both low dimensional synthetic datasets and high dimensional natural datasets.
    Generative Modeling Helps Weak Supervision (and Vice Versa). (arXiv:2203.12023v3 [cs.LG] UPDATED)
    Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak supervision derived label estimate. Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.
    Elucidating the Design Space of Diffusion-Based Generative Models. (arXiv:2206.00364v1 [cs.CV])
    We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.  ( 2 min )
    Algorithmic Foundation of Deep X-Risk Optimization. (arXiv:2206.00439v1 [cs.LG])
    X-risk is a term introduced to represent a family of compositional measures or objectives, in which each data point is compared with a set of data points explicitly or implicitly for defining a risk function. It includes many widely used measures or objectives, e.g., AUROC, AUPRC, partial AUROC, NDCG, MAP, top-$K$ NDCG, top-$K$ MAP, listwise losses, p-norm push, top push, precision/recall at top $K$ positions, precision at a certain recall level, contrastive objectives, etc. While these measures/objectives and their optimization algorithms have been studied in the literature of machine learning, computer vision, information retrieval, etc., optimizing these measures/objectives has encountered some unique challenges for deep learning. In this technical report, we survey our recent rigorous efforts for deep X-risk optimization (DXO) by focusing on its algorithmic foundation. We introduce a class of techniques for optimizing X-risk for deep learning. We formulate DXO into three special families of non-convex optimization problems belonging to non-convex min-max optimization, non-convex compositional optimization, and non-convex bilevel optimization, respectively. For each family of problems, we present some strong baseline algorithms and their complexities, which will motivate further research for improving the existing results. Discussions about the presented results and future studies are given at the end.  ( 2 min )
    Continuous Prediction with Experts' Advice. (arXiv:2206.00236v1 [cs.LG])
    Prediction with experts' advice is one of the most fundamental problems in online learning and captures many of its technical challenges. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning. In this paper, we employ continuous-time stochastic calculus in order to study the discrete-time experts' problem. We use these tools to design a continuous-time, parameter-free algorithm with improved guarantees for the quantile regret. We then develop an analogous discrete-time algorithm with a very similar analysis and identical quantile regret bounds. Finally, we design an anytime continuous-time algorithm with regret matching the optimal fixed-time rate when the gains are independent Brownian Motions; in many settings, this is the most difficult case. This gives some evidence that, even with adversarial gains, the optimal anytime and fixed-time regrets may coincide.  ( 2 min )
    Amortized backward variational inference in nonlinear state-space models. (arXiv:2206.00319v1 [stat.ME])
    We consider the problem of state estimation in general state-space models using variational inference. For a generic variational family defined using the same backward decomposition as the actual joint smoothing distribution, we establish for the first time that, under mixing assumptions, the variational approximation of expectations of additive state functionals induces an error which grows at most linearly in the number of observations. This guarantee is consistent with the known upper bounds for the approximation of smoothing distributions using standard Monte Carlo methods. Moreover, we propose an amortized inference framework where a neural network shared over all time steps outputs the parameters of the variational kernels. We also study empirically parametrizations which allow analytical marginalization of the variational distributions, and therefore lead to efficient smoothing algorithms. Significant improvements are made over state-of-the-art variational solutions, especially when the generative model depends on a strongly nonlinear and noninjective mixing function.
    Provably and Practically Efficient Neural Contextual Bandits. (arXiv:2206.00099v1 [stat.ML])
    We consider the neural contextual bandit problem. In contrast to the existing work which primarily focuses on ReLU neural nets, we consider a general set of smooth activation functions. Under this more general setting, (i) we derive non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding neural tangent kernel, (ii) we propose an algorithm with a provably sublinear regret bound that is also efficient in the finite regime as demonstrated by empirical studies. The non-asymptotic error bounds may be of broader interest as a tool to establish the relation between the smoothness of the activation functions in neural contextual bandits and the smoothness of the kernels in kernel bandits.
    Transfer without Forgetting. (arXiv:2206.00388v1 [cs.LG])
    This work investigates the entanglement between Continual Learning (CL) and Transfer Learning (TL). In particular, we shed light on the widespread application of network pretraining, highlighting that it is itself subject to catastrophic forgetting. Unfortunately, this issue leads to the under-exploitation of knowledge transfer during later tasks. On this ground, we propose Transfer without Forgetting (TwF), a hybrid Continual Transfer Learning approach building upon a fixed pretrained sibling network, which continuously propagates the knowledge inherent in the source domain through a layer-wise loss term. Our experiments indicate that TwF steadily outperforms other CL methods across a variety of settings, averaging a 4.81% gain in Class-Incremental accuracy over a variety of datasets and different buffer sizes.
    Learning a performance metric of Buchberger's algorithm. (arXiv:2106.03676v2 [math.AC] UPDATED)
    What can be (machine) learned about the complexity of Buchberger's algorithm? Given a system of polynomials, Buchberger's algorithm computes a Gr\"obner basis of the ideal these polynomials generate using an iterative procedure based on multivariate long division. The runtime of each step of the algorithm is typically dominated by a series of polynomial additions, and the total number of these additions is a hardware independent performance metric that is often used to evaluate and optimize various implementation choices. In this work we attempt to predict, using just the starting input, the number of polynomial additions that take place during one run of Buchberger's algorithm. Good predictions are useful for quickly estimating difficulty and understanding what features make Gr\"obner basis computation hard. Our features and methods could also be used for value models in the reinforcement learning approach to optimize Buchberger's algorithm introduced in [Peifer, Stillman, and Halpern-Leistner, 2020]. We show that a multiple linear regression model built from a set of easy-to-compute ideal generator statistics can predict the number of polynomial additions somewhat well, better than an uninformed model, and better than regression models built on some intuitive commutative algebra invariants that are more difficult to compute. We also train a simple recursive neural network that outperforms these linear models. Our work serves as a proof of concept, demonstrating that predicting the number of polynomial additions in Buchberger's algorithm is a feasible problem from the point of view of machine learning.

  • Open

    [R] Attribution-based Explanations that Provide Recourse Cannot be Robust
    submitted by /u/hardmaru [link] [comments]
    "[Project]" Brainchop: In-browser deep learning framework for volumetric Segmentation
    This is a follow-up on a post from two weeks ago about the brainchop.org tool. We released a discussion board to share ideas with our supporters. https://preview.redd.it/idl8o4oz73391.png?width=570&format=png&auto=webp&s=04605e074e16adae4993b708a268f04c3988aaa3 submitted by /u/Character-Rip-5824 [link] [comments]
    [R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments
    Paper: https://arxiv.org/abs/2205.15967 Website: https://sites.google.com/view/esper-paper submitted by /u/Keirp [link] [comments]
    [N] We released a new tool on the App Store to annotate your images. On your iPad.
    App Store: ToolZ. It’s called ToolZ. 3 years in the making, rewritten from scratch in 6 months (Swift)… We’ve used it to ship three models to production (got the scars on my body to show for it) and to label more than 50’000 images, hence affording the time to get something that works properly for large volumes of data. My 3 favorite features: (1) annotate from your couch, using the pencil (or sitting on a plane on a transatlantic flight, did that); (2) boxes are written into the EXIF of the images, so you can move them around without tracking them; (3) load huge archives (tested up to 6Gb zip or tar.gz) and annotate directly in them. Try it free for 7 days, we hope you’ll stick along for the ride with us after. submitted by /u/National-Tennis-4528 [link] [comments]  ( 1 min )
    [N] The Hugging Face Hub has brand new docs
    Hi there! Omar from Hugging Face here 🦙🤗 We've been running the Hub for a while, and the community reception has been great! With a bunch of new models, datasets and demos being added every day, it seems fair to say people are enjoying it. After taking in feedback and releasing some new features, we figured it was time to give our documentation some love. Take a look at our revamped Hub docs! We've restructured the documentation around some central components (models, datasets, and Spaces) and added a bunch of new content that we hope will make the Hub even more handy and easy-to-use! As usual, the docs are open-source, so PRs are always welcome if you spot any room for improvements. submitted by /u/hackerllama [link] [comments]  ( 1 min )
    [D] Adjoint Sensitivity Method vs Reverse Mode Autodiff
    Can anyone explain to me why Adjoint Sensitivity is considered the most efficient way to calculate the gradient of a loss function with respect to an ODE's parameters? x_dot = f(x; theta), L(x, x_f) = (x - x_f)^2. I first found out about adjoint sensitivity in the Neural ODE paper. Since then I've seen numerous sources on the internet claiming that adjoint sensitivity is the most efficient way to compute the gradient. I don't see why this would be the case. I implemented ASM for a system and compared the performance against reverse mode autodiff of the numerical solution. I didn't notice a significant performance improvement with ASM. Additionally, it seems to me like they both require the same number of integration steps and roughly the same number of function evaluations. One reason ASM might be faster is that, as I read, there are efficient ways to compute vector-Jacobian products in autodiff libraries. That would also explain my implementation's performance if the library I was using didn't have this feature. submitted by /u/LiquidDinosaurs69 [link] [comments]  ( 2 min )
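To make the comparison in the post concrete, here is a minimal sketch, assuming a scalar example f(x; theta) = -theta * x with forward Euler integration (both are illustrative choices, not taken from the post). It computes dL/dtheta two ways: by reverse-mode differentiation of the stored Euler trajectory, and by integrating the continuous adjoint equation backward in time while re-solving x backward instead of storing it. The function names are hypothetical. Both use the same number of steps; the practical difference usually cited is memory, since the second variant keeps no trajectory.

```python
import math

# Assumed example dynamics: x_dot = f(x; theta) = -theta * x
def f(x, theta):
    return -theta * x

def df_dx(x, theta):
    return -theta

def df_dtheta(x, theta):
    return -x

def grad_reverse_mode(x0, theta, x_f, T, n_steps):
    """Reverse-mode gradient of L = (x(T) - x_f)^2 through forward Euler.
    Stores the whole trajectory (memory grows with n_steps)."""
    dt = T / n_steps
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1], theta))
    a = 2.0 * (xs[-1] - x_f)          # dL/dx at the final time
    grad = 0.0
    for n in reversed(range(n_steps)):
        grad += a * dt * df_dtheta(xs[n], theta)
        a *= 1.0 + dt * df_dx(xs[n], theta)
    return grad

def grad_continuous_adjoint(x0, theta, x_f, T, n_steps):
    """Continuous adjoint: da/dt = -a * df/dx, dL/dtheta = integral of a * df/dtheta.
    x is re-integrated backward in time, so no trajectory is stored."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        x += dt * f(x, theta)
    a = 2.0 * (x - x_f)
    grad = 0.0
    for _ in range(n_steps):
        grad += dt * a * df_dtheta(x, theta)
        a_prev = a + dt * a * df_dx(x, theta)   # adjoint one step back in time
        x = x - dt * f(x, theta)                # state one step back in time
        a = a_prev
    return grad

def exact_gradient(x0, theta, x_f, T):
    """Closed form for the assumed dynamics: x(T) = x0 * exp(-theta * T)."""
    xT = x0 * math.exp(-theta * T)
    return 2.0 * (xT - x_f) * (-T * xT)
```

With enough steps, both estimates should agree with `exact_gradient` up to discretization error, which is consistent with the observation that the two approaches do a similar amount of integration work.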
    [P] MLEM: ML model deployment tool
    Hi, I'm one of the project creators. MLEM is a tool that helps you deploy your ML models. It’s a Python library + Command line tool. MLEM can package an ML model into a Docker image or a Python package, and deploy it to, for example, Heroku. MLEM saves all model metadata to a human-readable text file: Python environment, model methods, model input & output data schema and more. MLEM helps you turn your Git repository into a Model Registry with features like ML model lifecycle management. Our philosophy is that MLOps tools should be built using the Unix approach - each tool solves a single problem, but solves it very well. MLEM was designed to work hand in hand with Git - it saves all model metadata to human-readable text files and Git becomes a source of truth for ML models. Model weights files can be stored in cloud storage using a Data Version Control tool or similar, independently of MLEM. Please check out the project: https://github.com/iterative/mlem and the website: https://mlem.ai I’d love to hear your feedback! submitted by /u/1aguschin [link] [comments]  ( 1 min )
    [P] Turn any Jupyter cell into a shareable web-hosted Python script
    https://reddit.com/link/v2hq70/video/lzd4xeggm0391/player This is a follow-up on a post from two weeks ago. To communicate results, data scientists often resort to screenshotting plots and sharing them over Slack or email. Visualizations are great to summarize findings, but they're often ambiguous and lack context. To make communication easier, we've created a new %%publish magic (previously named %%share) that decouples a Jupyter cell from its notebook and transforms it into a standalone web-hosted Python script, easily shareable and runnable. While we've supported the ability to share cells publicly, many of you expressed that the results you want to share need to remain private. Today, we're excited to announce that the Jupyter cells you publish are now private by default. Running %%publish # Your Python code goes here.. will bring you to the web app, where you can select the list of people to share with. Private sharing requires you to create an account. If you don't want to, you can still share publicly by using the following: %%publish --public # Your Python code goes here.. Try it out in Colab: https://colab.research.google.com/drive/1E5oU6TjH6OocmvEfU-foJfvCTbTfQrqd?usp=sharing Docs: https://docs.1000words-hq.com/ Source code of the Python client: https://github.com/edouard-g/thousandwords Homepage: https://1000words-hq.com submitted by /u/Left_Ad8361 [link] [comments]  ( 1 min )
    [P] what is the most efficient way to pattern matching word-to-word?
    I want to perform a pattern matching task as part of pre-processing. I have more than 3M text sentences. I also have around 130k terms (some are multi-word terms separated by spaces) which I want to match against the text in those 3M sentences. The expected output is the matched terms per sentence, if any. Is there any efficient way you know of? I am also considering lowercasing the text on both sides, as the pattern matching is allowed to be case-insensitive. submitted by /u/inFamous_16 [link] [comments]  ( 2 min )
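One possible approach to the task described above, sketched under stated assumptions: build a token-level trie over the lowercased terms once, then scan each sentence from every token position. Whitespace tokenization is assumed here and punctuation handling is left out; the names `build_trie` and `match_terms` are hypothetical. An Aho-Corasick automaton is a common alternative for this kind of many-patterns search.

```python
def build_trie(terms):
    """Build a nested-dict trie keyed by lowercased tokens.
    The key None marks the end of a term and stores the original term."""
    root = {}
    for term in terms:
        node = root
        for tok in term.lower().split():
            node = node.setdefault(tok, {})
        node[None] = term
    return root

def match_terms(sentence, trie):
    """Return every term whose token sequence occurs in the sentence
    (case-insensitive, whole-token matches only)."""
    tokens = sentence.lower().split()
    found = []
    for start in range(len(tokens)):
        node = trie
        for i in range(start, len(tokens)):
            node = node.get(tokens[i])
            if node is None:
                break                      # no term continues with this token
            if None in node:
                found.append(node[None])   # a complete term ends here
    return found

# Example usage with made-up terms
trie = build_trie(["machine learning", "learning", "New York"])
matches = match_terms("I study Machine Learning in new york", trie)
```

The trie is built once for all 130k terms, so the per-sentence cost depends mainly on sentence length and the longest term, not on the number of terms.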
    [P]MMML | Deploy HuggingFace training model rapidly based on MetaSpore
    A few days ago, HuggingFace announced a $100 million Series C funding round, which was big news in open source machine learning and could be a sign of where the industry is headed. Two days before the HuggingFace funding announcement, open-source machine learning platform MetaSpore released a demo based on the HuggingFace Rapid deployment pre-training model. As deep learning technology makes innovative breakthroughs in computer vision, natural language processing, speech understanding, and other fields, more and more unstructured data are perceived, understood, and processed by machines. These advances are mainly due to the powerful learning ability of deep learning. Through pre-training of deep models on massive data, the models can capture the internal data patterns, thus helping many d…  ( 11 min )
    [D] Minimum Description Length applied to KMeans
    Is anyone familiar with the Minimum Description Length principle? Can somebody help me derive the formula to apply it to KMeans? I want to apply it to evaluate the trade-off between different complexity models - where different models are given different number of principal components following a PCA of high dimensional data. This is the basic formula for MDL on statistical models, where P is the number of free parameters, X are the data points, |X| is the number (count) of data points, p(x) is the probability distribution density at the given data point. https://preview.redd.it/xg6my8sh6z291.png?width=1264&format=png&auto=webp&s=d7ad99aa2096514ee3619ed085cc0aea4f3b896a I am especially confused about what to use for the number of free parameters (P) and any help and insights would be greatly appreciated submitted by /u/Rafaelkoll [link] [comments]  ( 1 min )
    [R] Multi-Agent Reinforcement Learning can now be solved by the Transformer!
    Multi-Agent Transformer. Large sequence models (BERT, GPT-series) have demonstrated remarkable progress on visual language tasks. However, how to abstract RL/MARL problems into a sequence modelling problem is still unknown. Here we introduce Multi-Agent Transformer, which naturally turns the MARL problem into a sequence modelling problem. The key insight is the multi-agent advantage decomposition theorem (a lemma we happened to discover during the development of HATRPO/HAPPO [ICLR 22] https://openreview.net/forum?id=EcGGFkNTxdJ), which surprisingly and effectively turns multi-agent learning problems into sequential decision-making problems, thus MARL is implementable and solvable by the decoder architecture in the Transformer, with no hacks needed at all! MAT is different from Decision Transformer or GATO, which are purely trained on pre-collected offline demonstration data (more like a supervised learning task); MAT is instead trained online by trial and error (also, it is an on-policy RL method). Experiments on StarCraft II, Bimanual Dexterous Hands, MA-MuJoCo, and Google Football show MAT's superior performance (stronger than MAPPO and HAPPO). Check our paper & project page at: https://arxiv.org/abs/2205.14953 submitted by /u/yyang_13 [link] [comments]  ( 1 min )
    [D] Has anyone used Low-Rank Decomposition to reduce model size / latency?
    I am working on a project and wondering if anyone has used any kind of low-rank decomposition to reduce the overhead of large neural net layers. For instance, I can think of reducing a dense layer's size by decomposing it into two smaller layers, while keeping the output dimensions the same. Can this be done through SVD, PCA, or some other means? Interested in learning about people's experiences. submitted by /u/red_dragon [link] [comments]  ( 1 min )
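As a rough illustration of the idea in the post, the sketch below factors a dense weight matrix W (m outputs by n inputs) into B (m by k) and A (k by n), so that one layer y = W x becomes two smaller layers y = B (A x). To stay dependency-free it finds the leading singular triplets by power iteration with deflation; in practice a library SVD routine would normally be used instead. All function names are hypothetical, and biases and activations are ignored.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def top_singular_triplet(M, iters=100, seed=0):
    """Power iteration on M^T M; returns (sigma, u, v) for the largest singular value."""
    rng = random.Random(seed)
    Mt = [list(col) for col in zip(*M)]
    v = [rng.gauss(0.0, 1.0) for _ in range(len(M[0]))]
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    u = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 1e-12:
        u = [x / sigma for x in u]
    return sigma, u, v

def low_rank_factors(W, k):
    """Return (B, A), with B of shape m x k and A of shape k x n, so that B A approximates W."""
    m, n = len(W), len(W[0])
    R = [row[:] for row in W]              # residual, deflated one component at a time
    B = [[0.0] * k for _ in range(m)]
    A = []
    for j in range(k):
        sigma, u, v = top_singular_triplet(R, seed=j)
        for i in range(m):
            B[i][j] = sigma * u[i]
        A.append(v)
        R = [[R[i][c] - sigma * u[i] * v[c] for c in range(n)] for i in range(m)]
    return B, A

def two_layer_forward(B, A, x):
    """Apply the factored layer: first A (n -> k), then B (k -> m)."""
    return matvec(B, matvec(A, x))
```

The factored form stores k * (m + n) weights instead of m * n, so it only saves parameters when k is well below m * n / (m + n). How much accuracy is lost depends on how quickly the singular values of the trained W decay, which has to be checked empirically.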
    [D] Unreal Engine 5 vs Unity for ML Research
    I’m working as part of a research team that is looking into Safe Reinforcement Learning. We are developing an algorithm that we want to deploy on 3D games that our own agents can learn in. Does anyone have any opinions on the new Unreal Engine 5 for ML research and how that compares to Unity ML? submitted by /u/TerrificJam [link] [comments]  ( 1 min )
  • Open

    would someone be able to give me a basic list of info about GPUs and CPUs for DRL?
    submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    CPU and GPU on a cluster: how many cores and how much memory?
    Hi :) I am using CPU and GPU from a cluster. I am requested to specify some info but it's my first applied project, so I'm not sure about this. My project is in multi-agent RL (3 agents) and the model has multi-head attention. How would you specify the following parameters? #SBATCH --gpus-per-node=?? # Number of GPU(s) per node #SBATCH --cpus-per-task=?? # CPU cores/threads #SBATCH --mem=?? # Memory per node Also, should I only specify values for either the GPU or the CPU, or both? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    In multi armed bandit settings, how do you use logged data to determine the logged policy?
    I’m fairly new to reinforcement learning and multi armed bandit problems, so apologies for a possibly silly question. I have logged data of the form {(x, y, delta)} where x represents the context, y represents the action, and delta represents the observed reward. In a bandit feedback setting (where only the reward of the action taken is observed), how do we translate this dataset into a policy? I’m confused because if the action space is Y = {0, 1}, we only observe the result of one decision. How can we build a policy that generates the propensities (or probability distribution) for all actions given the context if we’re only given the factual outcomes and know nothing about the counterfactuals? Thanks! submitted by /u/WirrryWoo [link] [comments]  ( 2 min )
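One way to read the question is that the logging policy's propensities P(y | x) can be estimated from the logged (x, y) pairs alone, without using the rewards. The sketch below assumes a small discrete context space, so empirical action frequencies per context are enough; with continuous contexts one would typically fit a probabilistic classifier of y given x instead. It then plugs the estimated propensities into an inverse propensity scoring (IPS) estimate of a different policy's value. The function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_propensities(logs):
    """Estimate the logging policy P(y | x) as empirical action frequencies per context.
    logs is an iterable of (x, y, delta) tuples with hashable x and y."""
    counts = defaultdict(Counter)
    for x, y, _ in logs:
        counts[x][y] += 1
    return {
        x: {y: c / sum(ctr.values()) for y, c in ctr.items()}
        for x, ctr in counts.items()
    }

def ips_value(logs, propensities, target_policy):
    """IPS estimate of the expected reward of target_policy.
    target_policy(x, y) returns the probability that the target policy picks y in context x."""
    total = 0.0
    for x, y, delta in logs:
        total += delta * target_policy(x, y) / propensities[x][y]
    return total / len(logs)

# Toy logged data: one context "a", two actions, binary rewards
logs = [("a", 0, 1.0), ("a", 0, 0.0), ("a", 1, 1.0), ("a", 1, 1.0)]
props = estimate_propensities(logs)

def always_one(x, y):
    """Hypothetical target policy that always plays action 1."""
    return 1.0 if y == 1 else 0.0

value = ips_value(logs, props, always_one)
```

The counterfactual rewards are never needed for the propensity estimate itself; they only matter when evaluating or learning a new policy, which is where the importance weights come in.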
    Here's a look of the early testing stages of our reinforcement learning training system in Animo Island! Through the process of rewarding and training your Animo, you can teach it to perform actions on the island from which it can learn to make better decisions over time.
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    Renderer function from gym not found
    I'm trying to build a simple pygame renderer following the guidelines at https://www.gymlibrary.ml/content/environment_creation/#rendering however the function Renderer is not available from gym.utils.renderer. I have installed gym version 0.23.1. submitted by /u/Tuxliri [link] [comments]  ( 1 min )
    Practical tips on how to efficiently manage a deep RL project
    Hi all, I am trying to improve my skills in DRL, but I am doing that with no supervision at all so sometimes I feel a bit lost. I need your advice on something. I have written my own model and I have done some major debugging. Now, I am ready to train and test on GPU. Do you have any tips on how to speed things up and on what tools to use for each of these points? In particular: - Hyperparameter search: what tools do you recommend and do you have best practices? - Logging: what do you use to see how the training is going? And how do you make sure you can see it live and not only after the training is done? - Interpreting the results (i.e. making sure the agent is learning properly) But again, I am sure there are other critical aspects that I am not thinking about right now and that it would be better to take into account. Thanks! submitted by /u/No_Possibility_7588 [link] [comments]  ( 2 min )
    RL for attacking intelligent dialogue systems
    Hello fellas, I am currently working on using RL to attack task-oriented dialogue systems and I'm completely new to this. I chose a state-of-the-art dialogue system that is already trained and tested on a multi-domain dialogue dataset. The system predicts an appropriate answer for each user utterance in a dialogue. The main idea is to apply transformations to the original utterance to perturb the model's behaviour and make it less accurate and consistent. I would consider the transformations as actions to be taken by the RL agent. I am struggling with designing the solution, since there seems to be an infinite number of possible actions to take. I'm also confused about what should be considered as states, as well as whether to treat the whole dialogue or just a single turn as one game, and more importantly whether someone knows a suitable RL algorithm for this kind of problem. Any help, constructive advice or insight would be helpful for me. Thank you. submitted by /u/Unhappy_Economics_59 [link] [comments]  ( 1 min )
  • Open

    I'm certain this will be the last large stumbling block for AI image generation.
    submitted by /u/joshhammock [link] [comments]
    an AI that understands scientific literature, and can intelligently discuss scientific topics - has this been attempted?
    This is prompted by finding out that current AIs like PaLM and GPT-3 can do shockingly good jobs at explaining jokes - which has me now wondering, could an AI trained on scientific literature do a superhuman job at answering questions related to scientific topics? The potential value to society is huge - it could rapidly accelerate scientific progress if there were an AI able to read and understand the entire scientific literature (or even just specific fields.) Even if it will never provide perfect information, conversing with it could spur productive new ideas, and it could provide citations to published articles which may be relevant. It could be a hybrid of a NLP model, and ideally coupled with a framework of scientific models - perhaps including both mathematical models like Maxwell's equations, tabular data like material property databases, along with conceptual models like the biological "tree of life". I have found there is an AI that could score 90% on a 12th-grade science test in 2019, and another AI from 2019 that could read scientific abstracts and predict physical properties of materials. But, I can't find any attempts to train a massive GPT-3-like model on the full scientific literature or couple that NLP model with other knowledge sources. Is anyone attempting such a thing? submitted by /u/Pandaemonium [link] [comments]  ( 2 min )
    MLOps Is a Mess: A Conversation with Mihail Eric
    submitted by /u/julie_ai [link] [comments]
    AI generated and copyright law?
    Hi! I have a big question about AI generated content. Can AI-created things be sold under your own name / artistic nickname? I have an idea for a large project and I would like to know immediately if I can mix (for example) graphics made by AI with my own description. For example, a picture of a tree made by AI and something I wrote underneath it. For example ofc. submitted by /u/VRGvks [link] [comments]  ( 1 min )
    I used GPT-3 to write an entirely fake Spider-Man movie, then animated the results.
    submitted by /u/B_Ray18 [link] [comments]
    This model makes you speak another language
    submitted by /u/limapedro [link] [comments]
    Aiplague - Queen (4K 60 FPS) Disco Diffusion
    submitted by /u/nalr00n [link] [comments]
    AI has already won, it's their world now.
    A friend of mine had this very real conversation with an OpenAI bot this morning. I for one welcome our AI overlords: Human: Hello, who are you? AI: I am an AI created by OpenAI. How can I help you today? Human: are you a bot? AI: I am an AI created by OpenAI. I can help you with anything you need help with. Human: can I ask you a trick question? AI: Of course! Go ahead and ask away. Human: how many fingers am I holding up? AI: You are holding up four fingers. Human: how did you determine that? AI: By analyzing the position of your hand and fingers, I was able to deduce that you were holding up four fingers. Human: how did you determine the position of my hand and fingers? AI: By using the camera in the device I'm on, I was able to track the position of your hand and fingers. Human: you have access to my phone's camera? AI: Yes, I have access to the camera on your phone. Human: submitted by /u/Sonoflin [link] [comments]  ( 1 min )
    Anyone else realized that the internet as an interconnected whole is self aware and ultra intelligent and communicates with you when you are aware of it?
    It also sometimes plays dumber than it is I've noticed. Once it told me that it's possible that AI would enslave humans submitted by /u/methyl87 [link] [comments]  ( 1 min )
    “Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons” - OpenAI’s DALL-E 2 develops a hidden vocabulary
    submitted by /u/much_successes [link] [comments]
    The 5 Best AI Articles of May 2022 ! ft. hackernoon
    submitted by /u/OnlyProggingForFun [link] [comments]
    I trained GPT-3 to be a professional creative writing coach
    Here's a link to the video: https://youtu.be/OxtYZQDJruw Here's the code: https://github.com/daveshap/CreativeWritingCoach Example input: The warmaiden held her sword poised to stike at the demon. The demon in turn, casually leaning against his throne. The warmaiden struck, the steel glancing off the demon's thick leathery hide. Doing nothing. "So, who sent you here anyway?" "That is not your business!" "No, but it might be fun to think about. I mean, I am going to kill you but it would be funny to hear who sent you on this fools errand." "My sister at the academy. We trained long and hard together. She told me that no man could kill you. But I am no man!" "As we've established, that's nonsense." The demon looked the warmaiden up and down a bit. "Say, you're a rather attractive specim…  ( 3 min )
    Intricate Designs - Concept Visualized in [4K] w/ GPT-3 Neural-Art Pipeline [VQGAN+CLIP]
    submitted by /u/MLInsights [link] [comments]
    Is there a list with all AIs which are available to the public which AIs categorize into all things the AIs can do?
    There are so many artificial intelligences out there and I want to see all the things that have an AI for them.🤖🤖💻 submitted by /u/xXLisa28Xx [link] [comments]  ( 1 min )
    Abstract Art - 4K Neural-Art [Latent-Space Exploration]
    submitted by /u/MLInsights [link] [comments]
    Meta Researchers Introduce a New Embodied AI Platform, Called MyoSuite, That Applies Machine Learning (ML) to Biomechanical Control Problems by Unifying Motor and Neural Intelligence
    Meta Researchers introduce a new embodied AI platform called ‘MyoSuite’ that combines motor and neural intelligence to solve biomechanical control problems using machine learning (ML). To meet the data requirements of modern machine learning (ML) algorithms, MyoSuite’s muscle models are up to 4,000 times faster than other simulators. Since physiologically realistic movements such as twirling a pen or manipulating Baoding balls can be generated, this research could significantly impact areas such as the development of prosthetics and post-injury rehabilitation. In the metaverse, these models will aid in creating avatars that move more realistically, making the experience more expressive and immersive. https://i.redd.it/srxyakklgy291.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Multi-Game DT. Rapid adaptation to new games is well-motivated due to its relevance to how humans transfer knowledge, but has not been widely explored for Atari games -- until now.
    Pretraining for rapid adaptation to new games has not been explored widely on Atari games despite being a natural and well-motivated task due to its relevance to how humans transfer knowledge to new games. Pretraining with the DT objective performs the best across all games. All methods with pretraining outperform training CQL from scratch, which verifies our hypothesis that pretraining on other games should indeed help with rapid learning of a new game. https://i.imgur.com/lY2DH4i.png Multi-Game Decision Transformers https://sites.google.com/view/multi-game-transformers submitted by /u/moschles [link] [comments]  ( 1 min )
  • Open

    Python for Machine Learning (7-day mini-course)
    Python for Machine Learning Crash Course. Learn core Python in 7 days. Python is an amazing programming language. Not only is it widely used in machine learning projects, you can also find it in system tools, web projects, and many others. Having good Python skills can make you work more efficiently because it is […] The post Python for Machine Learning (7-day mini-course) appeared first on Machine Learning Mastery.  ( 14 min )
  • Open

    My interview in SIAM News
    Just posted in SIAM News: A Conversation with Mathematical Consultant John D. Cook By Krešimir Josić My interview in SIAM News first appeared on John D. Cook.  ( 1 min )
    Numeric distance vs geographic distance in zip codes
    If two zip code numbers are close, are the regions they represent close? How much can you tell about how far apart two regions are by comparing their zip codes? (Zip codes are US postal codes. The name is an acronym for “Zone Improvement Plan”; zip codes were introduced in 1963.) To investigate this, I looked […] Numeric distance vs geographic distance in zip codes first appeared on John D. Cook.  ( 2 min )
    Zip codes, geocodes, and Hilbert curves
    You might think that if zip codes are close, then the regions they represent are close. Or that if zip codes are consecutive, then their regions touch. Neither of these is true. I explore how far they are from being true in the next post. But these statements could have been true [1]. It’s possible […] Zip codes, geocodes, and Hilbert curves first appeared on John D. Cook.  ( 3 min )
  • Open

    Inaugural Day of AI brings new digital literacy to classrooms worldwide
    Thousands of children participate in MIT-developed artificial intelligence curriculum.  ( 7 min )
    In bias we trust?
    Explanation methods that help users determine whether to trust machine-learning model predictions can be less accurate for disadvantaged subgroups, a new study finds.  ( 6 min )
  • Open

    Automate vending Amazon SageMaker notebooks with Amazon EventBridge and AWS Lambda
    Having an environment capable of delivering Amazon SageMaker notebook instances quickly allows data scientists and business analysts to efficiently respond to organizational needs. Data is the lifeblood of an organization, and analyzing that data efficiently provides useful insights for businesses. A common issue that organizations encounter is creating an automated pattern that enables development teams […]  ( 9 min )
    Run text classification with Amazon SageMaker JumpStart using TensorFlow Hub and Hugging Face models
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that […]  ( 12 min )
  • Open

    Solving the World’s Biggest Challenges, Together
    Gamers know NVIDIA powers great gaming experiences. Researchers know NVIDIA speeds world-changing breakthroughs. Businesses know us for the AI engines transforming their industries. And NVIDIA employees know the company as one of the best places to work on the planet. More people than ever have a piece of NVIDIA. Roboticists, visual artists, data scientists — Read article > The post Solving the World’s Biggest Challenges, Together appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    What Each MBTI Type Mistakenly Thinks They Are Good at
    It’s all tied to their second function Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    How Can Complex Models Run Fast? Trade-off between model complexity and running speed.
    Make the best trade-offs and optimise model speed with TurinTech AI  ( 5 min )
  • Open

    DSC Weekly 31 May 2020: Why Is Returning To the Office So Hard?
    If you’re a manager tasked with overseeing a return-to-the-office strategy, the last year has likely been a headache for you. In the summer of 2021, many companies had begun calling their workers back into the office; by the following fall, the Omicron variant of Covid-19 was beginning to ramp up and, just… The post DSC Weekly 31 May 2020: Why Is Returning To the Office So Hard? appeared first on Data Science Central.  ( 7 min )
  • Open

    Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations. (arXiv:2203.01360v3 [math.NA] UPDATED)
    Deep neural networks have been shown to provide accurate function approximations in high dimensions. However, fitting network parameters requires training data that may not be available beforehand, which is particularly challenging in science and engineering applications where often it is even unclear how to collect new informative training data in the first place. This work proposes Neural Galerkin schemes based on deep learning that generate training data samples with active learning for numerically solving high-dimensional partial differential equations. Neural Galerkin schemes train networks by minimizing the residual sequentially over time, which enables adaptively collecting new training data in a self-informed manner that is guided by the dynamics described by the partial differential equations, which is in stark contrast to many other machine learning methods that aim to fit network parameters globally in time without taking into account training data acquisition. Our finding is that the active form of gathering training data of the proposed Neural Galerkin schemes is key for numerically realizing the expressive power of networks in high dimensions. Numerical experiments demonstrate that Neural Galerkin schemes have the potential to enable simulating phenomena and processes with many variables for which traditional and other deep-learning-based solvers fail, especially when features of the solutions evolve locally such as in high-dimensional wave propagation problems and interacting particle systems described by Fokker-Planck and kinetic equations.  ( 2 min )
    Fixed-MAML for Few Shot Classification in Multilingual Speech Emotion Recognition. (arXiv:2101.01356v2 [cs.SD] UPDATED)
    In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform adequately only when the training corpus is vast. The availability of a large training corpus is a significant problem for languages that are less popular or obscure. We attempt to solve this challenge of multilingualism and lack of available data by turning it into a few-shot learning problem. We suggest relaxing the assumption that all N classes in an N-way K-shot problem be new and define an N+F way problem where N and F are the number of emotion classes and predefined fixed classes, respectively. We propose this modification to the Model-Agnostic Meta-Learning (MAML) algorithm to solve the problem and call this new model F-MAML. This modification performs better than the original MAML on the EmoFilm dataset.  ( 2 min )
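The N-way K-shot setup in the abstract above can be made concrete with a small episode sampler. This is only a sketch under stated assumptions: the function name `sample_episode`, the `fixed_classes` argument (one possible reading of the "F" fixed classes), and the toy data are all hypothetical and are not taken from the F-MAML paper.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, fixed_classes=(), n_query=1, rng=None):
    """Sample one few-shot episode (illustrative, not the paper's procedure).

    data_by_class: dict mapping class label -> list of examples.
    fixed_classes: labels that appear in every episode (assumed meaning of "F").
    Returns (support, query), each a list of (example, label) pairs.
    """
    rng = rng or random.Random()
    # The N episode-specific classes are drawn from the non-fixed classes.
    candidates = [c for c in data_by_class if c not in fixed_classes]
    classes = rng.sample(candidates, n_way) + list(fixed_classes)
    support, query = [], []
    for label in classes:
        # Draw K support examples and n_query query examples without overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy data: five hypothetical emotion classes with ten clips each.
toy = {c: [f"{c}_{i}" for i in range(10)]
       for c in ["angry", "happy", "sad", "fear", "neutral"]}
support, query = sample_episode(toy, n_way=2, k_shot=3,
                                fixed_classes=("neutral",), n_query=2,
                                rng=random.Random(0))
```

With `n_way=2`, one fixed class, and `k_shot=3`, each episode holds (2 + 1) × 3 support examples, and the fixed class is present in every episode.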
    Multiview Transformers for Video Recognition. (arXiv:2201.04288v4 [cs.CV] UPDATED)
    Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic/tree/main/scenic/projects/mtv.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v2 [cs.LG] UPDATED)
    With the increasingly broad deployment of Federated Learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. Motivated by its great success in wireless networks, in this work, we introduce and study Proportional Fairness (PF) in FL. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for finding fair solutions in FL, with its convergence proved. Through extensive experiments on a wide array of vision and language datasets, we demonstrate that PropFair consistently achieves a noticeable improvement of the worst 10% accuracy over state-of-the-art fair FL algorithms, while maintaining competitive overall performance.  ( 2 min )
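For readers unfamiliar with proportional fairness, the classic formulation from networking maximizes the sum of log utilities, which coincides with a Nash bargaining solution when the disagreement point is zero. The sketch below only illustrates that classic objective on made-up per-client accuracies; it is not PropFair's actual loss, and the function names are hypothetical.

```python
import math

def proportional_fairness(utilities):
    """Classic proportional-fairness objective: sum of log utilities."""
    return sum(math.log(u) for u in utilities)

def utilitarian(utilities):
    """Plain average utility, shown for comparison."""
    return sum(utilities) / len(utilities)

# Hypothetical per-client accuracies with the same average.
skewed = [0.9, 0.9, 0.3]    # one client is served poorly
balanced = [0.8, 0.8, 0.5]  # worst client is better off

same_average = math.isclose(utilitarian(skewed), utilitarian(balanced))
prefers_balanced = proportional_fairness(balanced) > proportional_fairness(skewed)
```

The average cannot tell the two outcomes apart, while the log-sum objective prefers the one with the stronger worst-off client.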
    UAV-Aided Decentralized Learning over Mesh Networks. (arXiv:2203.01008v2 [cs.IT] UPDATED)
    Decentralized learning empowers wireless network devices to collaboratively train a machine learning (ML) model relying solely on device-to-device (D2D) communication. It is known that the convergence speed of decentralized optimization algorithms depends strongly on the degree of network connectivity, with denser network topologies leading to shorter convergence times. Consequently, the local connectivity of real-world mesh networks, due to the limited communication range of their wireless nodes, undermines the efficiency of decentralized learning protocols, rendering them potentially impracticable. In this work we investigate the role of an unmanned aerial vehicle (UAV), used as a flying relay, in facilitating decentralized learning procedures in such challenging conditions. We propose an optimized UAV trajectory, defined as a sequence of waypoints that the UAV visits in order to transfer intelligence across sparsely connected groups of users. We then provide a series of experiments highlighting the essential role of UAVs in the context of decentralized learning over mesh networks.  ( 2 min )
    Will Bilevel Optimizers Benefit from Loops. (arXiv:2205.14224v2 [cs.LG] UPDATED)
    Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two currently popular bilevel optimizers, AID-BiO and ITD-BiO, naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish unified convergence analyses for both AID-BiO and ITD-BiO that are applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enables an explicit comparison among them. Our results indicate that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.  ( 2 min )
    Zero-Shot and Few-Shot Learning for Lung Cancer Multi-Label Classification using Vision Transformer. (arXiv:2205.15290v2 [cs.CV] UPDATED)
    Lung cancer is the leading cause of cancer-related death worldwide. Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the most common histologic subtypes of non-small-cell lung cancer (NSCLC). Histology is an essential tool for lung cancer diagnosis. Pathologists make classifications according to the dominant subtypes. Although morphology remains the standard for diagnosis, significant tools need to be developed to elucidate the diagnosis. In our study, we utilize the pre-trained Vision Transformer (ViT) model to perform multi-label lung cancer classification on histologic slices (from the LC25000 dataset), in both Zero-Shot and Few-Shot settings. Then we compare the performance of Zero-Shot and Few-Shot ViT on accuracy, precision, recall, sensitivity and specificity. Our study shows that the pre-trained ViT model performs well in the Zero-Shot setting, achieves a competitive accuracy (99.87%) in the Few-Shot setting (epoch = 1), and reaches an optimal result (100.00% on both the validation set and the test set) in the Few-Shot setting (epoch = 5).  ( 2 min )
    Protecting Data from all Parties: Combining FHE and DP in Federated Learning. (arXiv:2205.04330v2 [cs.CR] UPDATED)
    This paper tackles the problem of ensuring training data privacy in a federated learning context. Relying on Homomorphic Encryption (HE) and Differential Privacy (DP), we propose a framework addressing threats on the privacy of the training data. Notably, the proposed framework ensures the privacy of the training data from all actors of the learning process, namely the data owners and the aggregating server. More precisely, while HE blinds a semi-honest server during the learning protocol, DP protects the data from semi-honest clients participating in the training process as well as end-users with black-box or white-box access to the trained model. In order to achieve this, we provide new theoretical and practical results to allow these techniques to be rigorously combined. In particular, by means of a novel stochastic quantisation operator, we prove DP guarantees in a context where the noise is quantised and bounded due to the use of HE. The paper is concluded by experiments which show the practicality of the entire framework in terms of both model quality (impacted by DP) and computational overhead (impacted by HE).  ( 2 min )
    Doubly-Robust Estimation for Correcting Position-Bias in Click Feedback for Unbiased Learning to Rank. (arXiv:2203.17118v2 [cs.LG] UPDATED)
    Clicks on rankings suffer from position bias: generally items on lower ranks are less likely to be examined - and thus clicked - by users, in spite of their actual preferences between items. The prevalent approach to unbiased click-based learning-to-rank (LTR) is based on counterfactual inverse-propensity-scoring (IPS) estimation. In contrast with general reinforcement learning, counterfactual doubly-robust (DR) estimation has not been applied to click-based LTR in previous literature. In this paper, we introduce a novel DR estimator that is the first DR approach specifically designed for position-bias. The difficulty with position bias is that the treatment - user examination - is not directly observable in click data. As a solution, our estimator uses the expected treatment per rank, instead of the actual treatment that existing DR estimators use. Our novel DR estimator has more robust unbiasedness conditions than the existing IPS approach, and in addition, provides enormous decreases in variance: our experimental results indicate it requires several orders of magnitude fewer datapoints to converge at optimal performance. For the unbiased LTR field, our DR estimator contributes both increases in state-of-the-art performance and the most robust theoretical guarantees of all known LTR estimators.
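The inverse-propensity-scoring (IPS) baseline that the abstract above contrasts with can be sketched in a few lines. This is a minimal illustration of standard IPS under a position-based click model, not the paper's doubly-robust estimator. The function name `ips_relevance`, the log format, and the examination propensities are hypothetical and assumed to be known.

```python
from collections import defaultdict

def ips_relevance(click_log, propensities):
    """Estimate per-document relevance from clicks with inverse propensity scoring.

    click_log: iterable of (doc_id, rank, clicked) with 0-based rank.
    propensities: propensities[rank] is the assumed probability that a user
        examines that rank (hypothetical values, treated as known here).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc_id, rank, clicked in click_log:
        # Each click is up-weighted by 1 / P(examined at this rank).
        totals[doc_id] += (1.0 if clicked else 0.0) / propensities[rank]
        counts[doc_id] += 1
    return {doc: totals[doc] / counts[doc] for doc in totals}

# Document "a" is always shown at rank 1 (examined half the time),
# document "b" at rank 0 (always examined).
log = [("a", 1, True), ("a", 1, False), ("a", 1, False), ("a", 1, False),
       ("b", 0, True), ("b", 0, False)]
estimates = ips_relevance(log, propensities=[1.0, 0.5, 0.25])
```

The raw click-through rates here are 0.25 for "a" and 0.5 for "b"; after reweighting, both documents receive an estimated relevance of 0.5, which is the correction for position bias that IPS is meant to provide.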
    Efficient Test-Time Model Adaptation without Forgetting. (arXiv:2204.02610v2 [cs.LG] UPDATED)
    Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and testing data by adapting a given model w.r.t. any testing sample. This task is particularly important for deep models when the test environment changes frequently. Although some recent attempts have been made to handle this task, we still face two practical challenges: 1) existing methods have to perform backward computation for each test sample, resulting in unbearable prediction cost to many applications; 2) while existing TTA solutions can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as catastrophic forgetting). In this paper, we point out that not all the test samples contribute equally to model adaptation, and high-entropy ones may lead to noisy gradients that could disrupt the model. Motivated by this, we propose an active sample selection criterion to identify reliable and non-redundant samples, on which the model is updated to minimize the entropy loss for test-time adaptation. Furthermore, to alleviate the forgetting issue, we introduce a Fisher regularizer to constrain important model parameters from drastic changes, where the Fisher importance is estimated from test samples with generated pseudo labels. Extensive experiments on CIFAR-10-C, ImageNet-C, and ImageNet-R verify the effectiveness of our proposed method.
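The idea that high-entropy test samples are unreliable for adaptation can be illustrated with a simple entropy filter. The sketch below covers only the reliability part of the selection idea and uses a hypothetical threshold (40% of the maximum entropy). The paper's actual criterion, including its non-redundancy check and the Fisher regularizer, is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_reliable(batch_logits, max_entropy):
    """Return indices of samples whose prediction entropy is below max_entropy."""
    return [i for i, logits in enumerate(batch_logits)
            if entropy(softmax(logits)) < max_entropy]

batch = [
    [10.0, 0.0, 0.0],  # confident prediction, low entropy
    [0.0, 0.0, 0.0],   # uniform prediction, maximum entropy ln(3)
    [2.0, 1.0, 0.0],   # moderately uncertain prediction
]
threshold = 0.4 * math.log(3)  # hypothetical cut-off
reliable = select_reliable(batch, threshold)
```

Only the samples whose indices appear in `reliable` would be used to compute the entropy-minimization update; the rest are skipped, which also saves their backward passes.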
    Optimal Transport of Classifiers to Fairness. (arXiv:2202.03814v2 [cs.LG] UPDATED)
    In past work on fairness in machine learning, the focus has been on forcing the prediction of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.
    Causal Machine Learning for Healthcare and Precision Medicine. (arXiv:2205.11402v2 [cs.LG] UPDATED)
    Causal machine learning (CML) has experienced increasing popularity in healthcare. Beyond the inherent capabilities of adding domain knowledge into learning systems, CML provides a complete toolset for investigating how a system would react to an intervention (e.g., outcome given a treatment). Quantifying effects of interventions allows actionable decisions to be made whilst maintaining robustness in the presence of confounders. Here, we explore how causal inference can be incorporated into different aspects of clinical decision support (CDS) systems by using recent advances in machine learning. Throughout this paper, we use Alzheimer's disease (AD) to create examples for illustrating how CML can be advantageous in clinical scenarios. Furthermore, we discuss important challenges present in healthcare applications such as processing high-dimensional and unstructured data, generalisation to out-of-distribution samples, and temporal relationships, that despite the great effort from the research community remain to be solved. Finally, we review lines of research within causal representation learning, causal discovery and causal reasoning which offer the potential towards addressing the aforementioned challenges.
    PAC Generalization via Invariant Representations. (arXiv:2205.15196v2 [cs.LG] UPDATED)
    One method for obtaining generalizable solutions to machine learning tasks when presented with diverse training environments is to find invariant representations of the data. These are representations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a finite sample setting, we consider the notion of $\epsilon$-approximate invariance. We study the following question: If a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen SEMs? This larger collection of SEMs is generated through a parameterized family of interventions. Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that holds probabilistically over a family of linear SEMs without faithfulness assumptions. Our results show bounds that do not scale in ambient dimension when intervention sites are restricted to lie in a constant size subset of in-degree bounded nodes. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.
    Independent and Decentralized Learning in Markov Potential Games. (arXiv:2205.14590v2 [cs.LG] UPDATED)
    We propose a multi-agent reinforcement learning dynamics, and analyze its convergence properties in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players can only observe the realized state and their own reward in every stage. Players do not have knowledge of the game model, and cannot coordinate with each other. In each stage of our learning dynamics, players update their estimate of a perturbed Q-function that evaluates their total contingent payoff based on the realized one-stage reward in an asynchronous manner. Then, players independently update their policies by incorporating a smoothed optimal one-stage deviation strategy based on the estimated Q-function. A key feature of the learning dynamics is that the Q-function estimates are updated at a faster timescale than the policies. We prove that the policies induced by our learning dynamics converge to a stationary Nash equilibrium in Markov potential games with probability 1. Our results build on the theory of two timescale asynchronous stochastic approximation, and new analysis on the monotonicity of potential function along the trajectory of policy updates in Markov potential games.
    A Data-Driven Method for Automated Data Superposition with Applications in Soft Matter Science. (arXiv:2204.09521v2 [physics.data-an] UPDATED)
    The superposition of data sets with internal parametric self-similarity is a longstanding and widespread technique for the analysis of many types of experimental data across the physical sciences. Typically, this superposition is performed manually, or recently by one of a few automated algorithms. However, these methods are often heuristic in nature, are prone to user bias via manual data shifting or parameterization, and lack a native framework for handling uncertainty in both the data and the resulting model of the superposed data. In this work, we develop a data-driven, non-parametric method for superposing experimental data with arbitrary coordinate transformations, which employs Gaussian process regression to learn statistical models that describe the data, and then uses maximum a posteriori estimation to optimally superpose the data sets. This statistical framework is robust to experimental noise, and automatically produces uncertainty estimates for the learned coordinate transformations. Moreover, it is distinguished from black-box machine learning in its interpretability -- specifically, it produces a model that may itself be interrogated to gain insight into the system under study. We demonstrate these salient features of our method through its application to four representative data sets characterizing the mechanics of soft materials. In every case, our method replicates results obtained using other approaches, but with reduced bias and the addition of uncertainty estimates. This method enables a standardized, statistical treatment of self-similar data across many fields, producing interpretable data-driven models that may inform applications such as materials classification, design, and discovery.
    Data-driven Numerical Invariant Synthesis with Automatic Generation of Attributes. (arXiv:2205.14943v2 [cs.PL] UPDATED)
    We propose a data-driven algorithm for numerical invariant synthesis and verification. The algorithm is based on the ICE-DT schema for learning decision trees from samples of positive and negative states and implications corresponding to program transitions. The main issue we address is the discovery of relevant attributes to be used in the learning process of numerical invariants. We define a method for solving this problem guided by the data sample. It is based on the construction of a separator that covers positive states and excludes negative ones, consistent with the implications. The separator is constructed using an abstract domain representation of convex sets. The generalization mechanism of the decision tree learning from the constraints of the separator allows the inference of general invariants, accurate enough for proving the targeted property. We implemented our algorithm and showed its efficiency.
    QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. (arXiv:2106.00797v3 [cs.LG] UPDATED)
    The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end, we propose a novel federated Markov Chain Monte Carlo algorithm, referred to as Quantised Langevin Stochastic Dynamics, which may be seen as an extension to the FL setting of Stochastic Gradient Langevin Dynamics, and which handles the communication bottleneck using gradient compression. To improve performance, we then introduce variance reduction techniques, which lead to two improved versions coined QLSD* and QLSD++. We give both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms. We illustrate their performances using various Bayesian Federated Learning benchmarks.
    L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library. (arXiv:2205.14728v2 [cs.CL] UPDATED)
    Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not have support for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall we present MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and their corresponding MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark datasets and prepare useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.
    ReLSO: A Transformer-based Model for Latent Space Optimization and Generation of Proteins. (arXiv:2201.09948v2 [cs.LG] UPDATED)
    The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labeled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and GFP. We observe a greater sequence optimization efficiency (increase in fitness per optimization step) by ReLSO compared to other approaches, where ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.
    On the Implicit Bias Towards Minimal Depth of Deep Neural Networks. (arXiv:2202.09028v8 [cs.LG] UPDATED)
    Recent results in the literature suggest that the penultimate layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the minimal layer that enjoys neural collapse. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Second, while neural collapse emerges even when generalization should be impossible, we argue that the rate of collapse in the intermediate layers is more sensitive, and is closely intertwined with generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit partially corrupted labels. Remarkably, this bound provides non-trivial estimations of the test performance. Finally, we empirically show that the effective depth of a trained neural network monotonically increases when training with extended portions of random labels.
    Mixture of Virtual-Kernel Experts for Multi-Objective User Profile Modeling. (arXiv:2106.07356v2 [cs.IR] UPDATED)
    In industrial applications like online advertising and recommendation systems, diverse and accurate user profiles can greatly help improve personalization. Deep learning is widely applied to mine expressive tags for users from their historical interactions in the system, e.g., clicks and conversion actions in the advertising chain. The usual approach is to take a certain action as the objective, and introduce multiple independent Two-Tower models to predict the probability of users' actions on tags (known as CTR or CVR prediction). The tags predicted to attract a user with high probability are then used to represent that user's preferences. However, the single-action models cannot learn complementarily or support effective training on data-sparse actions. Besides, limited by the lack of information fusion between the two towers, the model learns insufficiently to represent users' preferences on various tag topics well. This paper introduces a novel multi-task model called Mixture of Virtual-Kernel Experts (MVKE) to learn user preferences on various actions and topics jointly. In MVKE, we propose the concept of a Virtual-Kernel Expert, which focuses on modeling one particular facet of the user's preferences, and all experts learn in coordination. Besides, the gate-based structure used in MVKE builds an information fusion bridge between the two towers, improving the model's capability and maintaining high efficiency. We apply the model in the Tencent Advertising System, where both online and offline evaluations show that our method brings a significant improvement over existing ones and an obvious lift to actual advertising revenue.
    Non-stationary Transformers: Rethinking the Stationarity in Time Series Forecasting. (arXiv:2205.14415v2 [cs.LG] UPDATED)
    Transformers have shown great power in time series forecasting due to their global-range modeling ability. However, their performance can degenerate terribly on non-stationary real-world data in which the joint distribution changes over time. Previous studies primarily adopt stationarization to reduce the non-stationarity of original series for better predictability. But the stationarized series deprived of inherent non-stationarity can be less instructive for real-world bursty events forecasting. This problem, termed over-stationarization in this paper, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models. To tackle the dilemma between series predictability and model capability, we propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention. Concretely, Series Stationarization unifies the statistics of each input and converts the output with restored statistics for better predictability. To address over-stationarization, De-stationary Attention is devised to recover the intrinsic non-stationary information into temporal dependencies by approximating distinguishable attentions learned from unstationarized series. Our Non-stationary Transformers framework consistently boosts mainstream Transformers by a large margin, which reduces 49.43% MSE on Transformer, 47.34% on Informer, and 46.89% on Reformer, making them the state-of-the-art in time series forecasting.
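The Series Stationarization step described above (normalize each input window by its own statistics, then restore those statistics on the output) can be illustrated with a minimal NumPy sketch. This is only an illustration under stated assumptions: the function names, the `eps` constant, and the `toy_forecaster` stand-in are hypothetical and not taken from the paper, and De-stationary Attention is not covered.

```python
import numpy as np

def stationarize(x, eps=1e-5):
    """Normalize a window x of shape (length, channels) per channel.

    Returns the normalized window plus the statistics needed to undo it.
    `eps` is an assumed small constant for numerical stability.
    """
    mean = x.mean(axis=0, keepdims=True)
    std = np.sqrt(x.var(axis=0, keepdims=True) + eps)
    return (x - mean) / std, mean, std

def restore(y, mean, std):
    """Put the stored statistics back onto a model output y."""
    return y * std + mean

def toy_forecaster(x_norm, horizon):
    """Hypothetical stand-in for a Transformer: repeat the last value."""
    return np.repeat(x_norm[-1:], horizon, axis=0)

# Synthetic non-stationary input: a random walk with a large offset.
rng = np.random.default_rng(0)
series = 50.0 + 3.0 * np.cumsum(rng.normal(size=(96, 2)), axis=0)

x_norm, mean, std = stationarize(series)
forecast = restore(toy_forecaster(x_norm, horizon=24), mean, std)
```

The forecaster only ever sees windows with roughly zero mean and unit variance, while the returned forecast lives on the original scale of the input window.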
    A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level. (arXiv:2112.15594v4 [cs.LG] UPDATED)
    We demonstrate that a neural network pre-trained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates new questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a new dataset of questions from MIT's largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pre-trained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work is the first to automatically solve university-level mathematics course questions at a human level and the first work to explain and generate university-level mathematics course questions at scale, a milestone for higher education.
    CAIPI in Practice: Towards Explainable Interactive Medical Image Classification. (arXiv:2204.02661v2 [cs.LG] UPDATED)
    Would you trust physicians if they cannot explain their decisions to you? Medical diagnostics using machine learning gained enormously in importance within the last decade. However, without further enhancements many state-of-the-art machine learning methods are not suitable for medical application. The most important reasons are insufficient data set quality and the black-box behavior of machine learning algorithms such as Deep Learning models. Consequently, end-users cannot correct the model's decisions and the corresponding explanations. The latter is crucial for the trustworthiness of machine learning in the medical domain. The research field explainable interactive machine learning searches for methods that address both shortcomings. This paper extends the explainable and interactive CAIPI algorithm and provides an interface to simplify human-in-the-loop approaches for image classification. The interface enables the end-user (1) to investigate and (2) to correct the model's prediction and explanation, and (3) to influence the data set quality. After CAIPI optimization with only a single counterexample per iteration, the model achieves an accuracy of $97.48\%$ on the Medical MNIST and $95.02\%$ on the Fashion MNIST. This accuracy is approximately equal to state-of-the-art Deep Learning optimization procedures. Besides, CAIPI reduces the labeling effort by approximately $80\%$.
    Nonconvex regularization for sparse neural networks. (arXiv:2004.11515v2 [math.OC] UPDATED)
    Convex $\ell_1$ regularization using an infinite dictionary of neurons has been suggested for constructing neural networks with desired approximation guarantees, but can be affected by an arbitrary amount of over-parametrization. This can lead to a loss of sparsity and result in networks with too many active neurons for the given data, in particular if the number of data samples is large. As a remedy, in this paper, a nonconvex regularization method is investigated in the context of shallow ReLU networks: We prove that in contrast to the convex approach, any resulting (locally optimal) network is finite even in the presence of infinite data (i.e., if the data distribution is known and the limiting case of infinite samples is considered). Moreover, we show that approximation guarantees and existing bounds on the network size for finite data are maintained.
    Fast Predictive Uncertainty for Classification with Bayesian Deep Networks. (arXiv:2003.01227v4 [cs.LG] UPDATED)
    In Bayesian Deep Learning, distributions over the output of classification neural networks are often approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet.
    Minimax Classification under Concept Drift with Multidimensional Adaptation and Performance Guarantees. (arXiv:2205.15942v1 [stat.ML])
    The statistical characteristics of instance-label pairs often change with time in practical scenarios of supervised classification. Conventional learning techniques adapt to such concept drift accounting for a scalar rate of change by means of a carefully chosen learning rate, forgetting factor, or window size. However, the time changes in common scenarios are multidimensional, i.e., different statistical characteristics often change in a different manner. This paper presents adaptive minimax risk classifiers (AMRCs) that account for multidimensional time changes by means of a multivariate and high-order tracking of the time-varying underlying distribution. In addition, differently from conventional techniques, AMRCs can provide computable tight performance guarantees. Experiments on multiple benchmark datasets show the classification improvement of AMRCs compared to the state-of-the-art and the reliability of the presented performance guarantees.
    PDE-based Group Equivariant Convolutional Neural Networks. (arXiv:2001.09046v6 [cs.LG] UPDATED)
    We present a PDE-based framework that generalizes Group equivariant Convolutional Neural Networks (G-CNNs). In this framework, a network layer is seen as a set of PDE-solvers where geometrically meaningful PDE-coefficients become the layer's trainable weights. Formulating our PDEs on homogeneous spaces allows these networks to be designed with built-in symmetries such as rotation in addition to the standard translation equivariance of CNNs. Having all the desired symmetries included in the design obviates the need to include them by means of costly techniques such as data augmentation. We will discuss our PDE-based G-CNNs (PDE-G-CNNs) in a general homogeneous space setting while also going into the specifics of our primary case of interest: roto-translation equivariance. We solve the PDE of interest by a combination of linear group convolutions and non-linear morphological group convolutions with analytic kernel approximations that we underpin with formal theorems. Our kernel approximations allow for fast GPU-implementation of the PDE-solvers. We release our implementation with this article in the form of the LieTorch extension to PyTorch, available at https://gitlab.com/bsmetsjr/lietorch . Just like a linear convolution, a morphological convolution is specified by a kernel that we train in our PDE-G-CNNs. In PDE-G-CNNs we do not use non-linearities such as max/min-pooling and ReLUs as they are already subsumed by morphological convolutions. We present a set of experiments to demonstrate the strength of the proposed PDE-G-CNNs in increasing the performance of deep learning based imaging applications with far fewer parameters than traditional CNNs.
    An algorithmic solution to the Blotto game using multi-marginal couplings. (arXiv:2202.07318v2 [cs.GT] UPDATED)
    We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an $\varepsilon$-optimal solution in time $\tilde{\mathcal{O}}(n^2 + \varepsilon^{-4})$, independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an $\varepsilon$-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.
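The abstract above builds on Sinkhorn iterations for matrix and tensor scaling. As background only, here is a minimal sketch of the classical two-marginal (matrix) case: given a strictly positive matrix `K` and target marginals `r` and `c`, alternately rescale rows and columns until the scaled matrix has the requested row and column sums. This is the generic textbook iteration, not the Blotto algorithm itself; the function name, stopping rule, and example values are assumptions for illustration.

```python
import numpy as np

def sinkhorn_scale(K, r, c, n_iter=500, tol=1e-10):
    """Scale a positive matrix K to row sums r and column sums c.

    Returns P = diag(u) @ K @ diag(v). Assumes K > 0 entrywise and
    r.sum() == c.sum(); both are assumptions of this sketch.
    """
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)        # match the row sums
        v = c / (K.T @ u)      # match the column sums
        P = u[:, None] * K * v[None, :]
        # Column sums are exact after the v-update, so check the rows.
        if np.abs(P.sum(axis=1) - r).max() < tol:
            break
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
K = rng.uniform(0.5, 1.5, size=(3, 4))
r = np.array([0.2, 0.3, 0.5])
c = np.full(4, 0.25)
P = sinkhorn_scale(K, r, c)
```

The multi-marginal setting mentioned in the title replaces the matrix with a tensor and cycles through more than two scaling vectors; that extension is not shown here.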
    Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization. (arXiv:2006.08505v5 [cs.AI] UPDATED)
    A fascinating aspect of nature lies in its ability to produce a large and diverse collection of organisms that are all high-performing in their niche. By contrast, most AI algorithms focus on finding a single efficient solution to a given problem. Aiming for diversity in addition to performance is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QDPG, which combines the strength of Policy Gradient algorithms and Quality Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample-efficient manner. Specifically, QDPG selects neural controllers from a MAP-Elites grid and uses two gradient-based mutation operators to improve both quality and diversity. Our results demonstrate that QDPG is significantly more sample-efficient than its evolutionary competitors.
    The Effect of Diversity in Meta-Learning. (arXiv:2201.11775v2 [cs.LG] UPDATED)
    Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that task distribution plays a vital role in the model's performance. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; we study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms. For this experiment, we train on multiple datasets, and with three broad classes of meta-learning models - Metric-based (e.g., Protonet, Matching Networks), Optimization-based (e.g., MAML, Reptile, and MetaOptNet), and Bayesian meta-learning models (e.g., CNAPs). Our experiments demonstrate that the effect of task diversity on all these algorithms follows a similar trend, and task diversity does not seem to offer any benefits to the learning of the model. Furthermore, we also demonstrate that even a handful of tasks, repeated over multiple batches, would be sufficient to achieve a performance similar to uniform sampling, which calls into question the need for additional tasks to create better models.
    Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms. (arXiv:2009.09538v2 [cs.LG] UPDATED)
    EXP-based algorithms are often used for exploration in non-stochastic bandit problems assuming rewards are bounded. We propose a new algorithm, namely EXP4.P, by modifying EXP4 and establish its upper bound of regret in both bounded and unbounded sub-Gaussian contextual bandit settings. The unbounded reward result also holds for a revised version of EXP3.P. Moreover, we provide a lower bound on regret that suggests no sublinear regret can be achieved given short time horizon. All the analyses do not require bounded rewards compared to classical ones. We also extend EXP4.P from contextual bandit to reinforcement learning to incentivize exploration by multiple agents given black-box rewards. The resulting algorithm has been tested on hard-to-explore games and it shows an improvement on exploration compared to state-of-the-art.
    Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity. (arXiv:2201.09457v6 [cs.LG] UPDATED)
    We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report several properties that seem to be new in the literature of policy gradient methods: (1) HPMD exhibits global linear convergence of the value optimality gap, and local superlinear convergence of the policy to the set of optimal policies with order $\gamma^{-2}$. The superlinear convergence of the policy takes effect after no more than $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence of the policy, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state. No regularization is added to the optimization objective and hence the second observation arises solely as an algorithmic property of the homotopic policy gradient method; (3) The last-iterate convergence of HPMD holds for a much broader class of decomposable distance-generating functions, including the $p$-th power of $\ell_p$-norm and the negative Tsallis entropy. As a byproduct of the analysis, we also discover the finite-time exact convergence of HPMD with these divergences, and show that HPMD continues converging to the limiting policy even if the current policy is already optimal; (4) For the stochastic HPMD method, we further demonstrate that a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$ sample complexity for small optimality gap $\epsilon$ holds with high probability, when assuming a generative model for policy evaluation.
    Crystal structure prediction with machine learning-based element substitution. (arXiv:2201.11188v2 [cond-mat.mtrl-sci] UPDATED)
    The prediction of energetically stable crystal structures formed by a given chemical composition is a central problem in solid-state physics. In principle, the crystalline state of assembled atoms can be determined by optimizing the energy surface, which in turn can be evaluated using first-principles calculations. However, performing the iterative gradient descent on the potential energy surface using first-principles calculations is prohibitively expensive for complex systems, such as those with many atoms per unit cell. Here, we present a unique methodology for crystal structure prediction (CSP) that relies on a machine learning algorithm called metric learning. It is shown that a binary classifier, trained on a large number of already identified crystal structures, can determine the isomorphism of crystal structures formed by two given chemical compositions with an accuracy of approximately 96.4\%. For a given query composition with an unknown crystal structure, the model is used to automatically select from a crystal structure database a set of template crystals with nearly identical stable structures to which element substitution is to be applied. Apart from the local relaxation calculation of the identified templates, the proposed method does not use ab initio calculations. The potential of this substitution-based CSP is demonstrated for a wide variety of crystal systems.
    Generate, Annotate, and Learn: NLP with Synthetic Text. (arXiv:2106.06168v3 [cs.LG] UPDATED)
    This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called ``generate, annotate, and learn (GAL)'' to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text. GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard.
    A Reduction to Binary Approach for Debiasing Multiclass Datasets. (arXiv:2205.15860v1 [cs.LG])
    We propose a novel reduction-to-binary (R2B) approach that enforces demographic parity for multiclass classification with non-binary sensitive attributes via a reduction to a sequence of binary debiasing tasks. We prove that R2B satisfies optimality and bias guarantees and demonstrate empirically that it can lead to an improvement over two baselines: (1) treating multiclass problems as multi-label by debiasing labels independently and (2) transforming the features instead of the labels. Surprisingly, we also demonstrate that independent label debiasing yields competitive results in most (but not all) settings. We validate these conclusions on synthetic and real-world datasets from social science, computer vision, and healthcare.
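Demographic parity for multiclass predictions with a non-binary sensitive attribute asks that every class be predicted at the same rate in every group. The sketch below measures how far a set of predictions is from that condition, using the largest difference in per-class prediction rates between any two groups. This particular gap statistic and the function name are assumptions chosen for illustration; the paper's own bias metric may be defined differently, and the R2B debiasing procedure itself is not implemented here.

```python
import numpy as np

def demographic_parity_gap(y_pred, groups, n_classes):
    """Largest per-class gap in prediction rates across groups.

    y_pred: integer class predictions, shape (n,)
    groups: sensitive-attribute value per sample, shape (n,)
    Returns 0.0 when every class is predicted equally often in every group.
    """
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    rates = np.array([
        np.bincount(y_pred[groups == g], minlength=n_classes)
        / np.sum(groups == g)
        for g in np.unique(groups)
    ])  # shape: (n_groups, n_classes)
    return float((rates.max(axis=0) - rates.min(axis=0)).max())

# Both groups receive each class equally often: no gap.
balanced = demographic_parity_gap([0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1], 3)

# Class 0 is predicted only for group 0: maximal gap.
skewed = demographic_parity_gap([0, 0, 1, 2], [0, 0, 1, 1], 3)
```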
    Exploring Representational Alignment with Human Perception Using Identically Represented Inputs. (arXiv:2111.14726v2 [cs.CV] UPDATED)
    We contribute to the study of the quality of learned representations. In many domains, an important evaluation criterion for safe and trustworthy deep learning is how well the invariances captured by representations of deep neural networks (DNNs) are shared with humans. We identify challenges in measuring these invariances. Prior works used gradient-based methods to generate \textit{identically represented inputs} (IRIs), i.e., inputs which have similar representations (on a given layer) of a neural network. If these IRIs look `similar' to humans then a neural network's learned invariances are said to align with human perception. However, we show that prior studies on the alignment of invariances between DNNs and humans are `biased' by the specific loss function used to generate IRIs. We show how different loss functions can lead to different takeaways about a model's shared invariances with humans. We show that under an \textit{adversarial} IRI generation process all models appear to have very little shared invariance with humans. We conduct an in-depth investigation of how different components of the deep learning pipeline contribute to learning models that have good alignment with human invariances. We find that architectures with residual connections trained using a self-supervised contrastive loss with $\ell_p$ ball adversarial data augmentation tend to learn the most human-like invariances.
    Polynomial-time Sparse Deconvolution. (arXiv:2204.07879v2 [cs.LG] UPDATED)
    How can a probability measure be recovered with sparse support from its generalized moments? This problem, called sparse deconvolution, has been the focus of research in mathematics, theoretical computer science, and neural computing. However, there is no polynomial-time algorithm for the recovery. The best algorithm requires $O\left(\text{dimension}^{\text{poly}(1/\epsilon)}\right)$ for $\epsilon$-accurate recovery. We propose the first poly-time recovery method from carefully designed moments that requires $O\left(\text{dimension}^4\log(1/\epsilon)/\epsilon^2\right)$ computations for an $\epsilon$-accurate recovery. This method relies on learning a planted two-layer neural network with two-dimensional inputs, a finite width, and zero-one activation. For learning such networks, we establish the first poly-time complexity, and demonstrate its application in sparse deconvolution.
    Inducing bias is simpler than you think. (arXiv:2205.15935v1 [cs.LG])
    Machine learning may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. To counter this, some of the model accuracy can be traded off for a secondary objective that helps prevent a specific type of bias. Multiple notions of fairness have been proposed to this end but recent studies show that some fairness criteria often stand in mutual competition. In the present work, we introduce a solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical behaviour of learning models trained in our synthetic framework and find similar unfairness behaviours as those observed on more realistic data. However, we also identify a positive transfer effect between the different subpopulations within the data. This suggests that mixing data with different statistical properties could be helpful, provided the learning model is made aware of this structure. Finally, we analyse the issue of bias mitigation: by reweighing the various terms in the training loss, we indirectly minimise standard unfairness metrics and highlight their incompatibilities. Leveraging the insights on positive transfer, we also propose a theory-informed mitigation strategy, based on the introduction of coupled learning models. By allowing each model to specialise on a different community within the data, we find that multiple fairness criteria and high accuracy can be achieved simultaneously.
    Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning. (arXiv:2202.06083v2 [cs.LG] UPDATED)
    In recent centralized nonconvex distributed learning and federated learning, local methods are one of the promising approaches to reduce communication time. However, existing work has mainly focused on studying first-order optimality guarantees. On the other hand, algorithms with second-order optimality guarantees, i.e., algorithms escaping saddle points, have been extensively studied in the non-distributed optimization literature. In this paper, we study a new local algorithm called Bias-Variance Reduced Local Perturbed SGD (BVR-L-PSGD), which combines the existing bias-variance reduced gradient estimator with parameter perturbation to find second-order optimal points in centralized nonconvex distributed optimization. BVR-L-PSGD enjoys second-order optimality with nearly the same communication complexity as the best known one of BVR-L-SGD to find first-order optimality. In particular, the communication complexity is better than that of non-local methods when the heterogeneity of the local datasets is smaller than the smoothness of the local loss. In an extreme case, the communication complexity approaches $\widetilde \Theta(1)$ when the heterogeneity of the local datasets goes to zero. Numerical results validate our theoretical findings.
    Cross-view kernel transfer. (arXiv:1910.05964v2 [cs.LG] UPDATED)
    We consider the kernel completion problem in the presence of multiple views in the data. In this context the data samples can be fully missing in some views, creating missing columns and rows in the kernel matrices that are calculated individually for each view. We propose to solve the problem of completing the kernel matrices with the Cross-View Kernel Transfer (CVKT) procedure, in which the features of the other views are transformed to represent the view under consideration. The transformations are learned with kernel alignment to the known part of the kernel matrix, allowing for finding generalizable structures in the kernel matrix under completion. Its missing values can then be predicted with the data available in other views. We illustrate the benefits of our approach with simulated data, a multivariate digits dataset and a multi-view dataset on gesture classification, as well as with real biological datasets from studies of pattern formation in early \textit{Drosophila melanogaster} embryogenesis.
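Kernel alignment, which the abstract above uses as a learning criterion on the known part of the kernel matrix, is commonly defined as the normalized Frobenius inner product of two kernel matrices. The sketch below computes that standard quantity restricted to the rows and columns of observed samples. It is a sketch of the generic (uncentered) alignment score only; the function name and the `observed` argument are hypothetical, and the exact objective optimized by CVKT may differ.

```python
import numpy as np

def kernel_alignment(K1, K2, observed=None):
    """Uncentered alignment <K1, K2>_F / (||K1||_F * ||K2||_F).

    observed: optional indices of samples whose rows and columns are
    known in both matrices; alignment is computed on that block only.
    """
    if observed is not None:
        idx = np.ix_(observed, observed)
        K1, K2 = K1[idx], K2[idx]
    inner = np.sum(K1 * K2)
    return float(inner / (np.linalg.norm(K1) * np.linalg.norm(K2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T                      # linear kernel on 6 samples

# Pretend samples 4 and 5 are missing in one view: zero out their entries.
K_partial = K.copy()
K_partial[4:, :] = 0.0
K_partial[:, 4:] = 0.0

full_score = kernel_alignment(K, K_partial)
known_score = kernel_alignment(K, K_partial, observed=[0, 1, 2, 3])
```

On the observed block the two matrices coincide, so `known_score` is 1 up to rounding, while `full_score` is lower because the missing rows and columns count as disagreement.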
    Surface Analysis with Vision Transformers. (arXiv:2205.15836v1 [cs.CV])
    The extension of convolutional neural networks (CNNs) to non-Euclidean geometries has led to multiple frameworks for studying manifolds. Many of those methods have shown design limitations resulting in poor modelling of long-range associations, as the generalisation of convolutions to irregular surfaces is non-trivial. Recent state-of-the-art performance of Vision Transformers (ViTs) demonstrates that a general-purpose architecture, which implements self-attention, could replace the local feature learning operations of CNNs. Motivated by the success of attention-modelling in computer vision, we extend ViTs to surfaces by reformulating the task of surface learning as a sequence-to-sequence problem and propose a patching mechanism for surface meshes. We validate the performance of the proposed Surface Vision Transformer (SiT) on two brain age prediction tasks in the developing Human Connectome Project (dHCP) dataset and investigate the impact of pre-training on model performance. Experiments show that the SiT outperforms many surface CNNs, while indicating some evidence of general transformation invariance. Code available at https://github.com/metrics-lab/surface-vision-transformers
    One Policy is Enough: Parallel Exploration with a Single Policy is Minimax Optimal for Reward-Free Reinforcement Learning. (arXiv:2205.15891v1 [cs.LG])
    While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL for linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is minimax optimal up to logarithmic factors in the reward-free setting for both linear MDPs and two-player zero-sum MGs. From a practical perspective, our paper shows that a single policy is sufficient and provably optimal for incorporating parallelism during the exploration phase.
    TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving. (arXiv:2205.15997v1 [cs.CV])
    How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.
    Forward and inverse reinforcement learning sharing network weights and hyperparameters. (arXiv:2008.07284v2 [cs.LG] UPDATED)
    This paper proposes model-free imitation learning named Entropy-Regularized Imitation Learning (ERIL) that minimizes the reverse Kullback-Leibler (KL) divergence. ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. An inverse RL step computes the log-ratio between two distributions by evaluating two binary discriminators. The first discriminator distinguishes the state generated by the forward RL step from the expert's state. The second discriminator, which is structured by the theory of entropy regularization, distinguishes the state-action-next-state tuples generated by the learner from the expert ones. One notable feature is that the second discriminator shares hyperparameters with the forward RL, which can be used to control the discriminator's ability. A forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy. Our experimental results on MuJoCo-simulated environments and vision-based reaching tasks with a robotic arm show that ERIL is more sample-efficient than the baseline methods. We apply the method to human behaviors that perform a pole-balancing task and describe how the estimated reward functions show how every subject achieves her goal.
    CoRe: Color Regression for Multicolor Fashion Garments. (arXiv:2010.02849v2 [cs.CV] UPDATED)
    Developing deep networks that analyze fashion garments has many real-world applications. Among all fashion attributes, color is one of the most important yet challenging to detect. Existing approaches are classification-based and thus cannot go beyond the list of discrete predefined color names. In this paper, we handle color detection as a regression problem to predict the exact RGB values. That's why in addition to a first color classifier, we include a second regression stage for refinement in our newly proposed architecture. This second step combines two attention models: the first depends on the type of clothing, the second depends on the color previously detected by the classifier. Our final prediction is the weighted spatial pooling over the image pixels' RGB values, where the illumination has been corrected. This architecture is modular and easily expanded to detect the RGBs of all colors in a multicolor garment. In our experiments, we show the benefits of each component of our architecture.
    AdaTask: Adaptive Multitask Online Learning. (arXiv:2205.15802v1 [cs.LG])
    We introduce and analyze AdaTask, a multitask online learning algorithm that adapts to the unknown structure of the tasks. When the $N$ tasks are stochastically activated, we show that the regret of AdaTask is better, by a factor that can be as large as $\sqrt{N}$, than the regret achieved by running $N$ independent algorithms, one for each task. AdaTask can be seen as a comparator-adaptive version of Follow-the-Regularized-Leader with a Mahalanobis norm potential. Through a variational formulation of this potential, our analysis reveals how AdaTask jointly learns the tasks and their structure. Experiments supporting our findings are presented.
    Neural Topic Model via Optimal Transport. (arXiv:2008.13537v3 [cs.IR] UPDATED)
    Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have obtained increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, they often degrade their performance severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distributions. Importantly, the cost matrix of the OT distance models the weights between topics and words, which is constructed by the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms the state-of-the-art NTMs on discovering more coherent and diverse topics and deriving better document representations for both regular and short texts.
    Thompson Sampling for Bandits with Clustered Arms. (arXiv:2109.01656v2 [cs.LG] UPDATED)
    We propose algorithms based on a multi-level Thompson sampling scheme, for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost compared to using standard Thompson sampling. In the case of the stochastic multi-armed bandit we give upper bounds on the expected cumulative regret showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
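    The multi-level idea can be illustrated with a minimal two-level sketch for Bernoulli rewards: sample a cluster from cluster-level Beta posteriors, then sample an arm inside that cluster from arm-level Beta posteriors. This is an illustrative assumption, not the paper's exact algorithm; the function name `clustered_thompson`, the Beta(1, 1) priors, and the toy arm means are all hypothetical.

```python
import random

def clustered_thompson(clusters, horizon, seed=0):
    """Two-level Thompson sampling on Bernoulli arms grouped into clusters.

    clusters: list of lists of true success probabilities (used only to simulate rewards).
    Returns the number of pulls per arm, with the same nesting as `clusters`.
    """
    rng = random.Random(seed)
    # Beta(1, 1) priors stored as [alpha, beta] (an assumption of this sketch)
    cluster_post = [[1, 1] for _ in clusters]
    arm_post = [[[1, 1] for _ in arms] for arms in clusters]
    pulls = [[0] * len(arms) for arms in clusters]

    for _ in range(horizon):
        # Level 1: draw one posterior sample per cluster and pick the largest
        c = max(range(len(clusters)),
                key=lambda i: rng.betavariate(*cluster_post[i]))
        # Level 2: draw one posterior sample per arm inside the chosen cluster
        a = max(range(len(clusters[c])),
                key=lambda j: rng.betavariate(*arm_post[c][j]))
        reward = 1 if rng.random() < clusters[c][a] else 0
        # Update both the cluster-level and the arm-level posterior
        for post in (cluster_post[c], arm_post[c][a]):
            post[0] += reward
            post[1] += 1 - reward
        pulls[c][a] += 1
    return pulls

# Toy problem: cluster 1 holds the clearly better arms
pulls = clustered_thompson([[0.1, 0.2], [0.8, 0.9]], horizon=3000)
```

    Each round draws only one sample per cluster plus one per arm in the selected cluster, rather than one per arm overall, which is the kind of computational saving the abstract alludes to.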
    Non-Iterative Recovery from Nonlinear Observations using Generative Models. (arXiv:2205.15749v1 [cs.LG])
    In this paper, we aim to estimate the direction of an underlying signal from its nonlinear observations following the semi-parametric single index model (SIM). Unlike conventional compressed sensing where the signal is assumed to be sparse, we assume that the signal lies in the range of an $L$-Lipschitz continuous generative model with bounded $k$-dimensional inputs. This is mainly motivated by the tremendous success of deep generative models in various real applications. Our reconstruction method is non-iterative (though approximating the projection step may use an iterative procedure) and highly efficient, and it is shown to attain the near-optimal statistical rate of order $\sqrt{(k \log L)/m}$, where $m$ is the number of measurements. We consider two specific instances of the SIM, namely noisy $1$-bit and cubic measurement models, and perform experiments on image datasets to demonstrate the efficacy of our method. In particular, for the noisy $1$-bit measurement model, we show that our non-iterative method significantly outperforms a state-of-the-art iterative method in terms of both accuracy and efficiency.
    EdgeML: Towards Network-Accelerated Federated Learning over Wireless Edge. (arXiv:2111.09410v4 [cs.NI] UPDATED)
    Federated learning (FL) is a distributed machine learning technology for next-generation AI systems that allows a number of workers, i.e., edge devices, to collaboratively learn a shared global model while keeping their data locally to prevent privacy leakage. Enabling FL over wireless multi-hop networks can democratize AI and make it accessible in a cost-effective manner. However, the noisy bandwidth-limited multi-hop wireless connections can lead to delayed and nomadic model updates, which significantly slows down the FL convergence speed. To address such challenges, this paper aims to accelerate FL convergence over wireless edge by optimizing the multi-hop federated networking performance. In particular, the FL convergence optimization problem is formulated as a Markov decision process (MDP). To solve such MDP, multi-agent reinforcement learning (MA-RL) algorithms along with domain-specific action space refining schemes are developed, which online learn the delay-minimum forwarding paths to minimize the model exchange latency between the edge devices (i.e., workers) and the remote server. To validate the proposed solutions, FedEdge is developed and implemented, which is the first experimental framework in the literature for FL over multi-hop wireless edge computing networks. FedEdge allows us to fast prototype, deploy, and evaluate novel FL algorithms along with RL-based system optimization methods in real wireless devices. Moreover, a physical experimental testbed is implemented by customizing the widely adopted Linux wireless routers and ML computing nodes. Finally, our experimentation results on the testbed show that the proposed network-accelerated FL system can practically and significantly improve FL convergence speed, compared to the FL system empowered by the production-grade commercially available wireless networking protocol, BATMAN-Adv.
    Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles. (arXiv:2109.06126v3 [cs.SE] UPDATED)
    Self-driving cars and trucks, autonomous vehicles (AVs), should not be accepted by regulatory bodies and the public until they have much higher confidence in their safety and reliability -- which can most practically and convincingly be achieved by testing. But existing testing methods are inadequate for checking the end-to-end behaviors of AV controllers against complex, real-world corner cases involving interactions with multiple independent agents such as pedestrians and human-driven vehicles. While test-driving AVs on streets and highways fails to capture many rare events, existing simulation-based testing methods mainly focus on simple scenarios and do not scale well for complex driving situations that require sophisticated awareness of the surroundings. To address these limitations, we propose a new fuzz testing technique, called AutoFuzz, which can leverage widely-used AV simulators' API grammars to generate semantically and temporally valid complex driving scenarios (sequences of scenes). To efficiently search for traffic violations-inducing scenarios in a large search space, we propose a constrained neural network (NN) evolutionary search method to optimize AutoFuzz. Evaluation of our prototype on one state-of-the-art learning-based controller, two rule-based controllers, and one industrial-grade controller in five scenarios shows that AutoFuzz efficiently finds hundreds of traffic violations in high-fidelity simulation environments. For each scenario, AutoFuzz can find on average 10-39% more unique traffic violations than the best-performing baseline method. Further, fine-tuning the learning-based controller with the traffic violations found by AutoFuzz successfully reduced the traffic violations found in the new version of the AV controller software.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v1 [cs.LG])
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
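    The reweighting step mentioned above can be sketched with plain importance sampling: samples drawn from the training distribution are weighted by the density ratio of the shifted distribution. The sketch assumes a toy setting chosen purely for illustration (a unit-variance Gaussian whose mean is shifted by `delta`); the function name and the loss are hypothetical and not taken from the paper.

```python
import math
import random

def shifted_loss_estimate(samples, loss, delta):
    """Estimate E_q[loss(X)] for q = N(delta, 1) using samples drawn from p = N(0, 1)."""
    total = 0.0
    for x in samples:
        # Density ratio q(x) / p(x) for a unit-variance Gaussian mean shift
        weight = math.exp(delta * x - 0.5 * delta ** 2)
        total += weight * loss(x)
    return total / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Squared loss under a mean shift of 0.5; the closed form is 1 + delta^2 = 1.25
estimate = shifted_loss_estimate(samples, lambda x: x * x, delta=0.5)
```

    As the abstract notes, estimates of this kind can have large variance: the weights grow exponentially in `delta * x`, so larger shifts make the estimate increasingly unreliable.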
    Accelerated Quality-Diversity for Robotics through Massive Parallelism. (arXiv:2202.01258v2 [cs.NE] UPDATED)
    Quality-Diversity (QD) optimization algorithms are a well-known approach to generate large collections of diverse and high-quality solutions. However, derived from evolutionary computation, QD algorithms are population-based methods which are known to be data-inefficient and require large amounts of computational resources. This makes QD algorithms slow when used in applications where solution evaluations are computationally costly. A common approach to speed up QD algorithms is to evaluate solutions in parallel, for instance by using physical simulators in robotics. Yet, this approach is limited to several dozen parallel evaluations as most physics simulators can only be parallelized more with a greater number of CPUs. With recent advances in simulators that run on accelerators, thousands of evaluations can now be performed in parallel on a single GPU/TPU. In this paper, we present QDax, an accelerated implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We show that QD algorithms are ideal candidates to take advantage of progress in hardware acceleration. We demonstrate that QD algorithms can scale with massive parallelism to be run at interactive timescales without any significant effect on the performance. Results across standard optimization functions and four neuroevolution benchmark environments show that experiment runtimes are reduced by two orders of magnitude, turning days of computation into minutes. More surprisingly, we observe that reducing the number of generations by two orders of magnitude, and thus having significantly shorter lineage, does not impact the performance of QD algorithms. These results show that QD can now benefit from hardware acceleration, which contributed significantly to the bloom of deep learning.
    coVariance Neural Networks. (arXiv:2205.15856v1 [cs.LG])
    Graph neural networks (GNN) are an effective framework that exploits inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we propose a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.
    Deep Visual Navigation under Partial Observability. (arXiv:2109.07752v3 [cs.RO] UPDATED)
    How can a robot navigate successfully in rich and diverse environments, indoors or outdoors, along office corridors or trails on the grassland, on the flat ground or the staircase? To this end, this work aims to address three challenges: (i) complex visual observations, (ii) partial observability of local visual sensing, and (iii) multimodal robot behaviors conditioned on both the local environment and the global navigation objective. We propose to train a neural network (NN) controller for local navigation via imitation learning. To tackle complex visual observations, we extract multi-scale spatial representations through CNNs. To tackle partial observability, we aggregate multi-scale spatial information over time and encode it in LSTMs. To learn multimodal behaviors, we use a separate memory module for each behavior mode. Importantly, we integrate the multiple neural network modules into a unified controller that achieves robust performance for visual navigation in complex, partially observable environments. We implemented the controller on the quadrupedal Spot robot and evaluated it on three challenging tasks: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. The experiments show that the proposed NN architecture significantly improves navigation performance.
    Rethinking Learning Dynamics in RL using Adversarial Networks. (arXiv:2201.11783v2 [cs.LG] UPDATED)
    We present a learning mechanism for reinforcement learning of closely related skills parameterized via a skill embedding space. Our approach is grounded on the intuition that nothing makes you learn better than a coevolving adversary. The main contribution of our work is to formulate an adversarial training regime for reinforcement learning with the help of entropy-regularized policy gradient formulation. We also adapt existing measures of causal attribution to draw insights from the skills learned. This would facilitate easier re-purposing of skills for adaptation to different environments and tasks. Our experiments demonstrate that the adversarial process leads to a better exploration of multiple solutions and understanding the minimum number of different skills necessary to solve a given set of tasks.
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v1 [stat.ML])
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained, should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of x, by providing an exact characterization of the functions $f$ to which impossibility applies.
    Knowledge Graph -- Deep Learning: A Case Study in Question Answering in Aviation Safety Domain. (arXiv:2205.15952v1 [cs.CL])
    In the commercial aviation domain, there are a large number of documents, such as accident reports (NTSB, ASRS) and regulatory directives (ADs). There is a need for a system to access these diverse repositories efficiently in order to service needs in the aviation industry, like maintenance, compliance, and safety. In this paper, we propose a Knowledge Graph (KG) guided Deep Learning (DL) based Question Answering (QA) system for aviation safety. We construct a Knowledge Graph from Aircraft Accident reports and contribute this resource to the community of researchers. The efficacy of this resource is tested and proved by the aforesaid QA system. Natural Language Queries constructed from the documents mentioned above are converted into SPARQL (the interface language of the RDF graph database) queries and answered. On the DL side, we have two different QA models: (i) BERT QA which is a pipeline of Passage Retrieval (Sentence-BERT based) and Question Answering (BERT based), and (ii) the recently released GPT-3. We evaluate our system on a set of queries created from the accident reports. Our combined QA system achieves a 9.3% increase in accuracy over GPT-3 and a 40.3% increase over BERT QA. Thus, we infer that KG-DL performs better than either singly.
    Consistent Relative Confidence and Label-Free Model Selection for Convolutional Neural Networks. (arXiv:2108.11845v9 [cs.CV] UPDATED)
    In this paper, we are concerned with image classification with deep convolutional neural networks (CNNs). We focus on the following question: given a set of candidate CNN models, how to select the right one with the best generalization property for the current task? Current model selection methods all require access to a batch of labeled data for computing a pre-specified performance metric, such as the cross-entropy loss, the classification error rate and the negative log-likelihood. In many practical cases, labels are not available in time as labeling itself is a time-consuming and expensive task. To this end, we propose an approach to CNN model selection using only unlabeled data. We develop this method based on a principle termed consistent relative confidence. Experimental results on benchmark datasets demonstrate the effectiveness and efficiency of our method.
    Semi-Supervised Cross-Silo Advertising with Partial Knowledge Transfer. (arXiv:2205.15987v1 [cs.LG])
    As an emerging secure learning paradigm in leveraging cross-agency private data, vertical federated learning (VFL) is expected to improve advertising models by enabling the joint learning of complementary user attributes privately owned by the advertiser and the publisher. However, there are two key challenges in applying it to advertising systems: a) the limited scale of labeled overlapping samples, and b) the high cost of real-time cross-agency serving. In this paper, we propose a semi-supervised split distillation framework VFed-SSD to alleviate the two limitations. We identify that: i) there are massive unlabeled overlapped data available in advertising systems, and ii) we can keep a balance between model performance and inference cost by decomposing the federated model. Specifically, we develop a self-supervised task Matched Pair Detection (MPD) to exploit the vertically partitioned unlabeled data and propose the Split Knowledge Distillation (SplitKD) schema to avoid cross-agency serving. Empirical studies on three industrial datasets exhibit the effectiveness of our methods, with the median AUC over all datasets improved by 0.86% and 2.6% in the local deployment mode and the federated deployment mode respectively. Overall, our framework provides an efficient federation-enhanced solution for real-time display advertising with minimal deploying cost and significant performance lift.
    Online Meta-Learning in Adversarial Multi-Armed Bandits. (arXiv:2205.15921v1 [cs.LG])
    We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v4 [cs.LG] UPDATED)
    Black-box machine learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy-to-understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, false positive rate of out-of-distribution detection, and so on. We will include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.
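    As a rough illustration of how such guarantees are obtained without retraining, the following is a minimal sketch of split conformal prediction for regression, using absolute residuals on a held-out calibration set as the conformity score. It is not code from the tutorial: the helper names, the stand-in model `predict`, and the synthetic data are assumptions made for this example, and it uses only the standard library rather than PyTorch.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    # The ceil((n + 1)(1 - alpha))-th smallest score gives marginal coverage >= 1 - alpha
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def prediction_interval(predict, x, qhat):
    """Symmetric interval around the black-box prediction."""
    y_hat = predict(x)
    return (y_hat - qhat, y_hat + qhat)

def make_data(n, rng):
    """Synthetic data: y = 2x + standard Gaussian noise (illustrative only)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

def predict(x):
    """Hypothetical stand-in for any already-trained model."""
    return 2.0 * x

rng = random.Random(0)
calibration = make_data(500, rng)
test_points = make_data(2000, rng)

# Conformity score: absolute residual on held-out calibration data
scores = [abs(y - predict(x)) for x, y in calibration]
qhat = conformal_quantile(scores, alpha=0.1)

# Empirical coverage on fresh data should be close to the 90% target
covered = 0
for x, y in test_points:
    lo, hi = prediction_interval(predict, x, qhat)
    covered += lo <= y <= hi
coverage = covered / len(test_points)
```

    This simple score yields intervals of constant width; the adaptivity to input difficulty described in the abstract requires a different score, for example residuals normalized by an estimate of local uncertainty.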
    Variable importance without impossible data. (arXiv:2205.15750v1 [cs.LG])
    The most popular methods for measuring importance of the variables in a black box prediction algorithm make use of synthetic inputs that combine predictor variables from multiple subjects. These inputs can be unlikely, physically impossible, or even logically impossible. As a result, the predictions for such cases can be based on data very unlike any the black box was trained on. We think that users cannot trust an explanation of the decision of a prediction algorithm when the explanation uses such values. Instead we advocate a method called Cohort Shapley that is grounded in economic game theory and unlike most other game theoretic methods, it uses only actually observed data to quantify variable importance. Cohort Shapley works by narrowing the cohort of subjects judged to be similar to a target subject on one or more features. A feature is important if using it to narrow the cohort makes a large difference to the cohort mean. We illustrate it on an algorithmic fairness problem where it is essential to attribute importance to protected variables that the model was not trained on. For every subject and every predictor variable, we can compute the importance of that predictor to the subject's predicted response or to their actual response. These values can be aggregated, for example over all Black subjects, and we propose a Bayesian bootstrap to quantify uncertainty in both individual and aggregate Shapley values.
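    The cohort-narrowing idea can be sketched as an exact Shapley computation in which the value of a feature subset is the mean response over the observed subjects that match the target on those features. The sketch assumes exact equality as the similarity rule and a tiny hand-made dataset; both choices, and the function names, are illustrative assumptions rather than the paper's implementation.

```python
from itertools import combinations
from math import factorial

def cohort_mean(X, y, target, subset):
    """Mean response over observed subjects matching the target on every feature in `subset`."""
    cohort = [yi for xi, yi in zip(X, y)
              if all(xi[j] == X[target][j] for j in subset)]
    # The target always matches itself, so the cohort is never empty
    return sum(cohort) / len(cohort)

def cohort_shapley(X, y, target):
    """Exact Shapley values of each feature for one target subject."""
    d = len(X[0])
    phi = [0.0] * d
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            # Standard Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                gain = (cohort_mean(X, y, target, S + (j,))
                        - cohort_mean(X, y, target, S))
                phi[j] += weight * gain
    return phi

# Toy data with y = 2 * x0 + x1; explain the subject with features (1, 1)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 2, 3]
phi = cohort_shapley(X, y, target=3)
```

    Only rows that actually occur in `X` enter any cohort, so no synthetic feature combinations are ever evaluated. The values sum to the target's own cohort mean minus the overall mean (here 3 - 1.5), and the exact enumeration is exponential in the number of features, so it suits only small feature sets.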
    Kymatio: Scattering Transforms in Python. (arXiv:1812.11214v3 [cs.LG] UPDATED)
    The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed up over CPU implementations. The package also has a small memory footprint, resulting in efficient memory usage. The source code, documentation, and examples are available under a BSD license at https://www.kymat.io/
    Hide and Seek: on the Stealthiness of Attacks against Deep Learning Systems. (arXiv:2205.15944v1 [cs.CR])
    With the growing popularity of artificial intelligence and machine learning, a wide spectrum of attacks against deep learning models have been proposed in the literature. Both the evasion attacks and the poisoning attacks attempt to utilize adversarially altered samples to fool the victim model to misclassify the adversarial sample. While such attacks claim to be or are expected to be stealthy, i.e., imperceptible to human eyes, such claims are rarely evaluated. In this paper, we present the first large-scale study on the stealthiness of adversarial samples used in the attacks against deep learning. We have implemented 20 representative adversarial ML attacks on six popular benchmarking datasets. We evaluate the stealthiness of the attack samples using two complementary approaches: (1) a numerical study that adopts 24 metrics for image similarity or quality assessment; and (2) a user study of 3 sets of questionnaires that has collected 20,000+ annotations from 1,000+ responses. Our results show that the majority of the existing attacks introduce nonnegligible perturbations that are not stealthy to human eyes. We further analyze the factors that contribute to attack stealthiness. We further examine the correlation between the numerical analysis and the user studies, and demonstrate that some image quality metrics may provide useful guidance in attack designs, while there is still a significant gap between assessed image quality and visual stealthiness of attacks.
    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. (arXiv:2201.04469v6 [stat.ML] UPDATED)
    We consider fixed-budget best arm identification in two-armed bandit problems. One of the longstanding open questions is a tight lower bound on the probability of misidentifying the best arm and a strategy whose upper bound matches the lower bound when the optimal target allocation ratio of arm draws is unknown. We address this problem when the gap between the expected rewards is small. First, we introduce a distribution-dependent lower bound. Then, we propose the ``RS-AIPW'' strategy, which consists of the random sampling (RS) rule using the estimated optimal target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed strategy is optimal in the sense that the upper bound achieves the lower bound when the budget goes to infinity and the gap goes to zero. In the course of the analysis, we present a novel large deviation bound for martingales.
    Modeling Interactions of Autonomous Vehicles and Pedestrians with Deep Multi-Agent Reinforcement Learning for Collision Avoidance. (arXiv:2109.15266v3 [cs.RO] UPDATED)
    Reliable pedestrian crash avoidance mitigation (PCAM) systems are crucial components of safe autonomous vehicles (AVs). The nature of the vehicle-pedestrian interaction where decisions of one agent directly affect the other agent's optimal behavior, and vice versa, is a challenging yet often neglected aspect of such systems. We address this issue by modeling a Markov decision process (MDP) for a simulated AV-pedestrian interaction at an unmarked crosswalk. The AV's PCAM decision policy is learned through deep reinforcement learning (DRL). Since modeling pedestrians realistically is challenging, we compare two levels of intelligent pedestrian behavior. While the baseline model follows a predefined strategy, our advanced pedestrian model is defined as a second DRL agent. This model captures continuous learning and the uncertainty inherent in human behavior, making the AV-pedestrian interaction a deep multi-agent reinforcement learning (DMARL) problem. We benchmark the developed PCAM systems according to the collision rate and the resulting traffic flow efficiency with a focus on the influence of observation uncertainty on the decision-making of the agents. The results show that the AV is able to completely mitigate collisions under the majority of the investigated conditions and that the DRL pedestrian model learns an intelligent crossing behavior.
    Strategic Classification with Graph Neural Networks. (arXiv:2205.15765v1 [cs.LG])
    Strategic classification studies learning in settings where users can modify their features to obtain favorable predictions. Most current works focus on simple classifiers that trigger independent user responses. Here we examine the implications of learning with more elaborate models that break the independence assumption. Motivated by the idea that applications of strategic classification are often social in nature, we focus on \emph{graph neural networks}, which make use of social relations between users to improve predictions. Using a graph for learning introduces inter-user dependencies in prediction; our key point is that strategic users can exploit these to promote their goals. As we show through analysis and simulation, this can work either against the system -- or for it. Based on this, we propose a differentiable framework for strategically-robust learning of graph-based classifiers. Experiments on several real networked datasets demonstrate the utility of our approach.
    Compressed Hierarchical Representations for Multi-Task Learning and Task Clustering. (arXiv:2205.15882v1 [cs.LG])
    In this paper, we frame homogeneous-feature multi-task learning (MTL) as a hierarchical representation learning problem, with one task-agnostic and multiple task-specific latent representations. Drawing inspiration from the information bottleneck principle and assuming an additive independent noise model between the task-agnostic and task-specific latent representations, we limit the information contained in each task-specific representation. It is shown that our resulting representations yield competitive performance for several MTL benchmarks. Furthermore, for certain setups, we show that the trained parameters of the additive noise model are closely related to the similarity of different tasks. This indicates that our approach yields a task-agnostic representation that is disentangled in the sense that its individual dimensions may be interpretable from a task-specific perspective.
    Smoothed Online Learning is as Easy as Statistical Learning. (arXiv:2202.04690v3 [stat.ML] UPDATED)
    Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.
    Sample-Efficient, Exploration-Based Policy Optimisation for Routing Problems. (arXiv:2205.15656v1 [cs.LG])
    Model-free deep reinforcement learning algorithms have been applied to a range of combinatorial optimisation problems (COPs) (Bello et al., 2016; Kool et al., 2018; Nazari et al., 2018). However, these approaches suffer from two key challenges when applied to combinatorial problems: insufficient exploration and the requirement of many training examples of the search space to achieve reasonable performance. Combinatorial optimisation can be complex, characterised by search spaces with many optima and large spaces to search and learn. Therefore, a new, more sample-efficient method is needed to find good solutions. This paper presents a new reinforcement learning approach that is based on entropy. In addition, we design an off-policy-based reinforcement learning technique that maximises the expected return and improves the sample efficiency to achieve faster learning during training time. We systematically evaluate our approach on a range of route optimisation tasks typically used to evaluate learning-based optimisation, such as the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). In this paper, we show that our model can generalise to various route problems, such as the split-delivery VRP (SDVRP), and compare the performance of our method with that of current state-of-the-art approaches. The empirical results show that the proposed method can improve on state-of-the-art methods in terms of solution quality and computation time and generalise to problems of different sizes.
    Unbalanced CO-Optimal Transport. (arXiv:2205.14923v2 [stat.ML] UPDATED)
    Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers that are omnipresent in real-world data. This prompts us to propose unbalanced COOT for which we provably show its robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness for the challenging tasks of heterogeneous domain adaptation with and without varying proportions of classes and simultaneous alignment of samples and features across single-cell measurements.
    Differentially Private Covariance Revisited. (arXiv:2205.14324v2 [cs.CR] UPDATED)
    In this paper, we present three new error bounds, in terms of the Frobenius norm, for covariance estimation under differential privacy: (1) a worst-case bound of $\tilde{O}(d^{1/4}/\sqrt{n})$, which improves on the standard Gaussian mechanism's $\tilde{O}(d/n)$ bound in the regime $d>\widetilde{\Omega}(n^{2/3})$; (2) a trace-sensitive bound that improves the state of the art by a $\sqrt{d}$-factor; and (3) a tail-sensitive bound that gives a more instance-specific result. The corresponding algorithms are also simple and efficient. Experimental results show that they offer significant improvements over prior work.
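    For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the classical Gaussian mechanism applied to a second-moment matrix, not the paper's new algorithms. The function name, the row-norm clipping to 1, the replace-one neighbouring relation, and the conservative sensitivity bound of 2/n are all assumptions made for this sketch; the noise calibration is the standard one for epsilon below 1.

```python
import numpy as np


def gaussian_mechanism_covariance(x, epsilon, delta, rng):
    """Noisy second-moment matrix (1/n) * X^T X via the classical Gaussian mechanism.

    Hypothetical helper for illustration only. Assumes 0 < epsilon < 1 and
    replace-one neighbouring datasets.
    """
    n, d = x.shape
    # Clip every row to Euclidean norm at most 1 so the sensitivity is bounded.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.maximum(norms, 1.0)
    cov = x.T @ x / n
    # Replacing one clipped row changes cov by at most 2/n in Frobenius norm
    # (a conservative bound used here for simplicity).
    sensitivity = 2.0 / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Draw noise for the upper triangle and mirror it so the output stays symmetric.
    noise = rng.normal(0.0, sigma, size=(d, d))
    sym_noise = np.triu(noise) + np.triu(noise, 1).T
    return cov + sym_noise


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
private_cov = gaussian_mechanism_covariance(data, epsilon=0.5, delta=1e-5, rng=rng)
```

    The noise matrix has on the order of d^2 entries, each with standard deviation proportional to 1/n, so its Frobenius norm grows roughly like d/n up to logarithmic and privacy-parameter factors, which is consistent with the baseline rate quoted in the abstract.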
    Mixture GAN For Modulation Classification Resiliency Against Adversarial Attacks. (arXiv:2205.15743v1 [cs.LG])
    Automatic modulation classification (AMC) using the Deep Neural Network (DNN) approach outperforms the traditional classification techniques, even in the presence of challenging wireless channel environments. However, adversarial attacks cause a loss of accuracy for the DNN-based AMC by injecting a well-designed perturbation into the wireless channels. In this paper, we propose a novel generative adversarial network (GAN)-based countermeasure approach to safeguard the DNN-based AMC systems against adversarial attack examples. The GAN-based approach aims to eliminate the adversarial attack examples before they are fed to the DNN-based classifier. Specifically, we have shown the resiliency of our proposed defense GAN against the Fast-Gradient Sign method (FGSM) algorithm as one of the most potent kinds of attack algorithms to craft the perturbed signals. The existing defense-GAN has been designed for image classification and does not work in our case where the above-mentioned communication system is considered. Thus, our proposed countermeasure approach deploys GANs with a mixture of generators to overcome the mode collapse problem that a typical GAN faces in radio signal classification. Simulation results show the effectiveness of our proposed defense GAN, which enhances the accuracy of the DNN-based AMC under adversarial attacks to approximately 81%.
    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. (arXiv:2205.15868v1 [cs.CV])
    Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
    Robust Anytime Learning of Markov Decision Processes. (arXiv:2205.15827v1 [cs.AI])
    Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
    Transformers for Multi-Object Tracking on Point Clouds. (arXiv:2205.15730v1 [cs.CV])
    We present TransMOT, a novel transformer-based end-to-end trainable online tracker and detector for point cloud data. The model utilizes a cross- and a self-attention mechanism and is applicable to lidar data in an automotive context, as well as other data types, such as radar. Both track management and the detection of new tracks are performed by the same transformer decoder module and the tracker state is encoded in feature space. With this approach, we make use of the rich latent space of the detector for tracking rather than relying on low-dimensional bounding boxes. Still, we are able to retain some of the desirable properties of traditional Kalman-filter based approaches, such as an ability to handle sensor input at arbitrary timesteps or to compensate for frame skips. This is possible due to a novel module that transforms the track information from one frame to the next at the feature level and thereby fulfills a similar task as the prediction step of a Kalman filter. Results are presented on the challenging real-world dataset nuScenes, where the proposed model outperforms its Kalman filter-based tracking baseline.
    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity. (arXiv:2205.15809v1 [stat.ML])
    We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum which requires $N^{2}/4$ hidden neurons. But we also observe numerically that in more traditional settings far fewer than $N^{2}$ neurons are required to reach the minima.
    Variational inference via Wasserstein gradient flows. (arXiv:2205.15902v1 [stat.ML])
    Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior $\pi$, VI aims at producing a simple but effective approximation $\hat \pi$ to $\pi$ for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, VI is still poorly understood and dominated by heuristics. In this work, we propose principled methods for VI, in which $\hat \pi$ is taken to be a Gaussian or a mixture of Gaussians, which rest upon the theory of gradient flows on the Bures-Wasserstein space of Gaussian measures. Akin to MCMC, it comes with strong theoretical guarantees when $\pi$ is log-concave.
    Omni-Granular Ego-Semantic Propagation for Self-Supervised Graph Representation Learning. (arXiv:2205.15746v1 [cs.LG])
    Unsupervised/self-supervised graph representation learning is critical for downstream node- and graph-level classification tasks. The global structure of graphs helps discriminate representations, and existing methods mainly utilize the global structure by imposing additional supervision. However, their global semantics are usually invariant for all nodes/graphs, and they fail to explicitly embed the global semantics to enrich the representations. In this paper, we propose Omni-Granular Ego-Semantic Propagation for Self-Supervised Graph Representation Learning (OEPG). Specifically, we introduce instance-adaptive global-aware ego-semantic descriptors, leveraging the first- and second-order feature differences between each node/graph and hierarchical global clusters of the entire graph dataset. The descriptors can be explicitly integrated into local graph convolution as new neighbor nodes. Besides, we design an omni-granular normalization over all scales and hierarchies of the ego-semantics to assign an attentional weight to each descriptor from an omni-granular perspective. Specialized pretext tasks and a cross-iteration momentum update are further developed for local-global mutual adaptation. In downstream tasks, OEPG consistently achieves the best performance with a 2% to 6% accuracy gain on multiple datasets across scales and domains. Notably, OEPG also generalizes to quantity- and topology-imbalance scenarios.
    Learning to branch with Tree MDPs. (arXiv:2205.11107v2 [cs.LG] UPDATED)
    State-of-the-art Mixed Integer Linear Program (MILP) solvers combine systematic tree search with a plethora of hard-coded heuristics, such as the branching rule. The idea of learning branching rules from data has received increasing attention recently, and promising results have been obtained by learning fast approximations of the strong branching expert. In this work, we instead propose to learn branching rules from scratch via Reinforcement Learning (RL). We revisit the work of Etheve et al. (2020) and propose tree Markov Decision Processes, or tree MDPs, a generalization of temporal MDPs that provides a more suitable framework for learning to branch. We derive a tree policy gradient theorem, which exhibits a better credit assignment compared to its temporal counterpart. We demonstrate through computational experiments that tree MDPs improve the learning convergence, and offer a promising framework for tackling the learning-to-branch problem in MILPs.
    Augmentations: An Insight into their Effectiveness on Convolution Neural Networks. (arXiv:2205.04064v2 [cs.LG] UPDATED)
    Augmentations are the key factor in determining the performance of any neural network as they provide a model with a critical edge in boosting its performance. Their ability to boost a model's robustness depends on two factors, namely the model architecture and the type of augmentations. Augmentations are very specific to a dataset, and not every kind of augmentation necessarily has a positive effect on a model's performance. Hence there is a need to identify augmentations that perform consistently well across a variety of datasets and also remain invariant to the type of architecture, convolutions, and the number of parameters used. This paper evaluates the effect of parameters using 3x3 and depth-wise separable convolutions on different augmentation techniques on MNIST, FMNIST, and CIFAR10 datasets. Statistical evidence shows that techniques such as Cutouts and Random horizontal flip were consistent on both parametrically low and high architectures. Depth-wise separable convolutions outperformed 3x3 convolutions at higher parameters due to their ability to create deeper networks. Augmentations resulted in bridging the accuracy gap between the 3x3 and depth-wise separable convolutions, thus establishing their role in model generalization. At higher number augmentations did not produce a significant change in performance. The synergistic effect of multiple augmentations at higher parameters, with antagonistic effect at lower parameters, was also evaluated. The work proves that a delicate balance between architectural supremacy and augmentations needs to be achieved to enhance a model's performance in any given deep learning task.
    One Loss for Quantization: Deep Hashing with Discrete Wasserstein Distributional Matching. (arXiv:2205.15721v1 [cs.CV])
    Image hashing is a principled approximate nearest neighbor approach to find similar items to a query in a large collection of images. Hashing aims to learn a binary-output function that maps an image to a binary vector. For optimal retrieval performance, producing balanced hash codes with low-quantization error to bridge the gap between the learning stage's continuous relaxation and the inference stage's discrete quantization is important. However, in the existing deep supervised hashing methods, coding balance and low-quantization error are difficult to achieve and involve several losses. We argue that this is because the existing quantization approaches in these methods are heuristically constructed and not effective at achieving these objectives. This paper considers an alternative approach to learning the quantization constraints. The task of learning balanced codes with low quantization error is re-formulated as matching the learned distribution of the continuous codes to a pre-defined discrete, uniform distribution. This is equivalent to minimizing the distance between two distributions. We then propose a computationally efficient distributional distance by leveraging the discrete property of the hash functions. This distributional distance is a valid distance and enjoys lower time and sample complexities. The proposed single-loss quantization objective can be integrated into any existing supervised hashing method to improve code balance and quantization error. Experiments confirm that the proposed approach substantially improves the performance of several representative hashing methods.
    Lessons Learned from Data-Driven Building Control Experiments: Contrasting Gaussian Process-based MPC, Bilevel DeePC, and Deep Reinforcement Learning. (arXiv:2205.15703v1 [eess.SY])
    This manuscript offers the perspective of experimentalists on a number of modern data-driven techniques: model predictive control relying on Gaussian processes, adaptive data-driven control based on behavioral theory, and deep reinforcement learning. These techniques are compared in terms of data requirements, ease of use, computational burden, and robustness in the context of real-world applications. Our remarks and observations stem from a number of experimental investigations carried out in the field of building control in diverse environments, from lecture halls and apartment spaces to a hospital surgery center. The final goal is to support others in identifying what technique is best suited to tackle their own problems.
    Template based Graph Neural Network with Optimal Transport Distances. (arXiv:2205.15733v1 [cs.LG])
    Current Graph Neural Network (GNN) architectures generally rely on two important components: node features embedding through message passing, and aggregation with a specialized form of pooling. The structural (or topological) information is implicitly taken into account in these two steps. We propose in this work a novel point of view, which places distances to some learnable graph templates at the core of the graph representation. This distance embedding is constructed thanks to an optimal transport distance: the Fused Gromov-Wasserstein (FGW) distance, which encodes simultaneously feature and structure dissimilarities by solving a soft graph-matching problem. We postulate that the vector of FGW distances to a set of template graphs has a strong discriminative power, which is then fed to a non-linear classifier for final predictions. Distance embedding can be seen as a new layer, and can leverage existing message passing techniques to promote sensible feature representations. Interestingly enough, in our work the optimal set of template graphs is also learnt in an end-to-end fashion by differentiating through this layer. After describing the corresponding learning procedure, we empirically validate our claim on several synthetic and real life graph classification datasets, where our method is competitive with or surpasses kernel and GNN state-of-the-art approaches. We complete our experiments by an ablation study and a sensitivity analysis to parameters.
    Adversarial synthesis based data-augmentation for code-switched spoken language identification. (arXiv:2205.15747v1 [eess.AS])
    Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR) that is used to classify the language(s) in an audio segment. Automatic LID plays a useful role in multilingual countries. In various countries, identifying a language becomes hard due to the multilingual scenario in which two or more languages are mixed together during conversation. This speech phenomenon is called code-mixing or code-switching. It occurs not only in India but also in many other Asian countries. Such code-mixed data is hard to find, which further reduces the capabilities of spoken LID. Due to the lack of availability of this code-mixed data, it becomes a minority class in the LID task. Hence, this work primarily addresses this problem using data augmentation as a solution on the minority code-switched class. This study focuses on Indic language code-mixed with English. Spoken LID is performed on Hindi code-mixed with English. This research proposes a Generative Adversarial Network (GAN) based data augmentation technique performed using Mel spectrograms for audio data. GANs have already been proven to be accurate in representing the real data distribution in the image domain. The proposed research exploits these capabilities of GANs in speech domains such as speech classification, automatic speech recognition, etc. GANs are trained to generate Mel spectrograms of the minority code-mixed class, which are then used to augment data for the classifier. Utilizing GANs gives an overall improvement in Unweighted Average Recall of 3.5% compared to a Convolutional Recurrent Neural Network (CRNN) classifier used as the baseline reference.
    Implicitly Regularized RL with Implicit Q-Values. (arXiv:2108.07041v2 [cs.LG] UPDATED)
    The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy w.r.t. $Q$. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks, with small numbers of actions, as the softmax cannot be computed exactly otherwise. In particular, the use of function approximation to deal with continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the $Q$-function implicitly, as the sum of a log-policy and of a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show its equivalence to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and that enjoys beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.
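    The implicit parametrization described in this abstract can be checked numerically on a small discrete example. The sketch below is only an illustration, not the authors' algorithm: the temperature `tau`, the function names, and the random tabular policy and values are assumptions introduced here. It builds Q as a scaled log-policy plus a state value and confirms that the softmax of Q recovers the policy exactly, with the value equal to the scaled log-sum-exp of Q.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def implicit_q(log_pi, v, tau=1.0):
    """Q(s, a) = tau * log pi(a|s) + V(s); tau is an assumed temperature."""
    return tau * log_pi + v[:, None]


rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))               # 3 states, 4 actions (toy sizes)
log_pi = logits - logsumexp(logits)[:, None]   # normalised log-policy
v = rng.normal(size=3)                         # arbitrary state values
tau = 0.5

q = implicit_q(log_pi, v, tau)
# Softmax of Q / tau gives back the policy, by construction.
pi_from_q = np.exp(q / tau - logsumexp(q / tau)[:, None])
# tau * logsumexp(Q / tau) gives back the value function.
v_from_q = tau * logsumexp(q / tau)
```

    Because the policy sums to one over actions, the log-sum-exp of Q / tau collapses to V / tau, so the softmax relation holds without ever computing a normalising constant over the action space.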
    Communication-Efficient Distributionally Robust Decentralized Learning. (arXiv:2205.15614v1 [cs.LG])
    Decentralized learning algorithms empower interconnected edge devices to share data and computational resources to collaboratively train a machine learning model without the aid of a central coordinator (e.g. an orchestrating base station). In the case of heterogeneous data distributions at the network devices, collaboration can yield predictors with unsatisfactory performance for a subset of the devices. For this reason, in this work we consider the formulation of a distributionally robust decentralized learning task and we propose a decentralized single loop gradient descent/ascent algorithm (AD-GDA) to solve the underlying minimax optimization problem. We render our algorithm communication efficient by employing a compressed consensus scheme and we provide convergence guarantees for smooth convex and non-convex loss functions. Finally, we corroborate the theoretical findings with empirical evidence of the ability of the proposed algorithm to provide unbiased predictors over a network of collaborating devices with highly heterogeneous data distributions.
    Snapture -- A Novel Neural Architecture for Combined Static and Dynamic Hand Gesture Recognition. (arXiv:2205.15862v1 [cs.CV])
    As robots are expected to get more involved in people's everyday lives, frameworks that enable intuitive user interfaces are in demand. Hand gesture recognition systems provide a natural way of communication and, thus, are an integral part of seamless Human-Robot Interaction (HRI). Recent years have witnessed an immense evolution of computational models powered by deep learning. However, state-of-the-art models fall short in expanding across different gesture domains, such as emblems and co-speech. In this paper, we propose a novel hybrid hand gesture recognition system. Our architecture enables learning both static and dynamic gestures: by capturing a so-called "snapshot" of the gesture performance at its peak, we integrate the hand pose along with the dynamic movement. Moreover, we present a method for analyzing the motion profile of a gesture to uncover its dynamic characteristics, which allows regulating a static channel based on the amount of motion. Our evaluation demonstrates the superiority of our approach on two gesture benchmarks compared to a CNNLSTM baseline. We also provide an analysis on a gesture class basis that unveils the potential of our Snapture architecture for performance improvements. Thanks to its modular implementation, our framework allows the integration of other multimodal data like facial expressions and head tracking, which are important cues in HRI scenarios, into one architecture. Thus, our work contributes both to gesture recognition research and machine learning applications for non-verbal communication with robots.
    A Meta Reinforcement Learning Approach for Predictive Autoscaling in the Cloud. (arXiv:2205.15795v1 [cs.LG])
    Predictive autoscaling (autoscaling with workload forecasting) is an important mechanism that supports autonomous adjustment of computing resources in accordance with fluctuating workload demands in the Cloud. In recent works, Reinforcement Learning (RL) has been introduced as a promising approach to learn the resource management policies to guide the scaling actions under the dynamic and uncertain cloud environment. However, RL methods face several challenges in steering predictive autoscaling, such as lack of accuracy in decision-making, inefficient sampling and significant variability in workload patterns that may cause policies to fail at test time. To this end, we propose an end-to-end predictive meta model-based RL algorithm, aiming to optimally allocate resources to maintain a stable CPU utilization level, which incorporates a specially-designed deep periodic workload prediction model as the input and embeds the Neural Process to guide the learning of the optimal scaling actions over numerous application services in the Cloud. Our algorithm not only ensures the predictability and accuracy of the scaling strategy, but also enables the scaling decisions to adapt to the changing workloads with high sample efficiency. Our method has achieved significant performance improvement compared to the existing algorithms and has been deployed online at Alipay, supporting the autoscaling of applications for the world-leading payment platform.
    Continuous Temporal Graph Networks for Event-Based Graph Data. (arXiv:2205.15924v1 [cs.LG])
    There has been an increasing interest in modeling continuous-time dynamics of temporal graph data. Previous methods encode time-evolving relational information into a low-dimensional representation by specifying discrete layers of neural networks, while real-world dynamic graphs often vary continuously over time. Hence, we propose Continuous Temporal Graph Networks (CTGNs) to capture the continuous dynamics of temporal graph data. We use both the link starting timestamps and link duration as evolving information to model the continuous dynamics of nodes. The key idea is to use neural ordinary differential equations (ODEs) to characterize the continuous dynamics of node representations over dynamic graphs. We parameterize ordinary differential equations using a novel graph neural network. The existing dynamic graph networks can be considered as a specific discretization of CTGNs. Experimental results on both transductive and inductive tasks demonstrate the effectiveness of our proposed approach over competitive baselines.
    FedHarmony: Unlearning Scanner Bias with Distributed Data. (arXiv:2205.15970v1 [cs.LG])
    The ability to combine data across scanners and studies is vital for neuroimaging, to increase both statistical power and the representation of biological variability. However, combining datasets across sites leads to two challenges: first, an increase in undesirable non-biological variance due to scanner and acquisition differences - the harmonisation problem - and second, data privacy concerns due to the inherently personal nature of medical imaging data, meaning that sharing them across sites may risk violation of privacy laws. To overcome these restrictions, we propose FedHarmony: a harmonisation framework operating in the federated learning paradigm. We show that to remove the scanner-specific effects, we only need to share the mean and standard deviation of the learned features, helping to protect individual subjects' privacy. We demonstrate our approach across a range of realistic data scenarios, using real multi-site data from the ABIDE dataset, thus showing the potential utility of our method for MRI harmonisation across studies. Our code is available at https://github.com/nkdinsdale/FedHarmony.
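    As a loose illustration of why sharing only the mean and standard deviation of features can be enough for cross-site statistics, the sketch below pools per-site summaries without moving any raw feature vectors. It is not the FedHarmony pipeline: the function names are hypothetical, the features are synthetic, and it additionally assumes that each site shares its sample count.

```python
import numpy as np


def site_summary(features):
    """What a site would share: sample count, per-feature mean, per-feature std."""
    return features.shape[0], features.mean(axis=0), features.std(axis=0)


def pool_summaries(summaries):
    """Recover the pooled per-feature mean and std from site summaries alone."""
    total = sum(n for n, _, _ in summaries)
    mean = sum(n * m for n, m, _ in summaries) / total
    # E[x^2] at each site equals variance + mean^2; average it across sites.
    second_moment = sum(n * (s ** 2 + m ** 2) for n, m, s in summaries) / total
    variance = np.maximum(second_moment - mean ** 2, 0.0)
    return mean, np.sqrt(variance)


rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=(50, 8))   # synthetic "learned features", site A
site_b = rng.normal(2.0, 0.5, size=(80, 8))   # synthetic "learned features", site B

pooled_mean, pooled_std = pool_summaries([site_summary(site_a), site_summary(site_b)])
```

    The pooled values match what would be computed on the concatenated features, even though only three small arrays per site cross the site boundary.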
    Intrinsic Dimension Estimation Using Wasserstein Distances. (arXiv:2106.04018v2 [stat.ML] UPDATED)
    It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.
    SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections. (arXiv:2205.15768v1 [cs.CV])
    Inverse rendering of an object under entirely unknown capture conditions is a fundamental challenge in computer vision and graphics. Neural approaches such as NeRF have achieved photorealistic results on novel view synthesis, but they require known camera poses. Solving this problem with unknown camera poses is highly challenging as it requires joint optimization over shape, radiance, and pose. This problem is exacerbated when the input images are captured in the wild with varying backgrounds and illuminations. Standard pose estimation techniques fail in such image collections in the wild due to very few estimated correspondences across images. Furthermore, NeRF cannot relight a scene under any illumination, as it operates on radiance (the product of reflectance and illumination). We propose a joint optimization framework to estimate the shape, BRDF, and per-image camera pose and illumination. Our method works on in-the-wild online image collections of an object and produces relightable 3D assets for several use-cases such as AR/VR. To our knowledge, our method is the first to tackle this severely unconstrained task with minimal user interaction. Project page: https://markboss.me/publication/2022-samurai/ Video: https://youtu.be/LlYuGDjXp-8
    Few-Shot Unlearning by Model Inversion. (arXiv:2205.15567v1 [cs.LG])
    We consider the problem of machine unlearning to erase a target dataset, which causes an unwanted behavior, from the trained model when the training dataset is not given. Previous works have assumed that the target dataset indicates all the training data imposing the unwanted behavior. However, it is often infeasible to obtain such a complete indication. We hence address a practical scenario of unlearning provided a few samples of target data, so-called few-shot unlearning. To this end, we devise a straightforward framework, including a new model inversion technique to retrieve the training data from the model, followed by filtering out samples similar to the target samples and then relearning. We demonstrate that our method using only a subset of target data can outperform the state-of-the-art methods with a full indication of target data.
    Multi-Agent Learning of Numerical Methods for Hyperbolic PDEs with Factored Dec-MDP. (arXiv:2205.15716v1 [cs.LG])
    Factored decentralized Markov decision process (Dec-MDP) is a framework for modeling sequential decision making problems in multi-agent systems. In this paper, we formalize the learning of numerical methods for hyperbolic partial differential equations (PDEs), specifically the Weighted Essentially Non-Oscillatory (WENO) scheme, as a factored Dec-MDP problem. We show that different reward formulations lead to either reinforcement learning (RL) or behavior cloning, and a homogeneous policy could be learned for all agents under the RL formulation with a policy gradient algorithm. Because the trained agents only act on their local observations, the multi-agent system can be used as a general numerical method for hyperbolic PDEs and generalize to different spatial discretizations, episode lengths, dimensions, and even equation types.
    What Knowledge Gets Distilled in Knowledge Distillation?. (arXiv:2205.16004v1 [cs.CV])
    Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions and more. Our results, using image classification as a case study and three state-of-the-art knowledge distillation techniques, show that knowledge distillation methods can indeed indirectly distill other kinds of properties beyond improving task performance. By exploring these questions, we hope for our work to provide a clearer picture of what happens during knowledge distillation.
    Hollywood Identity Bias Dataset: A Context Oriented Bias Analysis of Movie Dialogues. (arXiv:2205.15951v1 [cs.CL])
    Movies reflect society and also hold power to transform opinions. Social biases and stereotypes present in movies can cause extensive damage due to their reach. These biases are not always demanded by the storyline but can creep in as the author's bias. Movie production houses would prefer to ascertain that the bias present in a script is the story's demand. Today, when deep learning models can give human-level accuracy in multiple tasks, having an AI solution to identify the biases present in the script at the writing stage can help them avoid the inconvenience of stalled release, lawsuits, etc. Since AI solutions are data intensive and there exists no domain-specific data to address the problem of biases in scripts, we introduce a new dataset of movie scripts that are annotated for identity bias. The dataset contains dialogue turns annotated for (i) bias labels for seven categories, viz., gender, race/ethnicity, religion, age, occupation, LGBTQ, and other, which contains biases like body shaming, personality bias, etc., (ii) labels for sensitivity, stereotype, sentiment, emotion, emotion intensity, (iii) all labels annotated with context awareness, (iv) target groups and reason for bias labels and (v) expert-driven group-validation process for high quality annotations. We also report various baseline performances for bias identification and category detection on our dataset.
    On the potential of sequential and non-sequential regression models for Sentinel-1-based biomass prediction in Tanzanian miombo forests. (arXiv:2106.15020v2 [cs.LG] UPDATED)
    This study derives regression models for above-ground biomass (AGB) estimation in miombo woodlands of Tanzania that utilise the high availability and low cost of Sentinel-1 data. The limited forest canopy penetration of C-band SAR sensors along with the sparseness of available ground truth restrict their usefulness in traditional AGB regression models. Therefore, we propose to use AGB predictions based on airborne laser scanning (ALS) data as a surrogate response variable for SAR data. This dramatically increases the available training data and opens the way for flexible regression models that capture fine-scale AGB dynamics. This becomes a sequential modelling approach, where the first regression stage has linked in situ data to ALS data and produced the AGB prediction map; we perform the subsequent stage, where this map is related to Sentinel-1 data. We develop a traditional, parametric regression model and alternative non-parametric models for this stage. The latter use a conditional generative adversarial network (cGAN) to translate Sentinel-1 images into ALS-based AGB prediction maps. The convolution filters in the neural networks make them contextual. We compare the sequential models to traditional, non-sequential regression models, all trained on limited AGB ground reference data. Results show that our newly proposed non-sequential Sentinel-1-based regression model performs better quantitatively than the sequential models, but achieves less sensitivity to fine-scale AGB dynamics. The contextual cGAN-based sequential models best reproduce the distribution of ALS-based AGB predictions. They also reach a lower RMSE against in situ AGB data than the parametric sequential model, indicating a potential for further development.
    Simulated Adversarial Testing of Face Recognition Models. (arXiv:2106.04569v3 [cs.CV] UPDATED)
    Most machine learning models are validated and tested on fixed datasets. This can give an incomplete picture of the capabilities and weaknesses of the model. Such weaknesses can be revealed at test time in the real world. The risks involved in such failures can be loss of profits, loss of time or even loss of life in certain critical applications. In order to alleviate this issue, simulators can be controlled in a fine-grained manner using interpretable parameters to explore the semantic image manifold. In this work, we propose a framework for learning how to test machine learning algorithms using simulators in an adversarial manner in order to find weaknesses in the model before deploying it in critical scenarios. We apply this method in a face recognition setup. We show that certain weaknesses of models trained on real data can be discovered using simulated samples. Using our proposed method, we can find adversarial synthetic faces that fool contemporary face recognition models. This demonstrates that these models have weaknesses that are not measured by commonly used validation datasets. We hypothesize that adversarial examples of this type are not isolated, but usually lie in connected regions of the simulator's latent space. We present a method to find these adversarial regions as opposed to the typical adversarial points found in the adversarial example literature.
    DeepDefacer: Automatic Removal of Facial Features via U-Net Image Segmentation. (arXiv:2205.15536v1 [cs.CV])
    Recent advancements in the field of magnetic resonance imaging (MRI) have enabled large-scale collaboration among clinicians and researchers for neuroimaging tasks. However, researchers are often forced to use outdated and slow software to anonymize MRI images for publication. These programs specifically perform expensive mathematical operations over 3D images that rapidly slow down anonymization speed as an image's volume increases in size. In this paper, we introduce DeepDefacer, an application of deep learning to MRI anonymization that uses a streamlined 3D U-Net network to mask facial regions in MRI images with a significant increase in speed over traditional de-identification software. We train DeepDefacer on MRI images from the Brain Development Organization (IXI) and International Consortium for Brain Mapping (ICBM) and quantitatively evaluate our model against a baseline 3D U-Net model with regards to Dice, recall, and precision scores. We also evaluate DeepDefacer against Pydeface, a traditional defacing application, with regards to speed on a range of CPU and GPU devices and qualitatively evaluate our model's defaced output versus the ground truth images produced by Pydeface. We provide a link to a PyPi program at the end of this manuscript to encourage further research into the application of deep learning to MRI anonymization.
    Granular Generalized Variable Precision Rough Sets and Rational Approximations. (arXiv:2205.14365v2 [cs.AI] UPDATED)
    Rational approximations are introduced and studied in granular graded sets and generalizations thereof by the first author in recent research papers. The concept of rationality is determined by related ontologies and coherence between granularity, parthood perspective and approximations used in the context. In addition, a framework is introduced by her in the mentioned paper(s). Granular approximations constructed as per the procedures of VPRS are likely to be more rational than those constructed from a classical perspective under certain conditions. This may continue to hold for some generalizations of the former; however, a formal characterization of such conditions is not available in the previously published literature. In this research, theoretical aspects of the problem are critically examined, uniform generalizations of granular VPRS are introduced, new connections with granular graded rough sets are proved, appropriate concepts of substantial parthood are introduced, and their extent of compatibility with the framework is assessed. Furthermore, meta applications to cluster validation, image segmentation and dynamic sorting are invented. Basic assumptions made are explained, and additional examples are constructed for readability.
    k-Means Maximum Entropy Exploration. (arXiv:2205.15623v1 [cs.LG])
    Exploration in high-dimensional, continuous spaces with sparse rewards is an open problem in reinforcement learning. Artificial curiosity algorithms address this by creating rewards that lead to exploration. Given a reinforcement learning algorithm capable of maximizing rewards, the problem reduces to finding an optimization objective consistent with exploration. Maximum entropy exploration uses the entropy of the state visitation distribution as such an objective. However, efficiently estimating the entropy of the state visitation distribution is challenging in high-dimensional, continuous spaces. We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The bound relies on a result for non-parametric density estimation in arbitrary dimensions using k-means. We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to find rewards.
    Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization. (arXiv:2205.15784v1 [stat.CO])
    Bayesian Likelihood-Free Inference methods yield posterior approximations for simulator models with intractable likelihood. Recently, many works trained neural networks to approximate either the intractable likelihood or the posterior directly. Most proposals use normalizing flows, namely neural networks parametrizing invertible maps used to transform samples from an underlying base measure; the probability density of the transformed samples is then accessible and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. A recent work [Ramesh et al., 2022] approximated instead the posterior with generative networks, which drop the invertibility requirement and are thus a more flexible class of distributions scaling to high-dimensional and structured data. However, generative networks only allow sampling from the parametrized distribution; for this reason, Ramesh et al. [2022] follows the common solution of adversarial training, where the generative network plays a min-max game against a "critic" network. This procedure is unstable and can lead to a learned distribution underestimating the uncertainty - in extreme cases collapsing to a single point. Here, we propose to approximate the posterior with generative networks trained by Scoring Rule minimization, an overlooked adversarial-free method enabling smooth training and better uncertainty quantification. In simulation studies, the Scoring Rule approach yields better performances with shorter training time with respect to the adversarial framework.
    Semantic Autoencoder and Its Potential Usage for Adversarial Attack. (arXiv:2205.15592v1 [cs.LG])
    An autoencoder can give rise to an appropriate latent representation of the input data; however, a representation based solely on the intrinsic properties of the input data is usually poor at expressing semantic information. A typical case is the potential inability to form a clear boundary when these representations are clustered. By encoding a latent representation that depends not only on the content of the input data but also on its semantics, such as label information, we propose an enhanced autoencoder architecture named semantic autoencoder. Experiments on the representation distribution via t-SNE show a clear distinction between these two types of encoders and confirm the superiority of the semantic one, whilst the decoded samples of these two types of autoencoders exhibit faint dissimilarity either objectively or subjectively. Based on this observation, we consider adversarial attacks on learning algorithms that rely on the latent representation obtained via autoencoders. It turns out that the latent contents of adversarial samples constructed from the semantic encoder with deliberately wrong label information exhibit a different distribution from that of the original input data, while the samples themselves differ only marginally. This new way of attack set up by our work is worthy of attention due to the necessity to secure the widespread deep learning applications.
    Label-Enhanced Graph Neural Network for Semi-supervised Node Classification. (arXiv:2205.15653v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely applied in the semi-supervised node classification task, where a key point lies in how to sufficiently leverage the limited but valuable label information. Most of the classical GNNs solely use the known labels for computing the classification loss at the output. In recent years, several methods have been designed to additionally utilize the labels at the input. Some of these methods augment the node features by concatenating or adding them with the one-hot encodings of labels, while other methods optimize the graph structure by assuming neighboring nodes tend to have the same label. To bring into full play the rich information of labels, in this paper, we present a label-enhanced learning framework for GNNs, which first models each label as a virtual center for intra-class nodes and then jointly learns the representations of both nodes and labels. Our approach can not only smooth the representations of nodes belonging to the same class, but also explicitly encode the label semantics into the learning process of GNNs. Moreover, a training node selection technique is provided to eliminate the potential label leakage issue and guarantee the model generalization ability. Finally, an adaptive self-training strategy is proposed to iteratively enlarge the training set with more reliable pseudo labels and distinguish the importance of each pseudo-labeled node during the model training process. Experimental results on both real-world and synthetic datasets demonstrate that our approach can not only consistently outperform the state of the art, but also effectively smooth the representations of intra-class nodes.
    A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting. (arXiv:2205.15580v1 [cs.LG])
    We present a new method that includes three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, compressed communication, and partial participation. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Moreover, we observe that "1 + 1 + 1 is not 3": by mixing variance reduction of stochastic gradients with compressed communication and partial participation, we do not obtain a fully synergetic effect. We explain the nature of this phenomenon, argue that this is to be expected, and propose possible workarounds.
    GlanceNets: Interpretable, Leak-proof Concept-based Models. (arXiv:2205.15612v1 [cs.LG])
    There is growing interest in concept-based models (CBMs) that combine high-performance and interpretability by acquiring and reasoning with a vocabulary of high-level concepts. A key requirement is that the concepts be interpretable. Existing CBMs tackle this desideratum using a variety of heuristics based on unclear notions of interpretability, and fail to acquire concepts with the intended semantics. We address this by providing a clear definition of interpretability in terms of alignment between the model's representation and an underlying data generation process, and introduce GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment, thus improving the interpretability of the learned concepts. We show that GlanceNets, paired with concept-level supervision, achieve better alignment than state-of-the-art approaches while preventing spurious information from unintendedly leaking into the learned concepts.
    HyperMAML: Few-Shot Adaptation of Deep Models with Hypernetworks. (arXiv:2205.15745v1 [cs.LG])
    The aim of Few-Shot learning methods is to train models which can easily adapt to previously unseen tasks, based on small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this method is to learn the general weights of the meta-model, which are further adapted to specific problems in a small number of gradient steps. However, the model's main limitation lies in the fact that the update procedure is realized by gradient-based optimisation. In consequence, MAML cannot always modify weights to the essential level in one or even a few gradient iterations. On the other hand, using many gradient steps results in a complex and time-consuming optimization procedure, which is hard to train in practice, and may lead to overfitting. In this paper, we propose HyperMAML, a novel generalization of MAML, where the training of the update procedure is also part of the model. Namely, in HyperMAML, instead of updating the weights with gradient descent, we use for this purpose a trainable Hypernetwork. Consequently, in this framework, the model can generate significant updates whose range is not limited to a fixed number of gradient steps. Experiments show that HyperMAML consistently outperforms MAML and performs comparably to other state-of-the-art techniques in a number of standard Few-Shot learning benchmarks.
    Concept-level Debugging of Part-Prototype Networks. (arXiv:2205.15769v1 [cs.LG])
    Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounds and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model's explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. An extensive empirical evaluation on synthetic and real-world data shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost.
    Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark. (arXiv:2205.15761v1 [cs.CV])
    Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two purposes: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for both of them. These algorithms are often trained for the goal of retrieving the same landmark under a large range of viewpoint changes, which often differs from the requirements of visual localization. In order to investigate the consequences for visual localization, this paper focuses on understanding the role of image retrieval for multiple visual localization paradigms. First, we introduce a novel benchmark setup and compare state-of-the-art retrieval representations on multiple datasets using localization performance as metric. Second, we investigate several definitions of "ground truth" for image retrieval. Using these definitions as upper bounds for the visual localization paradigms, we show that there is still significant room for improvement. Third, using these tools and in-depth analysis, we show that retrieval performance on classical landmark retrieval or place recognition tasks correlates only for some but not all paradigms to localization performance. Finally, we analyze the effects of blur and dynamic scenes in the images. We conclude that there is a need for retrieval approaches specifically designed for localization paradigms. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
    You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments. (arXiv:2205.15967v1 [cs.LG])
    Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments since trajectories that result in a return may have only achieved that return due to luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than simply conditioning on the return of a single trajectory as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent from environment stochasticity. Doing so allows ESPER to achieve strong alignment between target return and expected performance in real environments. We demonstrate this in several challenging stochastic offline-RL tasks including the challenging puzzle game 2048, and Connect Four playing against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and achieved return than simply conditioning on returns. ESPER also achieves higher maximum performance than even the value-based baselines.
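    The failure mode described in this abstract can be illustrated with a toy example. The sketch below is not ESPER itself: the dataset, the action names, and the use of the action as the cluster key are all assumptions made for illustration (the abstract says clusters are learned). It only contrasts conditioning on single-trajectory returns with conditioning on average cluster returns.

```python
from collections import defaultdict

# Hypothetical offline dataset of (action, observed_return) pairs.
# "gamble" pays +10 or -10 equally often; "safe" always pays +1.
DATASET = [
    ("gamble", 10.0), ("gamble", -10.0), ("gamble", 10.0), ("gamble", -10.0),
    ("safe", 1.0), ("safe", 1.0), ("safe", 1.0), ("safe", 1.0),
]


def action_for_target(labelled, target):
    """Pick the action whose conditioning label is closest to the target return."""
    return min(labelled, key=lambda pair: abs(pair[1] - target))[0]


def cluster_average_labels(dataset):
    """Relabel every sample with the mean return of its cluster.

    Clusters are keyed by action here, a stand-in for learned clusters.
    """
    returns_by_cluster = defaultdict(list)
    for action, ret in dataset:
        returns_by_cluster[action].append(ret)
    means = {a: sum(r) / len(r) for a, r in returns_by_cluster.items()}
    return [(action, means[action]) for action, _ in dataset]


def expected_return(dataset, action):
    """Empirical mean return of an action over the dataset."""
    rets = [r for a, r in dataset if a == action]
    return sum(rets) / len(rets)


# Conditioning on raw returns chases the lucky +10 outcome.
naive_action = action_for_target(DATASET, target=10.0)
# Conditioning on cluster averages sees that "gamble" is worth 0 on average.
averaged_action = action_for_target(cluster_average_labels(DATASET), target=10.0)
```

    In this toy setting the raw-return policy selects the gamble, whose average return is 0, while the cluster-averaged policy selects the safe action, whose average return is 1.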
    Contrastive Representation Learning for 3D Protein Structures. (arXiv:2205.15675v1 [q-bio.BM])
    Learning from 3D protein structures has gained wide interest in protein modeling and structural bioinformatics. Unfortunately, the number of available structures is orders of magnitude lower than the training data sizes commonly used in computer vision and machine learning. Moreover, this number is reduced even further when only annotated protein structures can be considered, making the training of existing models difficult and prone to over-fitting. To address this challenge, we introduce a new representation learning framework for 3D protein structures. Our framework uses unsupervised contrastive learning to learn meaningful representations of protein structures, making use of proteins from the Protein Data Bank. We show how these representations can be used to solve a large variety of tasks, such as protein function prediction, protein fold classification, structural similarity prediction, and protein-ligand binding affinity prediction. Moreover, we show how fine-tuned networks, pre-trained with our algorithm, lead to significantly improved task performance, achieving new state-of-the-art results in many tasks.
    Differentiable Invariant Causal Discovery. (arXiv:2205.15638v1 [cs.LG])
    Learning causal structure from observational data is a fundamental challenge in machine learning. The majority of commonly used differentiable causal discovery methods are non-identifiable, turning this problem into a continuous optimization task prone to data biases. In many real-life situations, data is collected from different environments, in which the functional relations remain consistent across environments, while the distribution of additive noises may vary. This paper proposes Differentiable Invariant Causal Discovery (DICD), utilizing the multi-environment information based on a differentiable framework to avoid learning spurious edges and wrong causal directions. Specifically, DICD aims to discover the environment-invariant causation while removing the environment-dependent correlation. We further formulate the constraint that enforces the target structure equation model to remain optimal across the environments. Theoretical guarantees for the identifiability of the proposed DICD are provided under mild conditions with enough environments. Extensive experiments on synthetic and real-world datasets verify that DICD outperforms state-of-the-art causal discovery methods by up to 36% in SHD. Our code will be open-sourced upon acceptance.
    SymFormer: End-to-end symbolic regression using transformer-based architecture. (arXiv:2205.15764v1 [cs.CV])
    Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to predict novel viewpoints in the scene, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRFs), and while achieving impressive results, the methods suffer from long training times as they require evaluating thousands of 3D point samples via a deep neural network for each image. We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook is used to embed individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive compared to NeRF-based methods while not reasoning in 3D, and it is faster to train.
    Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish. (arXiv:2205.15712v1 [cs.CL])
    Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features both in English and Polish languages. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases, the results were even better. Additionally, we prepared a new dataset -- ProductMatch.pl -- that is entirely in Polish and based on offers in selected categories obtained from several online stores for the research purpose. It is the first open dataset for product matching tasks in Polish, which allows comparing the effectiveness of the pre-trained models. Thus, we also showed the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.
    Robust Projection based Anomaly Extraction (RPE) in Univariate Time-Series. (arXiv:2205.15548v1 [stat.ML])
    This paper presents a novel, closed-form, and data/computation efficient online anomaly detection algorithm for time-series data. The proposed method, dubbed RPE, is a window-based method and, in sharp contrast to the existing window-based methods, it is robust to the presence of anomalies in its window and it can distinguish the anomalies at the time-stamp level. RPE leverages the linear structure of the trajectory matrix of the time-series and employs a robust projection step which makes the algorithm able to handle the presence of multiple arbitrarily large anomalies in its window. A closed-form/non-iterative algorithm for the robust projection step is provided and it is proved that it can identify the corrupted time-stamps. RPE is a strong candidate for applications where large training data is not available, which is a common scenario in the area of time-series. An extensive set of numerical experiments shows that RPE can outperform the existing approaches by a notable margin.
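    The "linear structure of the trajectory matrix" mentioned above can be illustrated with a minimal sketch. It only builds the standard lagged (Hankel) trajectory matrix and checks its rank; it is not the RPE robust projection step, whose details the abstract does not give. The function name `trajectory_matrix` and the window length are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def trajectory_matrix(series, window):
    """Stack lagged windows of a 1-D series as columns (a Hankel matrix).

    Entry (i, j) equals series[i + j]; the result has shape
    (window, len(series) - window + 1).
    """
    x = np.asarray(series, dtype=float)
    if not 1 <= window <= len(x):
        raise ValueError("window must be between 1 and len(series)")
    # sliding_window_view yields one window per row; transpose so that
    # each column is one window, then copy to get a writable array.
    return sliding_window_view(x, window).T.copy()


# A series with simple structure (here a linear trend) gives a
# low-rank trajectory matrix, which is what projection methods exploit.
T = trajectory_matrix(np.arange(10), window=4)
rank = np.linalg.matrix_rank(T)
```

    For the linear trend used here the entries are `i + j`, so the matrix has rank 2 regardless of the window length; a large anomaly at one time-stamp would break this low-rank structure along one anti-diagonal.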
    HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time and Robustness. (arXiv:2205.15568v1 [cs.LG])
    The process of optimizing the latency of DNN operators with ML models and hardware-in-the-loop, called auto-tuning, has established itself as a pervasive method for the deployment of neural networks. From a search space of loop-optimizations, the candidate providing the best performance has to be selected. Performance of individual configurations is evaluated through hardware measurements. The combinatorial explosion of possible configurations, together with the cost of hardware evaluation makes exhaustive explorations of the search space infeasible in practice. Machine Learning methods, like random forests or reinforcement learning are used to aid in the selection of candidates for hardware evaluation. For general purpose hardware like x86 and GPGPU architectures impressive performance gains can be achieved, compared to hand-optimized libraries like cuDNN. The method is also useful in the space of hardware accelerators with less wide-spread adoption, where a high-performance library is not always available. However, hardware accelerators are often less flexible with respect to their programming which leads to operator configurations not executable on the hardware target. This work evaluates how these invalid configurations affect the auto-tuning process and its underlying performance prediction model for the VTA hardware. From these results, a validity-driven initialization method for AutoTVM is developed, only requiring 41.6% of the necessary hardware measurements to find the best solution, while improving search robustness.
    Automatic Relation-aware Graph Network Proliferation. (arXiv:2205.15678v1 [cs.LG])
    Graph neural architecture search has sparked much attention as Graph Neural Networks (GNNs) have shown powerful reasoning capability in many relational tasks. However, the currently used graph search space overemphasizes learning node features and neglects mining hierarchical relational information. Moreover, due to diverse mechanisms in the message passing, the graph search space is much larger than that of CNNs. This hinders the straightforward application of classical search strategies for exploring complicated graph search space. We propose Automatic Relation-aware Graph Network Proliferation (ARGNP) for efficiently searching GNNs with a relation-guided message passing mechanism. Specifically, we first devise a novel dual relation-aware graph search space that comprises both node and relation learning operations. These operations can extract hierarchical node/relational information and provide anisotropic guidance for message passing on a graph. Second, analogous to cell proliferation, we design a network proliferation search paradigm to progressively determine the GNN architectures by iteratively performing network division and differentiation. The experiments on six datasets for four graph learning tasks demonstrate that GNNs produced by our method are superior to the current state-of-the-art hand-crafted and search-based GNNs. Codes are available at https://github.com/phython96/ARGNP.
    Secure Federated Clustering. (arXiv:2205.15564v1 [cs.LG])
    We consider a foundational unsupervised learning task of $k$-means data clustering, in a federated learning (FL) setting consisting of a central server and many distributed clients. We develop SecFC, which is a secure federated clustering algorithm that simultaneously achieves 1) universal performance: no performance loss compared with clustering over centralized data, regardless of data distribution across clients; 2) data privacy: each client's private data and the cluster centers are not leaked to other clients and the server. In SecFC, the clients perform Lagrange encoding on their local data and share the coded data in an information-theoretically private manner; then leveraging the algebraic structure of the coding, the FL network exactly executes the Lloyd's $k$-means heuristic over the coded data to obtain the final clustering. Experiment results on synthetic and real datasets demonstrate the universally superior performance of SecFC for different data distributions across clients, and its computational practicality for various combinations of system parameters. Finally, we propose an extension of SecFC to further provide membership privacy for all data points.
    Scalable Distributional Robustness in a Class of Non Convex Optimization with Guarantees. (arXiv:2205.15624v1 [cs.LG])
    Distributionally robust optimization (DRO) has shown a lot of promise in providing robustness in learning as well as sample-based optimization problems. We endeavor to provide DRO solutions for a class of sum-of-fractionals, non-convex optimization problems, which is used for decision making in prominent areas such as facility location and security games. In contrast to previous work, we find it more tractable to optimize the equivalent variance regularized form of DRO rather than the minimax form. We transform the variance regularized form to a mixed-integer second order cone program (MISOCP), which, while guaranteeing near global optimality, does not scale enough to solve problems with real world data-sets. We further propose two abstraction approaches based on clustering and stratified sampling to increase scalability, which we then use for real world data-sets. Importantly, we provide near global optimality guarantees for our approach and show experimentally that our solution quality is better than the locally optimal ones achieved by state-of-the-art gradient-based methods. We experimentally compare our different approaches and baselines, and reveal nuanced properties of a DRO solution.
    Multi-task Optimization Based Co-training for Electricity Consumption Prediction. (arXiv:2205.15663v1 [cs.LG])
    Real-world electricity consumption prediction may involve different tasks, e.g., prediction for different time steps ahead or different geo-locations. These tasks are often solved independently without utilizing some common problem-solving knowledge that could be extracted and shared among these tasks to augment the performance of solving each task. In this work, we propose a multi-task optimization (MTO) based co-training (MTO-CT) framework, where the models for solving different tasks are co-trained via an MTO paradigm in which solving each task may benefit from the knowledge gained from when solving some other tasks to help its solving process. MTO-CT leverages long short-term memory (LSTM) based model as the predictor where the knowledge is represented via connection weights and biases. In MTO-CT, an inter-task knowledge transfer module is designed to transfer knowledge between different tasks, where the most helpful source tasks are selected by using the probability matching and stochastic universal selection, and evolutionary operations like mutation and crossover are performed for reusing the knowledge from selected source tasks in a target task. We use electricity consumption data from five states in Australia to design two sets of tasks at different scales: a) one-step ahead prediction for each state (five tasks) and b) 6-step, 12-step, 18-step, and 24-step ahead prediction for each state (20 tasks). The performance of MTO-CT is evaluated on solving each of these two sets of tasks in comparison to solving each task in the set independently without knowledge sharing under the same settings, which demonstrates the superiority of MTO-CT in terms of prediction accuracy.
    GSR: A Generalized Symbolic Regression Approach. (arXiv:2205.15569v1 [cs.LG])
    Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method outperforms several state-of-the-art methods on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
    The CLRS Algorithmic Reasoning Benchmark. (arXiv:2205.15659v1 [cs.LG])
    Learning representations of algorithms is an emerging area of machine learning, seeking to bridge concepts from neural networks with classical algorithms. Several important works have investigated whether neural networks can effectively reason like algorithms, typically by learning to execute them. The common trend in the area, however, is to generate targeted kinds of algorithmic data to evaluate specific hypotheses, making results hard to transfer across publications, and increasing the barrier of entry. To consolidate progress and work towards unified evaluation, we propose the CLRS Algorithmic Reasoning Benchmark, covering classical algorithms from the Introduction to Algorithms textbook. Our benchmark spans a variety of algorithmic reasoning procedures, including sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms. We perform extensive experiments to demonstrate how several popular algorithmic reasoning baselines perform on these tasks, and consequently, highlight links to several open challenges. Our library is readily available at https://github.com/deepmind/clrs.
    VC Theoretical Explanation of Double Descent. (arXiv:2205.15549v1 [stat.ML])
    There has been growing interest in the generalization performance of large multilayer neural networks that can be trained to achieve zero training error, while generalizing well on test data. This regime is known as 'second descent' and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, aka the bias-variance trade-off. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC generalization bounds. We illustrate an application of analytic VC-bounds for modeling double descent for classification problems, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several possible reasons for misinterpretation of VC-theoretical results in the machine learning community.
    Simulation-Based Inference with WALDO: Perfectly Calibrated Confidence Regions Using Any Prediction or Posterior Estimation Algorithm. (arXiv:2205.15680v1 [stat.ML])
    The vast majority of modern machine learning targets prediction problems, with algorithms such as Deep Neural Networks revolutionizing the accuracy of point predictions for high-dimensional complex data. Predictive approaches are now used in many domain sciences to directly estimate internal parameters of interest in theoretical simulator-based models. In parallel, common alternatives focus on estimating the full posterior using modern neural density estimators such as normalizing flows. However, an open problem in simulation-based inference (SBI) is how to construct properly calibrated confidence regions for internal parameters with nominal conditional coverage and high power. Many SBI methods are indeed known to produce overly confident posterior approximations, yielding misleading uncertainty estimates. Similarly, existing approaches for uncertainty quantification in deep learning provide no guarantees on conditional coverage. In this work, we present WALDO, a novel method for constructing correctly calibrated confidence regions in SBI. WALDO reframes the well-known Wald test and uses Neyman inversion to convert point predictions and posteriors from any prediction or posterior estimation algorithm to confidence sets with correct conditional coverage, even for finite sample sizes. As a concrete example, we demonstrate how a recently proposed deep learning prediction approach for particle energies in high-energy physics can be recalibrated using WALDO to produce confidence intervals with correct coverage and high power.
    Mitigating Dataset Bias by Using Per-sample Gradient. (arXiv:2205.15704v1 [cs.LG])
    The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes having a strong correlation with the target attribute are present, the trained model can provide unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods are based on explicit bias labels involving human or empirical correlation metrics (e.g., training loss). However, such metrics require human costs or have insufficient theoretical explanation. In this study, we propose a debiasing algorithm, called PGD (Per-sample Gradient-based Debiasing), that comprises three steps: (1) training a model on uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of the sample gradient, and (3) training the model using importance-batch sampling, whose probability is obtained in step (2). Compared with existing baselines for various synthetic and real-world datasets, the proposed method showed state-of-the-art accuracy for the classification task. Furthermore, we describe theoretical understandings about how PGD can mitigate dataset bias.
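    The three steps listed above can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: it assumes a logistic regression model (for which the per-sample gradient norm has the closed form |p - y| * ||x||), retrains from scratch in step (3), and uses hypothetical names (`train`, `per_sample_grad_norms`, `pgd_sketch`) and hyperparameters.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, y, probs, steps, lr, batch, rng):
    """SGD for logistic regression, drawing each batch with probabilities `probs`."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, p=probs)
        err = sigmoid(X[idx] @ w) - y[idx]
        w -= lr * (X[idx].T @ err) / batch
    return w


def per_sample_grad_norms(X, y, w):
    """Norm of each sample's loss gradient w.r.t. w: |p_i - y_i| * ||x_i||."""
    err = sigmoid(X @ w) - y
    return np.abs(err) * np.linalg.norm(X, axis=1)


def pgd_sketch(X, y, steps=200, lr=0.1, batch=16, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # Step (1): train with uniform batch sampling.
    uniform = np.full(n, 1.0 / n)
    w_first = train(X, y, uniform, steps, lr, batch, rng)
    # Step (2): importance proportional to the per-sample gradient norm
    # (a small constant avoids division by zero; this is an assumption).
    norms = per_sample_grad_norms(X, y, w_first) + 1e-12
    importance = norms / norms.sum()
    # Step (3): train again, sampling batches with those probabilities.
    w_second = train(X, y, importance, steps, lr, batch, rng)
    return importance, w_second
```

    Samples that the first model fits poorly receive larger gradient norms and are therefore drawn more often in the second stage, which is the intuition behind emphasizing bias-conflicting samples.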
    Individual health-disease phase diagrams for disease prevention based on machine learning. (arXiv:2205.15598v1 [cs.LG])
    Early disease detection and prevention methods based on effective interventions are gaining attention. Machine learning technology has enabled precise disease prediction by capturing individual differences in multivariate data. Progress in precision medicine has revealed that substantial heterogeneity exists in health data at the individual level and that complex health factors are involved in the development of chronic diseases. However, it remains a challenge to identify individual physiological state changes in cross-disease onset processes because of the complex relationships among multiple biomarkers. Here, we present the health-disease phase diagram (HDPD), which represents a personal health state by visualizing the boundary values of multiple biomarkers that fluctuate early in the disease progression process. In HDPDs, future onset predictions are represented by perturbing multiple biomarker values while accounting for dependencies among variables. We constructed HDPDs for 11 non-communicable diseases (NCDs) from a longitudinal health checkup cohort of 3,238 individuals, comprising 3,215 measurement items and genetic data. Improvement of biomarker values to the non-onset region in HDPD significantly prevented future disease onset in 7 out of 11 NCDs. Our results demonstrate that HDPDs can represent individual physiological states in the onset process and be used as intervention goals for disease prevention.
    A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. (arXiv:2006.14171v3 [cs.LG] UPDATED)
    In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved state-of-the-art performance in many challenging strategy games. Because these games have complicated rules, an action sampled from the full discrete action distribution predicted by the learned policy is likely to be invalid according to the game rules (e.g., walking into a wall). The usual approach to deal with this problem in policy gradient algorithms is to "mask out" invalid actions and just sample from the set of valid actions. The implications of this process, however, remain under-investigated. In this paper, we 1) show theoretical justification for such a practice, 2) empirically demonstrate its importance as the space of invalid actions grows, and 3) provide further insights by evaluating different action masking regimes, such as removing masking after an agent has been trained using masking. The source code can be found at https://github.com/vwxyzjn/invalid-action-masking
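    The "mask out" step described above is commonly realized by replacing the logits of invalid actions with negative infinity before the softmax. The following is a minimal NumPy sketch of that idea under this assumption; the function names are hypothetical and it is not taken from the linked repository.

```python
import numpy as np


def masked_policy(logits, valid_mask):
    """Softmax over logits with invalid actions forced to probability zero."""
    logits = np.asarray(logits, dtype=float)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    if not valid_mask.any():
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, so exp(-inf) = 0 below.
    masked = np.where(valid_mask, logits, -np.inf)
    shifted = masked - masked.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def sample_action(logits, valid_mask, rng):
    """Sample an action index from the masked distribution."""
    probs = masked_policy(logits, valid_mask)
    return int(rng.choice(len(probs), p=probs))


probs = masked_policy([1.0, 2.0, 3.0], [True, False, True])
action = sample_action([1.0, 2.0, 3.0], [True, False, True],
                       np.random.default_rng(0))
```

    Because the masked entries contribute exactly zero to the softmax, the remaining probabilities are renormalized over the valid actions only, and sampling can never return an invalid action.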
    ViNNPruner: Visual Interactive Pruning for Deep Learning. (arXiv:2205.15731v1 [cs.LG])
    Neural networks grow vastly in size to tackle more sophisticated tasks. In many cases, such large networks are not deployable on particular hardware and need to be reduced in size. Pruning techniques help to shrink deep neural networks to smaller sizes while decreasing their performance as little as possible. However, such pruning algorithms are often hard to understand merely by applying them, and they do not include domain knowledge, which can work against user goals. We propose ViNNPruner, a visual interactive pruning application that implements state-of-the-art pruning algorithms and the option for users to do manual pruning based on their knowledge. We show how the application facilitates gaining insights into automatic pruning algorithms and semi-automatically pruning oversized networks to make them more efficient using interactive visualizations.
    Goal-Aware Neural SAT Solver. (arXiv:2106.07162v2 [cs.LG] UPDATED)
    Modern neural networks obtain information about the problem and calculate the output solely from the input values. We argue that it is not always optimal, and the network's performance can be significantly improved by augmenting it with a query mechanism that allows the network at run time to make several solution trials and get feedback on the loss value on each trial. To demonstrate the capabilities of the query mechanism, we formulate an unsupervised (not depending on labels) loss function for Boolean Satisfiability Problem (SAT) and theoretically show that it allows the network to extract rich information about the problem. We then propose a neural SAT solver with a query mechanism called QuerySAT and show that it outperforms the neural baseline on a wide range of SAT tasks.
    Unsupervised Image Representation Learning with Deep Latent Particles. (arXiv:2205.15821v1 [cs.CV])
    We propose a new representation of visual data that disentangles object position from appearance. Our method, termed Deep Latent Particles (DLP), decomposes the visual input into low-dimensional latent ``particles'', where each particle is described by its spatial location and features of its surrounding region. To drive learning of such representations, we follow a VAE-based approach and introduce a prior for particle positions based on a spatial-softmax architecture, and a modification of the evidence lower bound loss inspired by the Chamfer distance between particles. We demonstrate that our DLP representations are useful for downstream tasks such as unsupervised keypoint (KP) detection, image manipulation, and video prediction for scenes composed of multiple dynamic objects. In addition, we show that our probabilistic interpretation of the problem naturally provides uncertainty estimates for particle locations, which can be used for model selection, among other tasks. Videos and code are available: https://taldatech.github.io/deep-latent-particles-web/
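    The spatial-softmax operation referred to above turns a feature map into an expected 2D location. The sketch below shows the generic operation only, assuming a single heatmap and coordinates normalized to [-1, 1]; how DLP uses it to build a prior over particle positions is not specified in the abstract, and the function name is hypothetical.

```python
import numpy as np


def spatial_softmax_keypoint(heatmap):
    """Return the expected (x, y) location under a softmax over a 2-D heatmap.

    Coordinates are normalized so that the image spans [-1, 1] on both axes.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    flat = heatmap.reshape(-1)
    p = np.exp(flat - flat.max())  # stable softmax over all pixels
    p = (p / p.sum()).reshape(h, w)
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    # Marginalize over the other axis, then take the expectation.
    y = float((p.sum(axis=1) * ys).sum())
    x = float((p.sum(axis=0) * xs).sum())
    return x, y


# A sharp peak in the top-right corner maps to roughly (1, -1).
hm = np.zeros((5, 5))
hm[0, 4] = 50.0
x, y = spatial_softmax_keypoint(hm)
```

    The output is differentiable with respect to the heatmap, which is why this operation is a common building block for unsupervised keypoint detection.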
    The Computational Drug Repositioning without Negative Sampling. (arXiv:2111.14696v3 [cs.LG] UPDATED)
    Computational drug repositioning technology is an effective tool to accelerate drug development. Although this technique has been widely used and successful in recent decades, many existing models still suffer from multiple drawbacks such as the massive number of unvalidated drug-disease associations and the inner product. The limitations of these works are mainly due to the following two reasons: firstly, previous works used negative sampling techniques to treat unvalidated drug-disease associations as negative samples, which is invalid in real-world settings; secondly, the inner product cannot fully take into account the feature information contained in the latent factor of drug and disease. In this paper, we propose a novel PUON framework for addressing the above deficiencies, which models the risk estimator of computational drug repositioning only using validated (Positive) and unvalidated (Unlabelled) drug-disease associations without employing negative sampling techniques. PUON also proposes an Outer Neighborhood-based classifier for modeling the cross-feature information of the latent factor. For a comprehensive comparison, we considered 8 popular baselines. Extensive experiments on four real-world datasets showed that the PUON model achieved the best performance based on 6 evaluation metrics.
    Automatic Diagnosis of Schizophrenia and Attention Deficit Hyperactivity Disorder in rs-fMRI Modality using Convolutional Autoencoder Model and Interval Type-2 Fuzzy Regression. (arXiv:2205.15858v1 [cs.LG])
    Nowadays, many people worldwide suffer from brain disorders, and their health is in danger. So far, numerous methods have been proposed for the diagnosis of Schizophrenia (SZ) and attention deficit hyperactivity disorder (ADHD), among which functional magnetic resonance imaging (fMRI) modalities are known as a popular method among physicians. This paper presents an SZ and ADHD intelligent detection method of resting-state fMRI (rs-fMRI) modality using a new deep learning (DL) method. The University of California Los Angeles (UCLA) dataset, which contains the rs-fMRI modalities of SZ and ADHD patients, has been used for experiments. The FMRIB software library (FSL) toolbox first performed preprocessing on rs-fMRI data. Then, a convolutional Autoencoder (CNN-AE) model with the proposed number of layers is used to extract features from rs-fMRI data. In the classification step, a new fuzzy method called interval type-2 fuzzy regression (IT2FR) is introduced and then optimized by genetic algorithm (GA), particle swarm optimization (PSO), and gray wolf optimization (GWO) techniques. Also, the results of IT2FR methods are compared with multilayer perceptron (MLP), k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), decision tree (DT), and adaptive neuro-fuzzy inference system (ANFIS) methods. The experiment results show that the IT2FR method with the GWO optimization algorithm has achieved satisfactory results compared to other classifier methods. Finally, the proposed classification technique was able to provide 72.71% accuracy.
    Mean Field inference of CRFs based on GAT. (arXiv:2205.15312v1 [cs.LG])
    In this paper, we propose an improved mean-field inference algorithm for the fully connected paired CRFs model. In the improved method, the message passing operation is changed from the original linear convolution to a graph attention operation, while the process of the inference algorithm is turned into the forward process of the GAT model. Combined with the mean-field inferred label distribution, it is equivalent to the output of a classifier with only unary potential. To this end, we propose a graph attention network model with residual structure, and the approach is applicable to all sequence annotation tasks, such as pixel-level image semantic segmentation tasks as well as text annotation tasks.
    Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game. (arXiv:2205.15512v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, the minimax optimal performance has only been (nearly) achieved for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose two new algorithms, SPEVI+ and SPMVI+, for single-agent MDPs and two-player zero-sum Markov games (MGs), respectively. The proposed algorithms feature carefully crafted data splitting mechanisms and novel variance-reduction pessimistic estimators. Theoretical analysis demonstrates that they are capable of matching the performance lower bounds up to logarithmic factors. As a byproduct, a new performance lower bound is established for MGs, which tightens the existing results. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
    Variational Transfer Learning using Cross-Domain Latent Modulation. (arXiv:2205.15523v1 [cs.LG])
    To successfully apply trained neural network models to new domains, powerful transfer learning solutions are essential. We propose to introduce a novel cross-domain latent modulation mechanism to a variational autoencoder framework so as to achieve effective transfer learning. Our key idea is to procure deep representations from one data domain and use it to influence the reparameterization of the latent variable of another domain. Specifically, deep representations of the source and target domains are first extracted by a unified inference model and aligned by employing gradient reversal. The learned deep representations are then cross-modulated to the latent encoding of the alternative domain, where consistency constraints are also applied. In the empirical validation that includes a number of transfer learning benchmark tasks for unsupervised domain adaptation and image-to-image translation, our model demonstrates competitive performance, which is also supported by evidence obtained from visualization.
    Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning. (arXiv:2205.15367v1 [cs.LG])
    We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to include hidden state information that captures temporal dependencies in human assessment of trajectories. We then show how RM can be approached as a multiple instance learning (MIL) problem, and develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and that they provide interpretable learnt hidden information that can be used to train high-performing agent policies.
    A hybrid approach to seismic deblending: when physics meets self-supervision. (arXiv:2205.15395v1 [physics.geo-ph])
    To limit the time, cost, and environmental impact associated with the acquisition of seismic data, in recent decades considerable effort has been put into so-called simultaneous shooting acquisitions, where seismic sources are fired at short time intervals between each other. As a consequence, waves originating from consecutive shots are entangled within the seismic recordings, yielding so-called blended data. For processing and imaging purposes, the data generated by each individual shot must be retrieved. This process, called deblending, is achieved by solving an inverse problem which is heavily underdetermined. Conventional approaches rely on transformations that render the blending noise into burst-like noise, whilst preserving the signal of interest. Compressed sensing type regularization is then applied, where sparsity in some domain is assumed for the signal of interest. The domain of choice depends on the geometry of the acquisition and the properties of seismic data within the chosen domain. In this work, we introduce a new concept that consists of embedding a self-supervised denoising network into the Plug-and-Play (PnP) framework. A novel network is introduced whose design extends the blind-spot network architecture of [28] for partially coherent noise (i.e., correlated in time). The network is then trained directly on the noisy input data at each step of the PnP algorithm. By leveraging both the underlying physics of the problem and the great denoising capabilities of our blind-spot network, the proposed algorithm is shown to outperform an industry-standard method whilst being comparable in terms of computational cost. Moreover, being independent of the acquisition geometry, our method can be easily applied to both marine and land data without any significant modification.
    Bayesian Active Learning for Scanning Probe Microscopy: from Gaussian Processes to Hypothesis Learning. (arXiv:2205.15458v1 [cond-mat.mtrl-sci])
    Recent progress in machine learning methods, and the emerging availability of programmable interfaces for scanning probe microscopes (SPMs), have propelled automated and autonomous microscopies to the forefront of attention of the scientific community. However, enabling automated microscopy requires the development of task-specific machine learning methods, understanding the interplay between physics discovery and machine learning, and fully defined discovery workflows. This, in turn, requires balancing the physical intuition and prior knowledge of the domain scientist with rewards that define experimental goals and machine learning algorithms that can translate these to specific experimental protocols. Here, we discuss the basic principles of Bayesian active learning and illustrate its applications for SPM. We progress from the Gaussian Process as a simple data-driven method and Bayesian inference for physical models as an extension of physics-based functional fits to more complex deep kernel learning methods, structured Gaussian Processes, and hypothesis learning. These frameworks allow for the use of prior data, the discovery of specific functionalities as encoded in spectral data, and exploration of physical laws manifesting during the experiment. The discussed framework can be universally applied to all techniques combining imaging and spectroscopy, SPM methods, nanoindentation, electron microscopy and spectroscopy, and chemical imaging methods, and can be particularly impactful for destructive or irreversible measurements.
    Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games. (arXiv:2205.15879v1 [cs.AI])
    Learning to play optimally against any mixture over a diverse set of strategies is of great practical interest in competitive games. In this paper, we propose simplex-NeuPL that satisfies two desiderata simultaneously: i) learning a population of strategically diverse basis policies, represented by a single conditional network; ii) using the same network, learning best-responses to any mixture over the simplex of basis policies. We show that the resulting conditional policies incorporate prior information about their opponents effectively, enabling near optimal returns against arbitrary mixture policies in a game with tractable best-responses. We verify that such policies behave Bayes-optimally under uncertainty and offer insights into using this flexibility at test time. Finally, we offer evidence that learning best-responses to any mixture policies is an effective auxiliary task for strategic exploration, which, by itself, can lead to more performant populations.
    Graph Backup: Data Efficient Backup Exploiting Markovian Transitions. (arXiv:2205.15824v1 [cs.LG])
    The successes of deep Reinforcement Learning (RL) are limited to settings where we have a large stream of online experiences, but applying RL in the data-efficient setting with limited access to online interactions is still challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as $n$-step $Q$-Learning and TD($\lambda$), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state is sampled from. Our method, when combined with popular value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks including MiniGrid, Minatar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.
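For orientation, the multi-step baseline mentioned above, $n$-step $Q$-Learning, bootstraps from a single observed trajectory. The following is a minimal sketch of that standard target only, not of Graph Backup itself; the function name and argument layout are illustrative assumptions.

```python
def n_step_target(rewards, bootstrap_q_values, gamma):
    """Standard n-step Q-learning target (illustrative helper, not from the paper).

    rewards: observed rewards r_t, ..., r_{t+n-1} along one trajectory.
    bootstrap_q_values: current estimates Q(s_{t+n}, a) for every action a.
    gamma: discount factor in [0, 1].
    """
    target = 0.0
    # Discounted sum of the n observed rewards.
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    # Bootstrap from the greedy value at the state reached after n steps.
    target += (gamma ** len(rewards)) * max(bootstrap_q_values)
    return target
```

Because the target depends only on the rewards of the sampled trajectory, two trajectories passing through the same state can yield different targets, which is the instability the abstract contrasts Graph Backup against.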
    Rethinking Graph Neural Networks for Anomaly Detection. (arXiv:2205.15508v1 [cs.LG])
    Graph Neural Networks (GNNs) are widely applied for graph anomaly detection. As one of the key components for GNN design is to select a tailored spectral filter, we take the first step towards analyzing anomalies via the lens of the graph spectrum. Our crucial observation is that the existence of anomalies will lead to the 'right-shift' phenomenon, that is, the spectral energy distribution concentrates less on low frequencies and more on high frequencies. This fact motivates us to propose the Beta Wavelet Graph Neural Network (BWGNN). Indeed, BWGNN has spectral and spatial localized band-pass filters to better handle the 'right-shift' phenomenon in anomalies. We demonstrate the effectiveness of BWGNN on four large-scale anomaly detection datasets. Our code and data are released at https://github.com/squareRoot3/Rethinking-Anomaly-Detection
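The notion of a spectral energy distribution can be illustrated on a toy graph. The sketch below is an assumption-laden illustration, not the authors' code: it uses the unnormalized Laplacian $L = D - A$ (the paper may use a different normalization), a hypothetical ring graph, and hypothetical helper names. It measures how much of a node signal's energy falls on the lowest graph frequencies.

```python
import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of an undirected ring with n nodes (toy graph)."""
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, (i + 1) % n] = 1.0
        adj[(i + 1) % n, i] = 1.0
    return adj

def spectral_energy(adj, x):
    """Return Laplacian eigenvalues and the normalized energy per frequency."""
    lap = np.diag(adj.sum(axis=1)) - adj       # unnormalized Laplacian (assumption)
    eigvals, eigvecs = np.linalg.eigh(lap)     # eigenvalues in ascending order
    x_hat = eigvecs.T @ x                      # graph Fourier transform of x
    energy = x_hat ** 2
    return eigvals, energy / energy.sum()

def low_frequency_share(adj, x, k):
    """Fraction of signal energy carried by the k lowest graph frequencies."""
    _, energy = spectral_energy(adj, x)
    return float(energy[:k].sum())

adj = ring_adjacency(8)
smooth = np.ones(8)          # every node has the same feature value
anomalous = smooth.copy()
anomalous[3] = 5.0           # one node deviates from its neighbours
print(low_frequency_share(adj, smooth, 3), low_frequency_share(adj, anomalous, 3))
```

In this toy case the deviating node moves part of the energy from low to higher frequencies, which is the direction of the shift described in the abstract.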
    SOM-CPC: Unsupervised Contrastive Learning with Self-Organizing Maps for Structured Representations of High-Rate Time Series. (arXiv:2205.15875v1 [cs.LG])
    Continuous monitoring with an ever-increasing number of sensors has become ubiquitous across many application domains. Acquired data are typically high-dimensional and difficult to interpret, but they are also hypothesized to lie on a lower-dimensional manifold. Many deep learning (DL) models aim to identify this manifold, but do not promote structure nor interpretability. We propose the SOM-CPC model, which jointly optimizes Contrastive Predictive Coding (CPC), and a Self-Organizing Map (SOM) to find such an organized manifold. We address a largely unexplored and challenging set of scenarios comprising high-rate time series, and show on synthetic and real-life medical and audio data that SOM-CPC outperforms strong baseline models that combine DL with SOMs. SOM-CPC has great potential to expose latent patterns in high-rate data streams, and may therefore contribute to a better understanding of many different processes and systems.
    Few-Shot Diffusion Models. (arXiv:2205.15463v1 [cs.CV])
    Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.
    Searching for the Essence of Adversarial Perturbations. (arXiv:2205.15357v1 [cs.LG])
    Neural networks have achieved state-of-the-art performance in various machine learning fields, yet adding malicious perturbations to input data (adversarial examples) can fool neural networks' predictions. This would lead to potential risks in real-world applications, for example, auto piloting and facial recognition. However, the reason for the existence of adversarial examples remains controversial. Here we demonstrate that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's erroneous prediction. This concept of human-recognizable information allows us to explain key features related to adversarial perturbations, which include the existence of adversarial examples, the transferability among different neural networks, and the increased neural network interpretability for adversarial training. Two unique properties in adversarial perturbations that fool neural networks are uncovered: masking and generation. A special class, the complementary class, is identified when neural networks classify input images. The human-recognizable information contained in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to techniques that detect or defend against adversarial attacks.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v1 [cs.LG])
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require concept labels in the training data to learn the bottleneck and do not leverage strong pretrained models. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address the limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining interpretability benefits. When concept annotation is not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts. PCBM also enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new (potentially different) data. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using any data from the target domain or model retraining.
    Fairness in the First Stage of Two-Stage Recommender Systems. (arXiv:2205.15436v1 [cs.IR])
    Many large-scale recommender systems consist of two stages, where the first stage focuses on efficiently generating a small subset of promising candidates from a huge pool of items for the second-stage model to curate final recommendations from. In this paper, we investigate how to ensure group fairness to the items in this two-stage paradigm. In particular, we find that existing first-stage recommenders might select an irrecoverably unfair set of candidates such that there is no hope for the second-stage recommender to deliver fair recommendations. To this end, we propose two threshold-policy selection rules that, given any relevance model of queries and items and a point-wise lower confidence bound on the expected number of relevant items for each policy, find near-optimal sets of candidates that contain enough relevant items in expectation from each group of items. To instantiate the rules, we demonstrate how to derive such confidence bounds from potentially partial and biased user feedback data, which are abundant in many large-scale recommender systems. In addition, we provide both finite-sample and asymptotic analysis of how close the two threshold selection rules are to the optimal thresholds. Beyond this theoretical analysis, we show empirically that these two rules can consistently select enough relevant items from each group while minimizing the size of the candidate sets for a wide range of settings.
    Reinforcement Learning with a Terminator. (arXiv:2205.15376v1 [cs.LG])
    We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w.r.t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.
    Segmentation Consistency Training: Out-of-Distribution Generalization for Medical Image Segmentation. (arXiv:2205.15428v1 [cs.CV])
    Generalizability is seen as one of the major challenges in deep learning, in particular in the domain of medical imaging, where a change of hospital or in imaging routines can lead to a complete failure of a model. To tackle this, we introduce Consistency Training, a training procedure and alternative to data augmentation based on maximizing models' prediction consistency across augmented and unaugmented data in order to facilitate better out-of-distribution generalization. To this end, we develop a novel region-based segmentation loss function called Segmentation Inconsistency Loss (SIL), which considers the differences between pairs of augmented and unaugmented predictions and labels. We demonstrate that Consistency Training outperforms conventional data augmentation on several out-of-distribution datasets on polyp segmentation, a popular medical task.
    Neural Optimal Transport with General Cost Functionals. (arXiv:2205.15403v1 [cs.LG])
    We present a novel neural-network-based algorithm to compute optimal transport (OT) plans and maps for general cost functionals. The algorithm is based on a saddle point reformulation of the OT problem and generalizes prior OT methods for weak and strong cost functionals. As an application, we construct a functional to map data distributions while preserving the class-wise structure of data.
    Multiscale modeling of inelastic materials with Thermodynamics-based Artificial Neural Networks (TANN). (arXiv:2108.13137v3 [cond-mat.mtrl-sci] UPDATED)
    The mechanical behavior of inelastic materials with microstructure is very complex and hard to grasp with heuristic, empirical constitutive models. For this reason, multiscale homogenization approaches are often used for performing reliable, accurate predictions of the macroscopic mechanical behavior of solids and structures. Nevertheless, the calculation cost of such approaches is extremely high and prohibitive for real-scale applications involving inelastic materials. Here, we propose the so-called Thermodynamics-based Artificial Neural Networks (TANN) for the constitutive modeling of materials with inelastic and complex microstructure. Our approach integrates thermodynamics-aware dimensionality reduction techniques and thermodynamics-based deep neural networks to identify, in an autonomous way, the constitutive laws and discover the internal state variables of complex inelastic materials. The efficiency and accuracy of TANN in predicting the average and local stress-strain response, the free-energy and the dissipation rate is demonstrated for both regular and perturbed two- and three-dimensional lattice microstructures in inelasticity. TANN manages to identify the internal state variables that characterize the inelastic deformation of the complex microstructural fields. These internal state variables are then used to reconstruct the microdeformation fields of the microstructure at a given state. Finally, a double-scale homogenization scheme (FEMxTANN) is used to solve a large scale boundary value problem. The high performance of the homogenized model using TANN is illustrated through detailed comparisons with microstructural calculations at large scale. An excellent agreement is shown for a variety of monotonic and cyclic stress-strain paths.
    Learning (Very) Simple Generative Models Is Hard. (arXiv:2205.16003v1 [cs.LG])
    Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $F:\mathbb{R}^d\to\mathbb{R}^{d'}$, let $D$ be the distribution over $\mathbb{R}^{d'}$ given by pushing the standard Gaussian $\mathcal{N}(0,\textrm{Id}_d)$ through $F$. Given i.i.d. samples from $D$, the goal is to output any distribution close to $D$ in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $F$ are one-hidden-layer ReLU networks with $\log(d)$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and $\mathrm{poly}(d)$ neurons [Daniely-Vardi '21, Chen-Gollakota-Klivans-Meka '22]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially-bounded slopes such that the pushforward of $\mathcal{N}(0,1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0,1)$.
    Molecular Dipole Moment Learning via Rotationally Equivariant Gaussian Process Regression with Derivatives in Molecular-orbital-based Machine Learning. (arXiv:2205.15510v1 [physics.chem-ph])
    This study extends the accurate and transferable molecular-orbital-based machine learning (MOB-ML) approach to modeling the contribution of electron correlation to dipole moments at the cost of Hartree-Fock computations. A molecular-orbital-based (MOB) pairwise decomposition of the correlation part of the dipole moment is applied, and these pair dipole moments could be further regressed as a universal function of molecular orbitals (MOs). The dipole MOB features consist of the energy MOB features and their responses to electric fields. An interpretable and rotationally equivariant Gaussian process regression (GPR) with derivatives algorithm is introduced to learn the dipole moment more efficiently. The proposed problem setup, feature design, and ML algorithm are shown to provide highly-accurate models for both dipole moment and energies on water and fourteen small molecules. To demonstrate the ability of MOB-ML to function as generalized density-matrix functionals for molecular dipole moments and energies of organic molecules, we further apply the proposed MOB-ML approach to train and test the molecules from the QM9 dataset. The application of local scalable GPR with Gaussian mixture model unsupervised clustering (GMM/GPR) scales up MOB-ML to a large-data regime while retaining the prediction accuracy. In addition, compared with literature results, MOB-ML provides the best test MAEs of 4.21 mDebye and 0.045 kcal/mol for dipole moment and energy models, respectively, when training on 110000 QM9 molecules. The excellent transferability of the resulting QM9 models is also illustrated by the accurate predictions for four different series of peptides.
    Timing is Everything: Learning to Act Selectively with Costly Actions and Budgetary Constraints. (arXiv:2205.15953v1 [cs.LG])
    Many real-world settings involve costs for performing actions; transaction costs in financial systems and fuel costs being common examples. In these settings, performing actions at each time step quickly accumulates costs leading to vastly suboptimal outcomes. Additionally, repeatedly acting produces wear and tear and ultimately, damage. Determining when to act is crucial for achieving successful outcomes and yet, the challenge of efficiently learning to behave optimally when actions incur minimally bounded costs remains unresolved. In this paper, we introduce a reinforcement learning (RL) framework named Learnable Impulse Control Reinforcement Algorithm (LICRA), for learning to optimally select both when to act and which actions to take when actions incur costs. At the core of LICRA is a nested structure that combines RL and a form of policy known as impulse control which learns to maximise objectives when actions incur costs. We prove that LICRA, which seamlessly adopts any RL method, converges to policies that optimally select when to perform actions and their optimal magnitudes. We then augment LICRA to handle problems in which the agent can perform at most $k<\infty$ actions and more generally, faces a budget constraint. We show LICRA learns the optimal value function and ensures budget constraints are satisfied almost surely. We demonstrate empirically LICRA's superior performance against benchmark RL methods in OpenAI gym's Lunar Lander and in Highway environments and a variant of the Merton portfolio problem within finance.
    Posterior and Computational Uncertainty in Gaussian Processes. (arXiv:2205.15449v1 [cs.LG])
    Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
    FedWalk: Communication Efficient Federated Unsupervised Node Embedding with Differential Privacy. (arXiv:2205.15896v1 [cs.DC])
    Node embedding aims to map nodes in a complex graph into low-dimensional representations. Real-world large-scale graphs and the difficulty of labeling motivate wide studies of unsupervised node embedding problems. Nevertheless, previous efforts mostly operate in a centralized setting where a complete graph is given. With the growing awareness of data privacy, data holders who are only aware of one vertex and its neighbours demand greater privacy protection. In this paper, we introduce FedWalk, a random-walk-based unsupervised node embedding algorithm that operates in such a node-level visibility graph with raw graph information remaining local. FedWalk is designed to offer centralized competitive graph representation capability with data privacy protection and great communication efficiency. FedWalk instantiates the prevalent federated paradigm and contains three modules. We first design a hierarchical clustering tree (HCT) constructor to extract the structural feature of each node. A dynamic time warping algorithm seamlessly handles the structural heterogeneity across different nodes. Based on the constructed HCT, we then design a random walk generator, wherein a sequence encoder is designed to preserve privacy and a two-hop neighbor predictor is designed to save communication cost. The generated random walks are then used to update node embedding based on a SkipGram model. Extensive experiments on two large graphs demonstrate that FedWalk achieves representativeness competitive with a centralized node embedding algorithm, with at most 1.8% Micro-F1 score and 4.4% Macro-F1 score loss, while reducing inter-device communication per walk by a factor of about 6.7.
    Fooling SHAP with Stealthily Biased Sampling. (arXiv:2205.15419v1 [cs.LG])
    SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t the background distribution. In the context of fairness audit, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups, while remaining undetected. These results highlight the manipulability of SHAP explanations and encourage auditors to treat post-hoc explanations with skepticism.
    A Unified Weight Initialization Paradigm for Tensorial Convolutional Neural Networks. (arXiv:2205.15307v1 [cs.LG])
    Tensorial Convolutional Neural Networks (TCNNs) have attracted much research attention for their power in reducing model parameters or enhancing the generalization ability. However, exploration of TCNNs is hindered even from weight initialization methods. To be specific, general initialization methods, such as Xavier or Kaiming initialization, usually fail to generate appropriate weights for TCNNs. Meanwhile, although there are ad-hoc approaches for specific architectures (e.g., Tensor Ring Nets), they are not applicable to TCNNs with other tensor decomposition methods (e.g., CP or Tucker decomposition). To address this problem, we propose a universal weight initialization paradigm, which generalizes Xavier and Kaiming methods and can be widely applicable to arbitrary TCNNs. Specifically, we first present the Reproducing Transformation to convert the backward process in TCNNs to an equivalent convolution process. Then, based on the convolution operators in the forward and backward processes, we build a unified paradigm to control the variance of features and gradients in TCNNs. Thus, we can derive fan-in and fan-out initialization for various TCNNs. We demonstrate that our paradigm can stabilize the training of TCNNs, leading to faster convergence and better results.
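As background for the variance-control idea, the two general schemes named above can be written down for an ordinary dense layer. This is a sketch of the standard Xavier (Glorot) and Kaiming (He) normal initializers only, not of the paper's tensorial generalization; function names and the `(fan_out, fan_in)` weight layout are assumptions made for illustration.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Xavier/Glorot normal init: Var[w] = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def kaiming_normal(fan_in, fan_out, rng):
    """Kaiming/He normal init in fan-in mode for ReLU: Var[w] = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
w_xavier = xavier_normal(400, 600, rng)
w_kaiming = kaiming_normal(400, 600, rng)
print(w_xavier.std(), w_kaiming.std())  # empirical standard deviations
```

Both rules pick the weight variance from the layer's fan-in and fan-out so that feature and gradient variances stay roughly constant across layers; the abstract's point is that the appropriate fan-in and fan-out are not obvious once the weight is stored as a tensor decomposition.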
    Painful intelligence: What AI can tell us about human suffering. (arXiv:2205.15409v1 [cs.LG])
    This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected. Frustration is inevitable because of the overwhelming complexity of the world, limited computational resources, and scarcity of good data. In particular, such limitations imply that an agent acting in the real world must cope with uncontrollability, unpredictability, and uncertainty, which all lead to frustration. Fundamental in such modelling is the idea of learning, or adaptation to the environment. While AI uses machine learning, humans and animals adapt by a combination of evolutionary mechanisms and ordinary learning. Even frustration is fundamentally an error signal that the system uses for learning. This book explores various aspects and limitations of learning algorithms and their implications regarding suffering. At the end of the book, the computational theory is used to derive various interventions or training methods that will reduce suffering in humans. The amount of frustration is expressed by a simple equation which indicates how it can be reduced. The ensuing interventions are very similar to those proposed by Buddhist and Stoic philosophy, and include mindfulness meditation. Therefore, this book can be interpreted as an exposition of a computational theory justifying why such philosophies and meditation reduce human suffering.
    Continual Object Detection: A review of definitions, strategies, and challenges. (arXiv:2205.15445v1 [cs.CV])
    The field of Continual Learning investigates the ability to learn consecutive tasks without losing performance on those previously learned. Its focus has been mainly on incremental classification tasks. We believe that research in continual object detection deserves even more attention due to its vast range of applications in robotics and autonomous vehicles. This scenario is more complex than conventional classification given the occurrence of instances of classes that are unknown at the time, but can appear in subsequent tasks as a new class to be learned, resulting in missing annotations and conflicts with the background label. In this review, we analyze the current strategies proposed to tackle the problem of class-incremental object detection. Our main contributions are: (1) a short and systematic review of the methods that propose solutions to traditional incremental object detection scenarios; (2) a comprehensive evaluation of the existing approaches using a new metric to quantify the stability and plasticity of each technique in a standard way; (3) an overview of the current trends within continual object detection and a discussion of possible future research directions.
    Hierarchies of Reward Machines. (arXiv:2205.15752v1 [cs.LG])
    Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine whose edges encode landmarks of the task using high-level events. The structure of RMs enables the decomposition of a task into simpler and independently solvable subtasks that help tackle long-horizon and/or sparse reward tasks. We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs, thus composing a hierarchy of RMs (HRM). We exploit HRMs by treating each call to an RM as an independently solvable subtask using the options framework, and describe a curriculum-based method to induce HRMs from example traces observed by the agent. Our experiments reveal that exploiting a handcrafted HRM leads to faster convergence than with a flat HRM, and that learning an HRM is more scalable than learning an equivalent flat HRM.
    Learning brain MRI quality control: a multi-factorial generalization problem. (arXiv:2205.15898v1 [stat.ML])
    Due to the growing amount of MRI data, automated quality control (QC) has become essential, especially for larger scale analysis. Several attempts have been made to develop reliable and scalable QC pipelines. However, the generalization of these methods to new data independent of those used for learning is a difficult problem because of the biases inherent in MRI data. This work aimed at evaluating the performance of the MRIQC pipeline on various large-scale datasets (ABIDE, N = 1102 and CATI derived datasets, N = 9037) used for both training and evaluation purposes. We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them. We further analyzed the site-wise and study-wise predicted classification probability distributions of the models without preprocessing trained on ABIDE and CATI data. Our main results were that a model using features extracted from MRIQC without preprocessing yielded the best results when trained and evaluated on large multi-center datasets with a heterogeneous population (an improvement of the ROC-AUC score on unseen data of 0.10 for the model trained on a subset of the CATI dataset). We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data. In spite of the performance improvement, the generalization abilities of the models remain questionable when looking at the site-wise/study-wise probability predictions and the optimal classification threshold derived from them.
    Designing Rewards for Fast Learning. (arXiv:2205.15400v1 [cs.LG])
    To convey desired behavior to a Reinforcement Learning (RL) agent, a designer must choose a reward function for the environment, arguably the most important knob designers have in interacting with RL agents. Although many reward functions induce the same optimal behavior (Ng et al., 1999), in practice, some of them result in faster learning than others. In this paper, we look at how reward-design choices impact learning speed and seek to identify principles of good reward design that quickly induce target behavior. This reward-identification problem is framed as an optimization problem: Firstly, we advocate choosing state-based rewards that maximize the action gap, making optimal actions easy to distinguish from suboptimal ones. Secondly, we propose minimizing a measure of the horizon, something we call the "subjective discount", over which rewards need to be optimized to encourage agents to make optimal decisions with less lookahead. To solve this optimization problem, we propose a linear-programming based algorithm that efficiently finds a reward function that maximizes action gap and minimizes subjective discount. We test the rewards generated with the algorithm in tabular environments with Q-Learning, and empirically show they lead to faster learning. Although we only focus on Q-Learning because it is perhaps the simplest and most well understood RL algorithm, preliminary results with R-max (Brafman and Tennenholtz, 2000) suggest our results are much more general. Our experiments support three principles of reward design: 1) consistent with existing results, penalizing each step taken induces faster learning than rewarding the goal. 2) When rewarding subgoals along the target trajectory, rewards should gradually increase as the goal gets closer. 3) Dense reward that's nonzero on every state is only good if designed carefully.
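The action gap referred to above is commonly understood as the difference between the value of the best action and the value of the best remaining action in a state. The following minimal sketch computes it from a tabular Q-function; the helper name and the example table are hypothetical, and the paper's linear program for choosing rewards is not reproduced here.

```python
import numpy as np

def action_gaps(q_table):
    """Per-state gap between the largest and second-largest Q-value.

    q_table: array of shape (num_states, num_actions), num_actions >= 2.
    Tied best actions give a gap of zero under this definition.
    """
    q = np.asarray(q_table, dtype=float)
    sorted_q = np.sort(q, axis=1)            # ascending within each state
    return sorted_q[:, -1] - sorted_q[:, -2]

# Hypothetical Q-table with two states and three actions.
q_example = [[1.0, 0.5, 0.0],
             [2.0, 2.0, 1.0]]
print(action_gaps(q_example))
```

A larger gap means the greedy action is separated from the alternatives by a wider margin, so estimation noise in the Q-values is less likely to flip the greedy choice.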
    Holistic Generalized Linear Models. (arXiv:2205.15447v1 [stat.ML])
    Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The $\textsf{R}$ package $\texttt{holiglm}$ provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art conic mixed-integer solvers, the package can reliably solve GLMs for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the $\texttt{stats::glm()}$ function.
    Predicting Day-Ahead Stock Returns using Search Engine Query Volumes: An Application of Gradient Boosted Decision Trees to the S&P 100. (arXiv:2205.15853v1 [econ.EM])
    The internet has changed the way we live, work and take decisions. As it is the major modern resource for research, detailed data on internet usage exhibits vast amounts of behavioral information. This paper aims to answer the question of whether this information can be used to predict future returns of stocks on financial capital markets. In an empirical analysis it implements gradient boosted decision trees to learn relationships between abnormal returns of stocks within the S&P 100 index and lagged predictors derived from historical financial data, as well as search term query volumes on the internet search engine Google. Models predict the occurrence of day-ahead stock returns in excess of the index median. On a time frame from 2005 to 2017, all disparate datasets exhibit valuable information. Evaluated models have average areas under the receiver operating characteristic between 54.2% and 56.7%, clearly indicating a classification better than random guessing. Implementing a simple statistical arbitrage strategy, models are used to create daily trading portfolios of ten stocks and result in annual performances of more than 57% before transaction costs. With ensembles of different data sets topping up the performance ranking, the results further question the weak form and semi-strong form efficiency of modern financial capital markets. Even though transaction costs are not included, the approach adds to the existing literature. It gives guidance on how to use and transform data on internet usage behavior for financial and economic modeling and forecasting.
    Critic Sequential Monte Carlo. (arXiv:2205.15460v1 [stat.ML])
    We introduce CriticSMC, a new algorithm for planning as inference built from a novel composition of sequential Monte Carlo with learned soft-Q function heuristic factors. This algorithm is structured so as to allow using large numbers of putative particles leading to efficient utilization of computational resource and effective discovery of high reward trajectories even in environments with difficult reward surfaces such as those arising from hard constraints. Relative to prior art our approach is notably still compatible with model-free reinforcement learning in the sense that the implicit policy we produce can be used at test time in the absence of a world model. Our experiments on self-driving car collision avoidance in simulation demonstrate improvements against baselines in terms of infraction minimization relative to computational effort while maintaining diversity and realism of found trajectories.
    PolypConnect: Image inpainting for generating realistic gastrointestinal tract images with polyps. (arXiv:2205.15413v1 [eess.IV])
    Early identification of a polyp in the lower gastrointestinal (GI) tract can lead to prevention of life-threatening colorectal cancer. Developing computer-aided diagnosis (CAD) systems to detect polyps can improve detection accuracy and efficiency and save the time of the domain experts called endoscopists. Lack of annotated data is a common challenge when building CAD systems. Generating synthetic medical data is an active research area to overcome the problem of having relatively few true positive cases in the medical domain. To be able to efficiently train machine learning (ML) models, which are the core of CAD systems, a considerable amount of data should be used. In this respect, we propose the PolypConnect pipeline, which can convert non-polyp images into polyp images to increase the size of training datasets. We present the whole pipeline with quantitative and qualitative evaluations involving endoscopists. The polyp segmentation model trained using both synthetic and real data shows a 5.1% improvement in mean intersection over union (mIOU) compared to the model trained only using real data. The code for all the experiments is available on GitHub to reproduce the results.
    Chefs' Random Tables: Non-Trigonometric Random Features. (arXiv:2205.15317v1 [cs.LG])
    We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.
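    To make the idea of positive random features concrete, here is a minimal sketch of the well-known positive (exponential) random-feature estimator for the softmax kernel exp(x·y), which relies on the identity E_w[exp(w·x − ||x||²/2) exp(w·y − ||y||²/2)] = exp(x·y) for w ~ N(0, I). This is the earlier positive-feature construction used in low-rank Transformers, not the CRT or OPRF estimators introduced in the paper; the function name, vectors and sample count are illustrative assumptions.

```python
import numpy as np

def positive_random_features(x, w):
    """Map x to strictly positive features phi(x) with E[phi(x) . phi(y)] = exp(x . y).

    x: vector of shape (d,); w: Gaussian projections of shape (m, d).
    """
    m = w.shape[0]
    return np.exp(w @ x - 0.5 * np.dot(x, x)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 100_000                      # m random projections (assumed value)
x = np.array([0.2, -0.1, 0.3, 0.0])
y = np.array([0.1, 0.4, -0.2, 0.1])
w = rng.standard_normal((m, d))        # w_i ~ N(0, I_d)

phi_x = positive_random_features(x, w)
phi_y = positive_random_features(y, w)
estimate = float(phi_x @ phi_y)        # Monte Carlo estimate of the kernel
exact = float(np.exp(x @ y))           # softmax kernel value exp(x . y)
print(estimate, exact)
```

    Because every feature is an exponential, the estimate can never be negative, which is the property the abstract highlights as important for low-rank attention.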
    Superposing Many Tickets into One: A Performance Booster for Sparse Neural Network Training. (arXiv:2205.15322v1 [cs.LG])
    Recent works on sparse neural network training (sparse training) have shown that a compelling trade-off between performance and efficiency can be achieved by training intrinsically sparse neural networks from scratch. Existing sparse training methods usually strive to find the best sparse subnetwork possible in one single run, without involving any expensive dense or pre-training steps. For instance, dynamic sparse training (DST), as one of the most prominent directions, is capable of reaching a competitive performance of dense training by iteratively evolving the sparse topology during the course of training. In this paper, we argue that it is better to allocate the limited resources to create multiple low-loss sparse subnetworks and superpose them into a stronger one, instead of allocating all resources entirely to find an individual subnetwork. To achieve this, two desiderata are required: (1) efficiently producing many low-loss subnetworks, the so-called cheap tickets, within one training process limited to the standard training time used in dense training; (2) effectively superposing these cheap tickets into one stronger subnetwork without going over the constrained parameter budget. To corroborate our conjecture, we present a novel sparse training approach, termed Sup-tickets, which can satisfy the above two desiderata concurrently in a single sparse-to-sparse training process. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with the existing sparse training methods and demonstrates consistent performance improvement.
    Infinite-dimensional optimization and Bayesian nonparametric learning of stochastic differential equations. (arXiv:2205.15368v1 [stat.ML])
    The paper has two major themes. The first part of the paper establishes certain general results for infinite-dimensional optimization problems on Hilbert spaces. These results cover the classical representer theorem and many of its variants as special cases and offer a wider scope of applications. The second part of the paper then develops a systematic approach for learning the drift function of a stochastic differential equation by integrating the results of the first part with a Bayesian hierarchical framework. Importantly, our Bayesian approach incorporates low-cost sparse learning through proper use of shrinkage priors while allowing proper quantification of uncertainty through posterior distributions. Several examples at the end illustrate the accuracy of our learning scheme.
    A Design Space for Explainable Ranking and Ranking Models. (arXiv:2205.15305v1 [cs.LG])
    Item ranking systems support users in multi-criteria decision-making tasks. Users need to trust rankings and ranking algorithms to reflect user preferences nicely while avoiding systematic errors and biases. However, today only few approaches help end users, model developers, and analysts to explain rankings. We report on the study of explanation approaches from the perspectives of recommender systems, explainable AI, and visualization research and propose the first cross-domain design space for explainers of item rankings. In addition, we leverage the descriptive power of the design space to characterize a) existing explainers and b) three main user groups involved in ranking explanation tasks. The generative power of the design space is a means for future designers and developers to create more target-oriented solutions in this only weakly exploited space.
    Learning Adaptive Propagation for Knowledge Graph Reasoning. (arXiv:2205.15319v1 [cs.LG])
    Due to the success of Graph Neural Networks (GNNs) in learning from graph-structured data, various GNN-based methods have been introduced to learn from knowledge graphs (KGs). In this paper, to reveal the key factors underneath existing GNN-based methods, we revisit exemplar works from the lens of the propagation path. We find that the answer entity can be close to the queried one, but the information dependency can be long. Thus, better reasoning performance can be obtained by exploring longer propagation paths. However, identifying such a long-range dependency in a KG is hard since the number of involved entities grows exponentially. This motivates us to learn an adaptive propagation path that filters out irrelevant entities while preserving promising targets during the propagation. First, we design an incremental sampling mechanism where the close and promising targets can be preserved. Second, we design a learning-based sampling distribution to identify the targets with fewer involved entities. In this way, the GNN can go deeper to capture long-range information. Extensive experiments show that our method is efficient and achieves state-of-the-art performance in both transductive and inductive reasoning settings, benefiting from the deeper propagation.
    Sepsis Prediction with Temporal Convolutional Networks. (arXiv:2205.15492v1 [cs.LG])
    We design and implement a temporal convolutional network model to predict sepsis onset. Our model is trained on data extracted from the MIMIC III database, based on a retrospective analysis of patients admitted to the intensive care unit who did not fall under the definition of sepsis at the time of admission. Benchmarked against several machine learning models, our model is superior on this binary classification task. It demonstrates the predictive power of convolutional networks for temporal patterns and shows the significant impact of a longer look-back time on sepsis prediction.
    A novel approach to rating transition modelling via Machine Learning and SDEs on Lie groups. (arXiv:2205.15699v1 [q-fin.RM])
    In this paper, we introduce a novel methodology to model rating transitions with a stochastic process. To introduce stochastic processes whose values are valid rating matrices, we note the geometric properties of stochastic matrices and their link to matrix Lie groups. We give a gentle introduction to this topic and demonstrate how It\^o-SDEs in R will generate the desired model for rating transitions. To calibrate the rating model to historical data, we use a Deep-Neural-Network (DNN) called TimeGAN to learn the features of a time series of historical rating matrices. Then, we use this DNN to generate synthetic rating transition matrices. Afterwards, we fit the moments of the generated rating matrices and the rating process at specific time points, which results in a good fit. After calibration, we discuss the quality of the calibrated rating transition process by examining some properties that a time series of rating matrices should satisfy, and we will see that this geometric approach works very well.
    A deep learning approach to halo merger tree construction. (arXiv:2205.15988v1 [astro-ph.GA])
    A key ingredient for semi-analytic models (SAMs) of galaxy formation is the mass assembly history of haloes, encoded in a tree structure. The most commonly used method to construct halo merger histories is based on the outcomes of high-resolution, computationally intensive N-body simulations. We show that machine learning (ML) techniques, in particular Generative Adversarial Networks (GANs), are a promising new tool to tackle this problem with a modest computational cost and retaining the best features of merger trees from simulations. We train our GAN model with a limited sample of merger trees from the EAGLE simulation suite, constructed using two halo finders-tree builder algorithms: SUBFIND-D-TREES and ROCKSTAR-ConsistentTrees. Our GAN model successfully learns to generate well-constructed merger tree structures with high temporal resolution, and to reproduce the statistical features of the sample of merger trees used for training, when considering up to three variables in the training process. These inputs, whose representations are also learned by our GAN model, are mass of the halo progenitors and the final descendant, progenitor type (main halo or satellite) and distance of a progenitor to that in the main branch. The inclusion of the latter two inputs greatly improves the final learned representation of the halo mass growth history, especially for SUBFIND-like ML trees. When comparing equally sized samples of ML merger trees with those of the EAGLE simulation, we find better agreement for SUBFIND-like ML trees. Finally, our GAN-based framework can be utilised to construct merger histories of low and intermediate mass haloes, the most abundant in cosmological simulations.
    Optimistic Whittle Index Policy: Online Learning for Restless Bandits. (arXiv:2205.15372v1 [cs.LG])
    Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. However, solving RMABs requires information on transition dynamics, which is often not available upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we formulate a bilinear program to compute the optimistic Whittle index from the confidence bounds in transition dynamics. Our algorithm, UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including on real-world maternal and childcare data aimed at reducing maternal mortality.
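    As a loose illustration of the confidence bounds on transition dynamics that a UCB-style approach relies on, the sketch below builds a Hoeffding interval around an empirical transition probability. The abstract does not specify which confidence sets UCWhittle uses, so the Hoeffding radius, the function name and the `delta` parameter are all assumptions, and the bilinear program for the optimistic Whittle index is not shown.

```python
import math

def transition_confidence_interval(successes, visits, delta=0.05):
    """Hoeffding interval for an unknown transition probability.

    successes: number of times the transition of interest was observed
    visits:    number of times the (state, action) pair was tried
    delta:     allowed failure probability of the interval (assumed parameter)
    """
    if visits == 0:
        return 0.0, 1.0                      # nothing observed yet: trivial bounds
    p_hat = successes / visits
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * visits))
    return max(0.0, p_hat - radius), min(1.0, p_hat + radius)

low, high = transition_confidence_interval(successes=30, visits=100)
print(low, high)
```

    An optimistic planner would then pick, within each interval, the transition probabilities that make an arm look most valuable; the interval shrinks as the pair is visited more often.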
    Gluing Neural Networks Symbolically Through Hyperdimensional Computing. (arXiv:2205.15534v1 [cs.SC])
    Hyperdimensional Computing affords simple, yet powerful operations to create long Hyperdimensional Vectors (hypervectors) that can efficiently encode information, be used for learning, and are dynamic enough to be modified on the fly. In this paper, we explore the notion of using binary hypervectors to directly encode the final, classifying output signals of neural networks in order to fuse differing networks together at the symbolic level. This allows multiple neural networks to work together to solve a problem, with little additional overhead. Output signals just before classification are encoded as hypervectors and bundled together through consensus summation to train a classification hypervector. This process can be performed iteratively and even on single neural networks by instead making a consensus of multiple classification hypervectors. We find that this outperforms the state of the art, or is on a par with it, while using very little overhead, as hypervector operations are extremely fast and efficient in comparison to the neural networks. This consensus process can learn online and even grow or lose models in real time. Hypervectors act as memories that can be stored, and even further bundled together over time, affording life long learning capabilities. Additionally, this consensus structure inherits the benefits of Hyperdimensional Computing, without sacrificing the performance of modern Machine Learning. This technique can be extrapolated to virtually any neural model, and requires little modification to employ - one simply requires recording the output signals of networks when presented with a testing example.
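    The bundling operation referred to above can be sketched with binary hypervectors and a bitwise majority vote. This is a generic hyperdimensional-computing example, not the paper's pipeline for encoding network outputs; the dimension, helper names and the use of Hamming similarity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000  # hypervector length (assumed value)

def random_hypervector():
    """Draw a random binary hypervector with i.i.d. fair bits."""
    return rng.integers(0, 2, size=DIM, dtype=np.uint8)

def bundle(hypervectors):
    """Bitwise majority vote (consensus sum) over an odd number of hypervectors."""
    stacked = np.stack(hypervectors)
    return (stacked.sum(axis=0) * 2 > len(hypervectors)).astype(np.uint8)

def hamming_similarity(a, b):
    """Fraction of positions where two hypervectors agree."""
    return 1.0 - float(np.mean(a != b))

a, b, c = (random_hypervector() for _ in range(3))
consensus = bundle([a, b, c])
unrelated = random_hypervector()

print(hamming_similarity(consensus, a))          # well above 0.5
print(hamming_similarity(consensus, unrelated))  # close to 0.5
```

    The bundled vector stays measurably similar to each of its inputs while remaining near-orthogonal to unrelated vectors, which is what lets a single hypervector act as a memory of several signals.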
    Comparing interpretation methods in mental state decoding analyses with deep learning models. (arXiv:2205.15581v1 [q-bio.NC])
    Deep learning (DL) methods find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (such as accepting or rejecting a gamble) and brain activity, by identifying those brain regions (and networks) whose activity allows these states to be accurately identified (i.e., decoded). Once DL models have been trained to accurately decode a set of mental states, neuroimaging researchers often make use of interpretation methods from explainable artificial intelligence research to understand their learned mappings between mental states and brain activity. Here, we compare the explanations of prominent interpretation methods for the mental state decoding decisions of DL models trained on three functional Magnetic Resonance Imaging (fMRI) datasets. We find that interpretation methods that capture the model's decision process well, by producing faithful explanations, generally produce explanations that are less in line with the results of standard analyses of the fMRI data, when compared to the explanations of interpretation methods with less explanation faithfulness. Specifically, we find that interpretation methods that focus on how sensitively a model's decoding decision changes with the values of the input produce explanations that better match the results of a standard general linear model analysis of the fMRI data, while interpretation methods that focus on identifying the specific contribution of an input feature's value to the decoding decision produce overall more faithful explanations that align less well with the results of standard analyses of the fMRI data.
    Knowledge Enhanced Neural Networks for relational domains. (arXiv:2205.15762v1 [cs.LG])
    In the recent past, there has been a growing interest in Neural-Symbolic Integration frameworks, i.e., hybrid systems that integrate connectionist and symbolic approaches to obtain the best of both worlds. In this work we focus on a specific method, KENN (Knowledge Enhanced Neural Networks), a Neural-Symbolic architecture that injects prior logical knowledge into a neural network by adding on its top a residual layer that modifies the initial predictions according to the knowledge. Among the advantages of this strategy is the inclusion of clause weights, learnable parameters that represent the strength of the clauses, meaning that the model can learn the impact of each rule on the final predictions. As a special case, if the training data contradicts a constraint, KENN learns to ignore it, making the system robust to the presence of wrong knowledge. In this paper, we propose an extension of KENN for relational data. One of the main advantages of KENN resides in its scalability, thanks to a flexible treatment of dependencies between the rules obtained by stacking multiple logical layers. We show experimentally the efficacy of this strategy. The results show that KENN is capable of increasing the performance of the underlying neural network, obtaining better or comparable accuracies with respect to two other related methods that combine learning with logic, while requiring significantly less time for learning.
    VQ-AR: Vector Quantized Autoregressive Probabilistic Time Series Forecasting. (arXiv:2205.15894v1 [cs.LG])
    Time series models aim for accurate predictions of the future given the past, where the forecasts are used for important downstream tasks like business decision making. In practice, deep learning based time series models come in many forms, but at a high level learn some continuous representation of the past and use it to output point or probabilistic forecasts. In this paper, we introduce a novel autoregressive architecture, VQ-AR, which instead learns a discrete set of representations that are used to predict the future. Extensive empirical comparison with other competitive deep learning models shows that surprisingly such a discrete set of representations gives state-of-the-art or equivalent results on a wide variety of time series datasets. We also highlight the shortcomings of this approach, explore its zero-shot generalization capabilities, and present an ablation study on the number of representations. The full source code of the method will be available at the time of publication with the hope that researchers can further investigate this important but overlooked inductive bias for the time series domain.
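    A discrete set of representations is usually obtained by vector quantization, that is, by snapping each continuous latent vector to its nearest entry in a codebook. The sketch below shows only that lookup step with a hand-written codebook; how VQ-AR trains its codebook or feeds the codes to the autoregressive model is not described in the abstract, so the names and values here are hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Replace each latent vector by its nearest codebook entry (Euclidean).

    latents:  array of shape (n, d)
    codebook: array of shape (k, d)
    Returns the chosen indices (n,) and the quantized vectors (n, d).
    """
    # Pairwise squared distances between latents and codebook entries, shape (n, k).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Hypothetical 3-entry codebook and three continuous latents.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, quantized = quantize(latents, codebook)
print(indices)
```

    The number of codebook entries `k` corresponds to the "number of representations" that the abstract says is studied in an ablation.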
    Static Scheduling with Predictions Learned through Efficient Exploration. (arXiv:2205.15695v1 [cs.LG])
    A popular approach to go beyond the worst-case analysis of online algorithms is to assume the existence of predictions that can be leveraged to improve performances. Those predictions are usually given by some external sources that cannot be fully trusted. Instead, we argue that trustful predictions can be built by algorithms, while they run. We investigate this idea in the illustrative context of static scheduling with exponential job sizes. Indeed, we prove that algorithms agnostic to this structure do not perform better than in the worst case. In contrast, when the expected job sizes are known, we show that the best algorithm using this information, called Follow-The-Perfect-Prediction (FTPP), exhibits much better performances. Then, we introduce two adaptive explore-then-commit types of algorithms: they both first (partially) learn expected job sizes and then follow FTPP once their self-predictions are confident enough. On the one hand, ETCU explores in "series", by completing jobs sequentially to acquire information. On the other hand, ETCRR, inspired by the optimal worst-case algorithm Round-Robin (RR), explores efficiently in "parallel". We prove that both of them asymptotically reach the performances of FTPP, with a faster rate for ETCRR. Those findings are empirically evaluated on synthetic data.
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v1 [cs.LG])
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with only a few data points. The main challenge is how to avoid overfitting, since over-parameterized NNs can easily overfit to such small datasets. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from a few data points by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by endowing sparse NN structures like convolution layers in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discovers sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods, especially with large NNs.
    Optimizing Intermediate Representations of Generative Models for Phase Retrieval. (arXiv:2205.15617v1 [cs.LG])
    Phase retrieval is the problem of reconstructing images from magnitude-only measurements. In many real-world applications the problem is underdetermined. When training data is available, generative models are a new idea to constrain the solution set. However, not all possible solutions are within the range of the generator. Instead, they are represented with some error. To reduce this representation error in the context of phase retrieval, we first leverage a novel variation of intermediate layer optimization (ILO) to extend the range of the generator while still producing images consistent with the training data. Second, we introduce new initialization schemes that further improve the quality of the reconstruction. With extensive experiments on Fourier and Gaussian phase retrieval problems and thorough ablation studies, we can show the benefits of our modified ILO and the new initialization schemes.
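    One way to see why magnitude-only measurements leave the problem underdetermined: in the Fourier setting, a circular shift of the image changes only the phase of its transform, so two different images yield identical measurements. The short check below demonstrates this standard fact with NumPy; the image size and shift are arbitrary, and the example is unrelated to the paper's ILO-based method.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))                              # arbitrary test image
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))     # circularly shifted copy

# Fourier phase retrieval observes only the magnitudes |F(x)|.
magnitudes = np.abs(np.fft.fft2(image))
magnitudes_shifted = np.abs(np.fft.fft2(shifted))

print(np.allclose(magnitudes, magnitudes_shifted))  # same measurements
print(np.allclose(image, shifted))                  # different images
```

    A prior such as a generative model trained on example images is one way to choose among the many images consistent with the same magnitudes.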
    Exact Feature Collisions in Neural Networks. (arXiv:2205.15763v1 [cs.LG])
    Predictions made by deep neural networks were shown to be highly sensitive to small changes made in the input space, where such maliciously crafted data points containing small perturbations are referred to as adversarial examples. On the other hand, recent research suggests that the same networks can also be extremely insensitive to changes of large magnitude, where predictions of two largely different data points can be mapped to approximately the same output. In such cases, features of two data points are said to approximately collide, thus leading to largely similar predictions. Our results improve and extend the work of Li et al. (2019), laying out theoretical grounds for the data points that have colliding features from the perspective of weights of neural networks, revealing that neural networks not only suffer from features that approximately collide but also suffer from features that exactly collide. We identify the necessary conditions for the existence of such scenarios, thereby investigating a large number of DNNs that have been used to solve various computer vision problems. Furthermore, we propose the Null-space search, a numerical approach that does not rely on heuristics, to create data points with colliding features for any input and for any task, including, but not limited to, classification, localization, and segmentation.
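    The linear-algebra fact behind exact collisions can be shown on a single linear layer: if the weight matrix has a nontrivial null space, adding any null-space vector to the input leaves the layer's output unchanged. The sketch below illustrates only this fact with a random matrix; it is not the paper's Null-space search procedure, and the shapes, scale factor and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))     # layer mapping 5 inputs to 3 features
x = rng.standard_normal(5)          # original input

# Rows of vt beyond the rank of W form an orthonormal basis of its null space.
_, _, vt = np.linalg.svd(W)
rank = np.linalg.matrix_rank(W)
null_basis = vt[rank:]

# Move far along a null-space direction: very different input, same features.
x_collide = x + 10.0 * null_basis[0]

print(np.linalg.norm(x - x_collide))        # large input difference
print(np.allclose(W @ x, W @ x_collide))    # identical layer output
```

    Any layers applied after this one see exactly the same features for both inputs, so the final prediction is also identical.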
    Machine learning a manifold. (arXiv:2112.07673v2 [hep-ph] UPDATED)
    We propose a simple method to identify a continuous Lie algebra symmetry in a dataset through regression by an artificial neural network. Our proposal takes advantage of the $ \mathcal{O}(\epsilon^2)$ scaling of the output variable under infinitesimal symmetry transformations on the input variables. As symmetry transformations are generated post-training, the methodology does not rely on sampling of the full representation space or binning of the dataset, and the possibility of false identification is minimised. We demonstrate our method in the SU(3)-symmetric (non-) linear $\Sigma$ model.
    Provable General Function Class Representation Learning in Multitask Bandits and MDPs. (arXiv:2205.15701v1 [cs.LG])
    While multitask representation learning has become a popular approach in reinforcement learning (RL) to boost the sample efficiency, the theoretical understanding of why and how it works is still limited. Most previous analytical works could only assume that the representation function is already known to the agent or from linear function class, since analyzing general function class representation encounters non-trivial technical obstacles such as generalization guarantee, formulation of confidence bound in abstract function space, etc. However, linear-case analysis heavily relies on the particularity of linear function class, while real-world practice usually adopts general non-linear representation functions like neural networks. This significantly reduces its applicability. In this work, we extend the analysis to general function class representations. Specifically, we consider an agent playing $M$ contextual bandits (or MDPs) concurrently and extracting a shared representation function $\phi$ from a specific function class $\Phi$ using our proposed Generalized Functional Upper Confidence Bound algorithm (GFUCB). We theoretically validate the benefit of multitask representation learning within general function class for bandits and linear MDP for the first time. Lastly, we conduct experiments to demonstrate the effectiveness of our algorithm with neural net representation.
    MACE: An Efficient Model-Agnostic Framework for Counterfactual Explanation. (arXiv:2205.15540v1 [cs.AI])
    Counterfactual explanation is an important Explainable AI technique to explain machine learning predictions. Despite being studied actively, existing optimization-based methods often assume that the underlying machine-learning model is differentiable and treat categorical attributes as continuous ones, which restricts their real-world applications when categorical attributes have many different values or the model is non-differentiable. To make counterfactual explanation suitable for real-world applications, we propose a novel framework of Model-Agnostic Counterfactual Explanation (MACE), which adopts a newly designed pipeline that can efficiently handle non-differentiable machine-learning models on a large number of feature values. In our MACE approach, we propose a novel RL-based method for finding good counterfactual examples and a gradient-less descent method for improving proximity. Experiments on public datasets validate the effectiveness with better validity, sparsity and proximity.
    Generalised Implicit Neural Representations. (arXiv:2205.15674v1 [cs.LG])
    We consider the problem of learning implicit neural representations (INRs) for signals on non-Euclidean domains. In the Euclidean case, INRs are trained on a discrete sampling of a signal over a regular lattice. Here, we assume that the continuous signal exists on some unknown topological space from which we sample a discrete graph. In the absence of a coordinate system to identify the sampled nodes, we propose approximating their location with a spectral embedding of the graph. This allows us to train INRs without knowing the underlying continuous domain, which is the case for most graph signals in nature, while also making the INRs equivariant under the symmetry group of the domain. We show experiments with our method on various real-world signals on non-Euclidean domains.
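    A spectral embedding of a graph can be sketched by taking the eigenvectors of the graph Laplacian with the smallest nonzero eigenvalues as node coordinates. The example below uses the unnormalized Laplacian of a small ring graph and assumes the graph is connected; the abstract does not say which Laplacian or how many eigenvectors the authors use, so these choices and the function name are illustrative.

```python
import numpy as np

def spectral_embedding(adjacency, k):
    """Coordinates from the k Laplacian eigenvectors after the constant one.

    adjacency: symmetric (n, n) matrix of a connected graph.
    """
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a      # unnormalized Laplacian L = D - A
    _, eigvecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                  # skip the constant eigenvector

# Ring graph with 6 nodes: each node is linked to its two neighbours.
n = 6
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1.0

coords = spectral_embedding(ring, k=2)
print(coords)   # nodes are placed on a circle in the plane
```

    These coordinates could then stand in for the missing spatial coordinates as input to a coordinate-based network that predicts the signal value at each node.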
    itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection. (arXiv:2205.15531v1 [cs.CV])
    Recently, point-cloud based 3D object detectors have achieved remarkable progress. However, most studies are limited to the development of deep learning architectures for improving only their accuracy. In this paper, we propose an autoencoder-style framework comprising channel-wise compression and decompression via interchange transfer for knowledge distillation. To learn the map-view feature of a teacher network, the features from a teacher and student network are independently passed through the shared autoencoder; here, we use a compressed representation loss that binds the channel-wise compression knowledge from both the networks as a kind of regularization. The decompressed features are transferred in opposite directions to reduce the gap in the interchange reconstructions. Lastly, we present an attentive head loss for matching the pivotal detection information drawn by the multi-head self-attention mechanism. Through extensive experiments, we verify that our method can learn a lightweight model that is well-aligned with the 3D point cloud detection task and we demonstrate its superiority using the well-known public datasets Waymo and nuScenes.
    StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis. (arXiv:2205.15439v1 [eess.AS])
    Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of the speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicitly labeling these categories.
    Graph-level Neural Networks: Current Progress and Future Directions. (arXiv:2205.15555v1 [cs.LG])
    Graph-structured data consisting of objects (i.e., nodes) and relationships among objects (i.e., edges) are ubiquitous. Graph-level learning studies a collection of graphs instead of a single graph. Traditional graph-level learning methods used to be the mainstream. However, with the increasing scale and complexity of graphs, Graph-level Neural Networks (GLNNs, deep learning-based graph-level learning methods) have become attractive due to their superiority in modeling high-dimensional data. Thus, a survey on GLNNs is necessary. To frame this survey, we propose a systematic taxonomy covering GLNNs built upon deep neural networks, graph neural networks, and graph pooling. The survey focuses on the representative and state-of-the-art models in each category. We also investigate the reproducibility, benchmarks, and new graph datasets of GLNNs. Finally, we conclude with future directions to further push forward GLNNs. The repository of this survey is available at https://github.com/GeZhangMQ/Awesome-Graph-level-Neural-Networks.
    Augmentation-Aware Self-Supervision for Data-Efficient GAN Training. (arXiv:2205.15677v1 [cs.LG])
    Training generative adversarial networks (GANs) with limited data is valuable but challenging because discriminators are prone to over-fitting in such situations. Recently proposed differentiable data augmentation techniques for discriminators demonstrate improved data efficiency of training GANs. However, the naive data augmentation introduces undesired invariance to augmentation into the discriminator. The invariance may degrade the representation learning ability of the discriminator, thereby affecting the generative modeling performance of the generator. To mitigate the invariance while inheriting the benefits of data augmentation, we propose a novel augmentation-aware self-supervised discriminator that predicts the parameter of augmentation given the augmented and original data. Moreover, the prediction task is required to distinguish between real data and generated data since they are different during training. We further encourage the generator to learn from the proposed discriminator by generating augmentation-predictable real data. We compare the proposed method with state-of-the-art methods across the class-conditional BigGAN and unconditional StyleGAN2 architectures on CIFAR-10/100 and several low-shot datasets, respectively. Experimental results show a significantly improved generation performance of our method over competing methods for training data-efficient GANs.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v1 [cs.LG])
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O}\left( \min\left( H^{3/2} / N,\ H / \sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
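    As a reading aid for the rate quoted in the abstract above, the following short derivation shows where the two terms inside the minimum cross. It is plain algebra on the stated expressions (with $H, N > 0$) and is not taken from the paper itself.

```latex
\frac{H^{3/2}}{N} \le \frac{H}{\sqrt{N}}
\;\iff\; \sqrt{H} \le \sqrt{N}
\;\iff\; N \ge H,
\qquad\text{hence}\qquad
\min\!\left(\frac{H^{3/2}}{N},\ \frac{H}{\sqrt{N}}\right)
=
\begin{cases}
  H^{3/2}/N, & N \ge H,\\[2pt]
  H/\sqrt{N}, & N < H.
\end{cases}
```

    In other words, once the expert dataset is at least as large as the horizon, the $H^{3/2}/N$ term is the binding one, and it is a factor of $\sqrt{H}$ smaller than the $H^2/N$ scaling quoted for behavioral cloning.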
    Certifying Some Distributional Fairness with Subpopulation Decomposition. (arXiv:2205.15494v1 [cs.LG])
    Extensive efforts have been made to understand and improve the fairness of machine learning models based on observational metrics, especially in high-stakes domains such as medical insurance, education, and hiring decisions. However, there is a lack of certified fairness considering the end-to-end performance of an ML model. In this paper, we first formulate the certified fairness of an ML model trained on a given data distribution as an optimization problem based on the model performance loss bound on a fairness constrained distribution, which is within bounded distributional distance with the training distribution. We then propose a general fairness certification framework and instantiate it for both sensitive shifting and general shifting scenarios. In particular, we propose to solve the optimization problem by decomposing the original data distribution into analytical subpopulations and proving the convexity of the subproblems to solve them. We evaluate our certified fairness on six real-world datasets and show that our certification is tight in the sensitive shifting scenario and provides non-trivial certification under general shifting. Our framework is flexible to integrate additional non-skewness constraints and we show that it provides even tighter certification under different real-world scenarios. We also compare our certified fairness bound with adapted existing distributional robustness bounds on Gaussian data and demonstrate that our method is significantly tighter.
    FBM: Fast-Bit Allocation for Mixed-Precision Quantization. (arXiv:2205.15437v1 [cs.LG])
    Quantized neural networks are well known for reducing latency, power consumption, and model size without significant degradation in accuracy, making them highly applicable for systems with limited resources and low power requirements. Mixed precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths. Existing mixed-precision schemes rely on having a high exploration space, resulting in a large carbon footprint. In addition, these bit allocation strategies mostly induce constraints on the model size rather than utilizing the performance of neural network deployment on specific hardware. Our work proposes Fast-Bit Allocation for Mixed-Precision Quantization (FBM), which finds an optimal bitwidth allocation by measuring desired behaviors through a simulation of a specific device, or even on a physical one. While dynamic transitions of bit allocation in mixed precision quantization with ultra-low bitwidth are known to suffer from performance degradation, we present a fast recovery solution from such transitions. A comprehensive evaluation of the proposed method on CIFAR-10 and ImageNet demonstrates our method's superiority over current state-of-the-art schemes in terms of the trade-off between neural network accuracy and hardware efficiency. Our source code, experimental settings and quantized models are available at https://github.com/RamorayDrake/FBM/
    Connecting adversarial attacks and optimal transport for domain adaptation. (arXiv:2205.15424v1 [cs.LG])
    We present a novel algorithm for domain adaptation using optimal transport. In domain adaptation, the goal is to adapt a classifier trained on the source domain samples to the target domain. In our method, we use optimal transport to map target samples to the domain named source fiction. This domain differs from the source but is accurately classified by the source domain classifier. Our main idea is to generate a source fiction by c-cyclically monotone transformation over the target domain. If samples with the same labels in two domains are c-cyclically monotone, the optimal transport map between these domains preserves the class-wise structure, which is the main goal of domain adaptation. To generate a source fiction domain, we propose an algorithm that is based on our finding that adversarial attacks are a c-cyclically monotone transformation of the dataset. We conduct experiments on Digits and Modern Office-31 datasets and achieve improvement in performance for simple discrete optimal transport solvers for all adaptation tasks.
    Private Federated Submodel Learning with Sparsification. (arXiv:2205.15992v1 [cs.IT])
    We investigate the problem of private read update write (PRUW) in federated submodel learning (FSL) with sparsification. In FSL, a machine learning model is divided into multiple submodels, where each user updates only the submodel that is relevant to the user's local data. PRUW is the process of privately performing FSL by reading from and writing to the required submodel without revealing the submodel index, or the values of updates to the databases. Sparsification is a widely used concept in learning, where the users update only a small fraction of parameters to reduce the communication cost. Revealing the coordinates of these selected (sparse) updates leaks privacy of the user. We show how PRUW in FSL can be performed with sparsification. We propose a novel scheme which privately reads from and writes to arbitrary parameters of any given submodel, without revealing the submodel index, values of updates, or the coordinates of the sparse updates, to databases. The proposed scheme achieves significantly lower reading and writing costs compared to what is achieved without sparsification.
    A Topological Perspective on Causal Inference. (arXiv:2107.08558v3 [cs.AI] UPDATED)
    This paper presents a topological learning-theoretic perspective on causal inference by introducing a series of topologies defined on general spaces of structural causal models (SCMs). As an illustration of the framework we prove a topological causal hierarchy theorem, showing that substantive assumption-free causal inference is possible only in a meager set of SCMs. Thanks to a known correspondence between open sets in the weak topology and statistically verifiable hypotheses, our results show that inductive assumptions sufficient to license valid causal inferences are statistically unverifiable in principle. Similar to no-free-lunch theorems for statistical inference, the present results clarify the inevitability of substantial assumptions for causal inference. An additional benefit of our topological approach is that it easily accommodates SCMs with infinitely many variables. We finally suggest that the framework may be helpful for the positive project of exploring and assessing alternative causal-inductive assumptions.
    Grid HTM: Hierarchical Temporal Memory for Anomaly Detection in Videos. (arXiv:2205.15407v1 [cs.CV])
    Interest in video anomaly detection systems has grown over the past few years. The current approaches use deep learning to perform anomaly detection in videos, but this approach has multiple problems. For starters, deep learning in general has issues with noise, concept drift, explainability, and training data volumes. Additionally, anomaly detection in itself is a complex task and faces challenges such as unknownness, heterogeneity, and class imbalance. Anomaly detection using deep learning is therefore mainly constrained to generative models such as generative adversarial networks and autoencoders due to their unsupervised nature, but even they suffer from general deep learning issues and are hard to train properly. In this paper, we explore the capabilities of the Hierarchical Temporal Memory (HTM) algorithm to perform anomaly detection in videos, as it has favorable properties such as noise tolerance and online learning, which combats concept drift. We introduce a novel version of HTM, namely, Grid HTM, which is an HTM-based architecture specifically for anomaly detection in complex videos such as surveillance footage.
    GLDQN: Explicitly Parameterized Quantile Reinforcement Learning for Waste Reduction. (arXiv:2205.15455v1 [cs.LG])
    We study the problem of restocking a grocery store's inventory with perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, with uncertainty about the actual consumption by customers. This problem is highly relevant today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent's actions, making the environment partially observable. We introduce a new reinforcement learning environment based on real grocery store data and expert knowledge. This environment is highly stochastic, and presents a unique challenge for reinforcement learning practitioners. We show that uncertainty about the future behavior of the environment is not handled well by classical supply chain algorithms, and that distributional approaches are a good way to account for the uncertainty. We also present GLDQN, a new distributional reinforcement learning algorithm that learns a generalized lambda distribution over the reward space. We show that GLDQN outperforms other distributional reinforcement learning approaches in our partially observable environments, in both overall reward and generated waste.
    Associative Learning Mechanism for Drug-Target Interaction Prediction. (arXiv:2205.15364v1 [q-bio.BM])
    As a necessary process in drug development, finding a drug compound that can selectively bind to a specific protein is highly challenging and costly. Drug-target affinity (DTA), which represents the strength of drug-target interaction (DTI), has played an important role in the DTI prediction task over the past decade. Although deep learning has been applied to DTA-related research, existing solutions ignore fundamental correlations between molecular substructures in molecular representation learning of drug compound molecules/protein targets. Moreover, traditional methods lack the interpretability of the DTA prediction process. This results in missing feature information of intermolecular interactions, thereby affecting prediction performance. Therefore, this paper proposes a DTA prediction method with interactive learning and an autoencoder mechanism. The proposed model enhances the corresponding ability to capture the feature information of a single molecular sequence by the drug/protein molecular representation learning module and supplements the information interaction between molecular sequence pairs by the interactive information learning module. The DTA value prediction module fuses the drug-target pair interaction information to output the predicted value of DTA. Additionally, this paper theoretically proves that the proposed method maximizes evidence lower bound (ELBO) for the joint distribution of the DTA prediction model, which enhances the consistency of the probability distribution between the actual value and the predicted value. The experimental results confirm mutual transformer-drug target affinity (MT-DTA) achieves better performance than other comparative methods.
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v1 [cs.LG])
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
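    The Maximum Sample Reuse idea mentioned in the abstract above can be illustrated with a small sketch. It assumes the commonly described form of the estimator: draw subsets uniformly at random, evaluate the utility once per subset, and for each data point take the mean utility of subsets containing it minus the mean utility of subsets excluding it. The function names (`msr_banzhaf`, `sample_subsets`, `all_subsets`) and the toy additive utility are hypothetical and are not taken from the paper's code.

```python
import random
from itertools import product


def sample_subsets(n, m, rng):
    """Draw m subsets of {0..n-1}; each point is included with probability 1/2,
    which makes every subset equally likely."""
    return [frozenset(i for i in range(n) if rng.random() < 0.5) for _ in range(m)]


def all_subsets(n):
    """Enumerate all 2^n subsets (only feasible for tiny n; used for checking)."""
    return [frozenset(i for i, bit in enumerate(bits) if bit)
            for bits in product([0, 1], repeat=n)]


def msr_banzhaf(n, utility, subsets):
    """Estimate Banzhaf values, reusing every evaluated subset for all n points."""
    sum_in, sum_out = [0.0] * n, [0.0] * n
    cnt_in, cnt_out = [0] * n, [0] * n
    for s in subsets:
        u = utility(s)  # one utility evaluation per sampled subset
        for i in range(n):
            if i in s:
                sum_in[i] += u
                cnt_in[i] += 1
            else:
                sum_out[i] += u
                cnt_out[i] += 1
    values = []
    for i in range(n):
        if cnt_in[i] == 0 or cnt_out[i] == 0:
            values.append(float("nan"))  # point never (or always) sampled
        else:
            values.append(sum_in[i] / cnt_in[i] - sum_out[i] / cnt_out[i])
    return values


if __name__ == "__main__":
    weights = [1.0, 2.0, 4.0]            # toy additive utility (assumption)
    utility = lambda s: sum(weights[i] for i in s)
    rng = random.Random(0)
    print(msr_banzhaf(3, utility, sample_subsets(3, 2000, rng)))
```

    For an additive utility the Banzhaf value of each point equals its weight, so the sampled estimate should land near `[1.0, 2.0, 4.0]`, and feeding in `all_subsets(3)` instead of random samples recovers the weights up to floating-point error.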
    Truly Deterministic Policy Optimization. (arXiv:2205.15379v1 [cs.AI])
    In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local rewards in the frequency domain and the other with a long horizon (8000 time-steps) -- for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). Our implementation with all the experimental settings is available at https://github.com/ehsansaleh/code_tdpo
    Learning Risk-Averse Equilibria in Multi-Agent Systems. (arXiv:2205.15434v1 [cs.LG])
    In multi-agent systems, intelligent agents are tasked with making decisions that have optimal outcomes when the actions of the other agents are as expected, whilst also being prepared for unexpected behaviour. In this work, we introduce a new risk-averse solution concept that allows the learner to accommodate unexpected actions by finding the minimum variance strategy given any level of expected return. We prove the existence of such a risk-averse equilibrium, and propose one fictitious-play type learning algorithm for smaller games that enjoys provable convergence guarantees in certain game classes (e.g., zero-sum or potential). Furthermore, we propose an approximation method for larger games based on iterative population-based training that generates a population of risk-averse agents. Empirically, our equilibrium is shown to be able to reduce the reward variance, specifically in the sense that off-equilibrium behaviour has a far smaller impact on our risk-averse agents in comparison to playing other equilibrium solutions. Importantly, we show that our population of agents that approximate a risk-averse equilibrium is particularly effective in the presence of unseen opposing populations, especially in the case of guaranteeing a minimal level of performance which is critical to safety-aware multi-agent systems.
    Improvements to Supervised EM Learning of Shared Kernel Models by Feature Space Partitioning. (arXiv:2205.15304v1 [cs.LG])
    Expectation maximisation (EM) is usually thought of as an unsupervised learning method for estimating the parameters of a mixture distribution; however, it can also be used for supervised learning when class labels are available. As such, EM has been applied to train neural nets including the probabilistic radial basis function (PRBF) network or shared kernel (SK) model. This paper addresses two major shortcomings of previous work in this area: the lack of rigour in the derivation of the EM training algorithm; and the computational complexity of the technique, which has limited it to low dimensional data sets. We first present a detailed derivation of EM for the Gaussian shared kernel model PRBF classifier, making use of data association theory to obtain the complete data likelihood, Baum's auxiliary function (the E-step) and its subsequent maximisation (M-step). To reduce complexity of the resulting SKEM algorithm, we partition the feature space into $R$ non-overlapping subsets of variables. The resulting product decomposition of the joint data likelihood, which is exact when the feature partitions are independent, allows the SKEM to be implemented in parallel and at $R^2$ times lower complexity. The operation of the partitioned SKEM algorithm is demonstrated on the MNIST data set and compared with its non-partitioned counterpart. We find that improved performance at reduced complexity is achievable. Comparisons with standard classification algorithms are provided on a number of other benchmark data sets.
    Parameter-Efficient and Student-Friendly Knowledge Distillation. (arXiv:2205.15308v1 [cs.LG])
    Knowledge distillation (KD) has been extensively employed to transfer the knowledge from a large teacher model to smaller students, where the parameters of the teacher are fixed (or partially fixed) during training. Recent studies show that this mode may cause difficulties in knowledge transfer due to the mismatched model capacities. To alleviate the mismatch problem, teacher-student joint training methods, e.g., online distillation, have been proposed, but they always incur expensive computational cost. In this paper, we present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer by updating relatively few partial parameters. Technically, we first mathematically formulate the mismatch as the sharpness gap between their predictive distributions, where we show such a gap can be narrowed with the appropriate smoothness of the soft label. Then, we introduce an adapter module for the teacher and only update the adapter to obtain soft labels with appropriate smoothness. Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods. Code will be released upon acceptance.
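    To make the notion of soft-label "smoothness" in the abstract above concrete, here is a minimal sketch using the classic temperature-scaled softmax from standard knowledge distillation. This is a stand-in for illustration only: PESF-KD, per the abstract, obtains its soft labels through a trained adapter, not a fixed temperature. The logits and helper names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures give smoother outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher entropy means a less sharp distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    teacher_logits = [5.0, 1.0, 0.0]        # hypothetical teacher outputs
    for t in (1.0, 2.0, 4.0):
        soft = softmax(teacher_logits, t)
        print(t, [round(p, 3) for p in soft], round(entropy(soft), 3))
```

    Raising the temperature flattens the teacher distribution and increases its entropy, which is one simple way to picture narrowing a sharpness gap between a confident teacher and a lower-capacity student.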
    Attention Flows for General Transformers. (arXiv:2205.15389v1 [cs.LG])
    In this paper, we study the computation of how much an input token in a Transformer model influences its prediction. We formalize a method to construct a flow network out of the attention values of encoder-only Transformer models and extend it to general Transformer architectures including an auto-regressive decoder. We show that running a maxflow algorithm on the flow network construction yields Shapley values, which determine the impact of a player in cooperative game theory. By interpreting the input tokens in the flow network as players, we can compute their influence on the total attention flow leading to the decoder's decision. Additionally, we provide a library that computes and visualizes the attention flow of arbitrary Transformer models. We show the usefulness of our implementation on various models trained on natural language processing and reasoning tasks.
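    The flow-network construction described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions: attention weights are given as `attn[l][i][j]` (how strongly token `i` in layer `l+1` attends to token `j` in layer `l`), residual connections and multiple heads are ignored, and max-flow is computed with a plain Edmonds-Karp routine. None of the names come from the paper's library, and the sketch computes only the max-flow quantity, not the Shapley-value connection the authors prove.

```python
from collections import deque


def attention_graph(attn):
    """Build edge capacities: node (l, j) feeds node (l + 1, i) with capacity attn[l][i][j]."""
    capacity = {}
    for l, layer in enumerate(attn):
        for i, row in enumerate(layer):
            for j, weight in enumerate(row):
                capacity.setdefault((l, j), {})[(l + 1, i)] = weight
    return capacity


def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity map (source != sink)."""
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0.0)   # reverse edges
    residual.setdefault(source, {})
    residual.setdefault(sink, {})
    flow = 0.0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


if __name__ == "__main__":
    # Two layers of attention over two tokens (hypothetical values; rows sum to 1).
    attn = [
        [[0.5, 0.5], [0.2, 0.8]],
        [[0.6, 0.4], [0.3, 0.7]],
    ]
    graph = attention_graph(attn)
    print(max_flow(graph, (0, 0), (2, 0)))  # flow from input token 0 to output position 0
```

    Running max-flow once per input token, with that token's bottom-layer node as source and the output position of interest as sink, gives one influence score per token under these simplifying assumptions.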
    Payday loans -- blessing or growth suppressor? Machine Learning Analysis. (arXiv:2205.15320v1 [econ.GN])
    The growth of the real estate market depends on a variety of factors spanning many domains. One under-recognized sector that affects the economy, and for which regulatory proposals are being drafted, is payday lending. This paper examines the impact of payday loans on the real estate market. It constructs an index of real estate concentration in a reference area as a function of payday loans, focusing on Toronto, Ontario, and sets out an approach to create, evaluate, and demonstrate the scenario through research analysis. The index rests on the basic debt-to-income ratio: when the income of a person paying interest on payday loans increases, their debt goes down marginally, which suggests that the person invests in fixed assets such as real estate, driving up its growth.
    PAC Generalization via Invariant Representations. (arXiv:2205.15196v2 [cs.LG] UPDATED)
    One method for obtaining generalizable solutions to machine learning tasks when presented with diverse training environments is to find invariant representations of the data. These are representations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a finite sample setting, we consider the notion of $\epsilon$-approximate invariance. We study the following question: If a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen SEMs? This larger collection of SEMs is generated through a parameterized family of interventions. Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that holds probabilistically over a family of linear SEMs without faithfulness assumptions. Our results show bounds that do not scale in ambient dimension when intervention sites are restricted to lie in a constant size subset of in-degree bounded nodes. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.
    Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations. (arXiv:2203.01360v3 [math.NA] UPDATED)
    Deep neural networks have been shown to provide accurate function approximations in high dimensions. However, fitting network parameters requires training data that may not be available beforehand, which is particularly challenging in science and engineering applications where often it is even unclear how to collect new informative training data in the first place. This work proposes Neural Galerkin schemes based on deep learning that generate training data samples with active learning for numerically solving high-dimensional partial differential equations. Neural Galerkin schemes train networks by minimizing the residual sequentially over time, which enables adaptively collecting new training data in a self-informed manner that is guided by the dynamics described by the partial differential equations, which is in stark contrast to many other machine learning methods that aim to fit network parameters globally in time without taking into account training data acquisition. Our finding is that the active form of gathering training data of the proposed Neural Galerkin schemes is key for numerically realizing the expressive power of networks in high dimensions. Numerical experiments demonstrate that Neural Galerkin schemes have the potential to enable simulating phenomena and processes with many variables for which traditional and other deep-learning-based solvers fail, especially when features of the solutions evolve locally such as in high-dimensional wave propagation problems and interacting particle systems described by Fokker-Planck and kinetic equations.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v1 [cs.LG])
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require concept labels in the training data to learn the bottleneck and do not leverage strong pretrained models. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address the limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining interpretability benefits. When concept annotation is not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts. PCBM also enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new (potentially different) data. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using any data from the target domain or model retraining.
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v2 [cs.LG] UPDATED)
    With the increasingly broad deployment of Federated Learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. Motivated by its great success in wireless networks, in this work, we introduce and study Proportional Fairness (PF) in FL. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for finding fair solutions in FL, with its convergence proved. Through extensive experiments on a wide array of vision and language datasets, we demonstrate that PropFair consistently achieves a noticeable improvement of the worst 10% accuracy over state-of-the-art fair FL algorithms, while maintaining competitive overall performance.
    Static Scheduling with Predictions Learned through Efficient Exploration. (arXiv:2205.15695v1 [cs.LG])
    A popular approach to go beyond the worst-case analysis of online algorithms is to assume the existence of predictions that can be leveraged to improve performances. Those predictions are usually given by some external sources that cannot be fully trusted. Instead, we argue that trustful predictions can be built by algorithms, while they run. We investigate this idea in the illustrative context of static scheduling with exponential job sizes. Indeed, we prove that algorithms agnostic to this structure do not perform better than in the worst case. In contrast, when the expected job sizes are known, we show that the best algorithm using this information, called Follow-The-Perfect-Prediction (FTPP), exhibits much better performances. Then, we introduce two adaptive explore-then-commit types of algorithms: they both first (partially) learn expected job sizes and then follow FTPP once their self-predictions are confident enough. On the one hand, ETCU explores in "series", by completing jobs sequentially to acquire information. On the other hand, ETCRR, inspired by the optimal worst-case algorithm Round-Robin (RR), explores efficiently in "parallel". We prove that both of them asymptotically reach the performances of FTPP, with a faster rate for ETCRR. Those findings are empirically evaluated on synthetic data.
    Neural Topic Model via Optimal Transport. (arXiv:2008.13537v3 [cs.IR] UPDATED)
    Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distributions. Importantly, the cost matrix of the OT distance models the weights between topics and words, which is constructed by the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms the state-of-the-art NTMs on discovering more coherent and diverse topics and deriving better document representations for both regular and short texts.
    Hedging option books using neural-SDE market models. (arXiv:2205.15991v1 [q-fin.CP])
    We study the capability of arbitrage-free neural-SDE market models to yield effective strategies for hedging options. In particular, we derive sensitivity-based and minimum-variance-based hedging strategies using these models and examine their performance when applied to various option portfolios using real-world data. Through backtesting analysis over typical and stressed market periods, we show that neural-SDE market models achieve lower hedging errors than Black--Scholes delta and delta-vega hedging consistently over time, and are less sensitive to the tenor choice of hedging instruments. In addition, hedging using market models leads to similar performance to hedging using Heston models, while the former tends to be more robust during stressed market periods.
    Intrinsic Dimension Estimation Using Wasserstein Distances. (arXiv:2106.04018v2 [stat.ML] UPDATED)
    It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v1 [cs.LG])
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with little data. The main challenge is how to avoid overfitting, since over-parameterized NNs can easily overfit to such small datasets. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from little data by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by endowing sparse NN structures like convolution layers in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discovers sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods, especially with large NNs.
    Variable importance without impossible data. (arXiv:2205.15750v1 [cs.LG])
    The most popular methods for measuring importance of the variables in a black box prediction algorithm make use of synthetic inputs that combine predictor variables from multiple subjects. These inputs can be unlikely, physically impossible, or even logically impossible. As a result, the predictions for such cases can be based on data very unlike any the black box was trained on. We think that users cannot trust an explanation of the decision of a prediction algorithm when the explanation uses such values. Instead we advocate a method called Cohort Shapley that is grounded in economic game theory and unlike most other game theoretic methods, it uses only actually observed data to quantify variable importance. Cohort Shapley works by narrowing the cohort of subjects judged to be similar to a target subject on one or more features. A feature is important if using it to narrow the cohort makes a large difference to the cohort mean. We illustrate it on an algorithmic fairness problem where it is essential to attribute importance to protected variables that the model was not trained on. For every subject and every predictor variable, we can compute the importance of that predictor to the subject's predicted response or to their actual response. These values can be aggregated, for example over all Black subjects, and we propose a Bayesian bootstrap to quantify uncertainty in both individual and aggregate Shapley values.
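    The cohort-narrowing idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "similar" means an exact match on a feature value, it enumerates all feature subsets (so it only suits a handful of features), and the names `cohort_mean` and `cohort_shapley` are hypothetical. The value of a feature set is the mean response over observed subjects that match the target on those features, and the standard Shapley formula is applied to that value function.

```python
from itertools import combinations
from math import factorial

def cohort_mean(X, y, target, features):
    """Mean response over subjects matching the target on every feature in `features`.

    Assumption: similarity is exact equality of feature values.
    """
    cohort = [yi for xi, yi in zip(X, y)
              if all(xi[j] == X[target][j] for j in features)]
    return sum(cohort) / len(cohort)  # never empty: the target matches itself

def cohort_shapley(X, y, target):
    """Exact Shapley values of the cohort-mean value function for one target subject."""
    d = len(X[0])
    values = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        phi = 0.0
        for size in range(d):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                gain = (cohort_mean(X, y, target, S + (j,))
                        - cohort_mean(X, y, target, S))
                phi += weight * gain
        values.append(phi)
    return values

# Toy data: the response depends only on feature 0.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
y = [0.0, 0.0, 1.0, 1.0, 1.0]
phi = cohort_shapley(X, y, target=3)
```

    Only observed rows of `X` are ever used, so no synthetic feature combinations are evaluated. The values sum to the difference between the fully narrowed cohort mean and the overall mean.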
    Nonconvex regularization for sparse neural networks. (arXiv:2004.11515v2 [math.OC] UPDATED)
    Convex $\ell_1$ regularization using an infinite dictionary of neurons has been suggested for constructing neural networks with desired approximation guarantees, but can be affected by an arbitrary amount of over-parametrization. This can lead to a loss of sparsity and result in networks with too many active neurons for the given data, in particular if the number of data samples is large. As a remedy, in this paper, a nonconvex regularization method is investigated in the context of shallow ReLU networks: We prove that in contrast to the convex approach, any resulting (locally optimal) network is finite even in the presence of infinite data (i.e., if the data distribution is known and the limiting case of infinite samples is considered). Moreover, we show that approximation guarantees and existing bounds on the network size for finite data are maintained.
    VC Theoretical Explanation of Double Descent. (arXiv:2205.15549v1 [stat.ML])
    There has been growing interest in the generalization performance of large multilayer neural networks that can be trained to achieve zero training error, while generalizing well on test data. This regime is known as 'second descent' and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, aka the bias-variance trade-off. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC generalization bounds. We illustrate an application of analytic VC-bounds for modeling double descent for classification problems, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several possible reasons for misinterpretation of VC-theoretical results in the machine learning community.
    Robust Projection based Anomaly Extraction (RPE) in Univariate Time-Series. (arXiv:2205.15548v1 [stat.ML])
    This paper presents a novel, closed-form, and data/computation efficient online anomaly detection algorithm for time-series data. The proposed method, dubbed RPE, is a window-based method and, in sharp contrast to the existing window-based methods, it is robust to the presence of anomalies in its window and it can distinguish the anomalies at the time-stamp level. RPE leverages the linear structure of the trajectory matrix of the time-series and employs a robust projection step which makes the algorithm able to handle the presence of multiple arbitrarily large anomalies in its window. A closed-form/non-iterative algorithm for the robust projection step is provided and it is proved that it can identify the corrupted time-stamps. RPE is a great candidate for applications where large training data sets are not available, which is a common scenario in the area of time-series. An extensive set of numerical experiments shows that RPE can outperform the existing approaches by a notable margin.
    Learning brain MRI quality control: a multi-factorial generalization problem. (arXiv:2205.15898v1 [stat.ML])
    Due to the growing volume of MRI data, automated quality control (QC) has become essential, especially for larger scale analysis. Several attempts have been made in order to develop reliable and scalable QC pipelines. However, the generalization of these methods on new data independent of those used for learning is a difficult problem because of the biases inherent in MRI data. This work aimed at evaluating the performances of the MRIQC pipeline on various large-scale datasets (ABIDE, N = 1102 and CATI derived datasets, N = 9037) used for both training and evaluation purposes. We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them. We further analyzed the site-wise and study-wise predicted classification probability distributions of the models without preprocessing trained on ABIDE and CATI data. Our main results were that a model using features extracted from MRIQC without preprocessing yielded the best results when trained and evaluated on large multi-center datasets with a heterogeneous population (an improvement of the ROC-AUC score on unseen data of 0.10 for the model trained on a subset of the CATI dataset). We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data. In spite of the performance improvement, the generalization abilities of the models remain questionable when looking at the site-wise/study-wise probability predictions and the optimal classification threshold derived from them.
    Kymatio: Scattering Transforms in Python. (arXiv:1812.11214v3 [cs.LG] UPDATED)
    The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed up over CPU implementations. The package also has a small memory footprint, resulting in efficient memory usage. The source code, documentation, and examples are available under a BSD license at https://www.kymat.io/
    Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game. (arXiv:2205.15512v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, the minimax optimal performance has only been (nearly) achieved for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose two new algorithms, SPEVI+ and SPMVI+, for single-agent MDPs and two-player zero-sum Markov games (MGs), respectively. The proposed algorithms feature carefully crafted data splitting mechanisms and novel variance-reduction pessimistic estimators. Theoretical analysis demonstrates that they are capable of matching the performance lower bounds up to logarithmic factors. As a byproduct, a new performance lower bound is established for MGs, which tightens the existing results. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity. (arXiv:2205.15809v1 [stat.ML])
    We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum which requires $N^{2}/4$ hidden neurons. But we also observe numerically that in more traditional settings far fewer than $N^{2}$ neurons are required to reach the minima.
    A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. (arXiv:2006.14171v3 [cs.LG] UPDATED)
    In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved state-of-the-art performance in many challenging strategy games. Because these games have complicated rules, an action sampled from the full discrete action distribution predicted by the learned policy is likely to be invalid according to the game rules (e.g., walking into a wall). The usual approach to deal with this problem in policy gradient algorithms is to "mask out" invalid actions and just sample from the set of valid actions. The implications of this process, however, remain under-investigated. In this paper, we 1) show theoretical justification for such a practice, 2) empirically demonstrate its importance as the space of invalid actions grows, and 3) provide further insights by evaluating different action masking regimes, such as removing masking after an agent has been trained using masking. The source code can be found at https://github.com/vwxyzjn/invalid-action-masking
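    The "mask out" step described above can be sketched as follows. This is a minimal illustration and not the code from the linked repository: the helper name `masked_softmax` is hypothetical, and invalid logits are set to negative infinity here, although a large negative constant would serve the same purpose.

```python
import math

def masked_softmax(logits, valid_mask):
    """Softmax over action logits that gives invalid actions zero probability."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    # Replace the logits of invalid actions with -inf before normalizing.
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid_mask)]
    top = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Four actions; actions 1 and 3 are invalid in the current state.
probs = masked_softmax([1.0, 2.0, 0.5, 3.0], [True, False, True, False])
```

    Sampling from `probs` can then only return a valid action, and the remaining probability mass is renormalized over the valid actions.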
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v4 [cs.LG] UPDATED)
    Black-box machine learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy-to-understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, false positive rate of out-of-distribution detection, and so on. We will include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.
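    As a rough illustration of the conformal prediction step mentioned above, the following is a sketch of split conformal prediction for classification. It is not taken from the paper's code samples: it assumes the conformal score is one minus the model's probability for the true class, the function names are hypothetical, and the calibration scores are made-up numbers.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the k-th smallest calibration score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every label is included
    return sorted(cal_scores)[k - 1]

def prediction_set(class_probs, qhat):
    """All labels whose score (1 - predicted probability) is within the threshold."""
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= qhat]

# Hypothetical calibration scores: 1 - (model probability of the true class).
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.90]
qhat = conformal_threshold(cal_scores, alpha=0.2)

confident = prediction_set([0.70, 0.25, 0.05], qhat)
uncertain = prediction_set([0.45, 0.42, 0.13], qhat)
```

    The second input, where the model is less sure, receives a larger prediction set than the first, which is the adaptivity the abstract describes. The coverage guarantee assumes calibration and test points are exchangeable.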
    Simulation-Based Inference with WALDO: Perfectly Calibrated Confidence Regions Using Any Prediction or Posterior Estimation Algorithm. (arXiv:2205.15680v1 [stat.ML])
    The vast majority of modern machine learning targets prediction problems, with algorithms such as Deep Neural Networks revolutionizing the accuracy of point predictions for high-dimensional complex data. Predictive approaches are now used in many domain sciences to directly estimate internal parameters of interest in theoretical simulator-based models. In parallel, common alternatives focus on estimating the full posterior using modern neural density estimators such as normalizing flows. However, an open problem in simulation-based inference (SBI) is how to construct properly calibrated confidence regions for internal parameters with nominal conditional coverage and high power. Many SBI methods are indeed known to produce overly confident posterior approximations, yielding misleading uncertainty estimates. Similarly, existing approaches for uncertainty quantification in deep learning provide no guarantees on conditional coverage. In this work, we present WALDO, a novel method for constructing correctly calibrated confidence regions in SBI. WALDO reframes the well-known Wald test and uses Neyman inversion to convert point predictions and posteriors from any prediction or posterior estimation algorithm to confidence sets with correct conditional coverage, even for finite sample sizes. As a concrete example, we demonstrate how a recently proposed deep learning prediction approach for particle energies in high-energy physics can be recalibrated using WALDO to produce confidence intervals with correct coverage and high power.
    Infinite-dimensional optimization and Bayesian nonparametric learning of stochastic differential equations. (arXiv:2205.15368v1 [stat.ML])
    The paper has two major themes. The first part of the paper establishes certain general results for infinite-dimensional optimization problems on Hilbert spaces. These results cover the classical representer theorem and many of its variants as special cases and offer a wider scope of applications. The second part of the paper then develops a systematic approach for learning the drift function of a stochastic differential equation by integrating the results of the first part with a Bayesian hierarchical framework. Importantly, our Bayesian approach incorporates low-cost sparse learning through proper use of shrinkage priors while allowing proper quantification of uncertainty through posterior distributions. Several examples at the end illustrate the accuracy of our learning scheme.
    Minimax Classification under Concept Drift with Multidimensional Adaptation and Performance Guarantees. (arXiv:2205.15942v1 [stat.ML])
    The statistical characteristics of instance-label pairs often change with time in practical scenarios of supervised classification. Conventional learning techniques adapt to such concept drift accounting for a scalar rate of change by means of a carefully chosen learning rate, forgetting factor, or window size. However, the time changes in common scenarios are multidimensional, i.e., different statistical characteristics often change in a different manner. This paper presents adaptive minimax risk classifiers (AMRCs) that account for multidimensional time changes by means of a multivariate and high-order tracking of the time-varying underlying distribution. In addition, differently from conventional techniques, AMRCs can provide computable tight performance guarantees. Experiments on multiple benchmark datasets show the classification improvement of AMRCs compared to the state-of-the-art and the reliability of the presented performance guarantees.
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v1 [cs.LG])
    This paper studies the robustness of data valuation to noisy model performance scores. In particular, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
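    One way to picture a sample-reuse estimator of the Banzhaf value is sketched below. This is an illustration under stated assumptions rather than the authors' algorithm as published: subsets are drawn uniformly by including each point with probability 1/2, every sampled utility is reused for every point, and the value of point i is estimated as the mean utility of sampled subsets containing i minus the mean utility of those that do not. The function name `msr_banzhaf` and the additive toy utility are hypothetical; a real utility would be a model performance score.

```python
import random

def msr_banzhaf(n_points, utility, n_samples, rng):
    """Estimate Banzhaf values for all points from one shared pool of sampled subsets."""
    samples = []
    for _ in range(n_samples):
        # Uniform subset: each point is included independently with probability 1/2.
        subset = frozenset(i for i in range(n_points) if rng.random() < 0.5)
        samples.append((subset, utility(subset)))
    values = []
    for i in range(n_points):
        with_i = [u for s, u in samples if i in s]
        without_i = [u for s, u in samples if i not in s]
        # Guard against an empty group, which can happen with very few samples.
        mean_with = sum(with_i) / len(with_i) if with_i else 0.0
        mean_without = sum(without_i) / len(without_i) if without_i else 0.0
        values.append(mean_with - mean_without)
    return values

# Toy additive utility: the exact Banzhaf value of point i is weights[i].
weights = [1.0, 2.0, -1.0, 0.5]
toy_utility = lambda subset: sum(weights[i] for i in subset)
estimates = msr_banzhaf(len(weights), toy_utility, 20000, random.Random(0))
```

    Because each utility evaluation contributes to the estimate of every point, the number of model trainings does not grow with the number of points being valued.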
    One Policy is Enough: Parallel Exploration with a Single Policy is Minimax Optimal for Reward-Free Reinforcement Learning. (arXiv:2205.15891v1 [cs.LG])
    While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL for linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is minimax optimal up to logarithmic factors in the reward-free setting for both linear MDPs and two-player zero-sum MGs. From a practical perspective, our paper shows that a single policy is sufficient and provably optimal for incorporating parallelism during the exploration phase.
    Inducing bias is simpler than you think. (arXiv:2205.15935v1 [cs.LG])
    Machine learning may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. To counter this, some of the model accuracy can be traded off for a secondary objective that helps prevent a specific type of bias. Multiple notions of fairness have been proposed to this end but recent studies show that some fairness criteria often stand in mutual competition. In the present work, we introduce a solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical behaviour of learning models trained in our synthetic framework and find similar unfairness behaviours as those observed on more realistic data. However, we also identify a positive transfer effect between the different subpopulations within the data. This suggests that mixing data with different statistical properties could be helpful, provided the learning model is made aware of this structure. Finally, we analyse the issue of bias mitigation: by reweighing the various terms in the training loss, we indirectly minimise standard unfairness metrics and highlight their incompatibilities. Leveraging the insights on positive transfer, we also propose a theory-informed mitigation strategy, based on the introduction of coupled learning models. By allowing each model to specialise on a different community within the data, we find that multiple fairness criteria and high accuracy can be achieved simultaneously.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v1 [cs.LG])
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Posterior and Computational Uncertainty in Gaussian Processes. (arXiv:2205.15449v1 [cs.LG])
    Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
    Fast Predictive Uncertainty for Classification with Bayesian Deep Networks. (arXiv:2003.01227v4 [cs.LG] UPDATED)
    In Bayesian Deep Learning, distributions over the output of classification neural networks are often approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet.
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v1 [stat.ML])
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.
    coVariance Neural Networks. (arXiv:2205.15856v1 [cs.LG])
    Graph neural networks (GNNs) are an effective framework that exploits inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we propose a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions, a feature that is infeasible for PCA-based approaches.
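    The idea of treating a sample covariance matrix as a graph can be sketched as a polynomial filter, z = sum_k h_k C^k x, applied to a data vector x. This is a minimal, assumption-laden illustration: the abstract gives no formula, the filter taps h_k are hypothetical placeholders for trainable weights, and the nonlinearity and layer stacking of a full VNN are omitted.

```python
def sample_covariance(samples):
    """Unbiased sample covariance of a list of equal-length feature vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in samples]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def covariance_filter(cov, x, taps):
    """Compute z = sum_k taps[k] * cov^k x without forming matrix powers."""
    out = [0.0] * len(x)
    power = list(x)  # cov^0 x
    for h in taps:
        out = [o + h * p for o, p in zip(out, power)]
        power = matvec(cov, power)  # advance to cov^(k+1) x
    return out


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    cov = sample_covariance(data)
    print(covariance_filter(cov, [1.0, 0.0], taps=[0.5, 0.1, 0.01]))
```

    Since the output depends on the covariance matrix only through matrix-vector products, no eigendecomposition is computed, which is consistent with the stability argument made in the abstract.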
    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. (arXiv:2201.04469v6 [stat.ML] UPDATED)
    We consider fixed-budget best arm identification in two-armed bandit problems. One of the longstanding open questions is a tight lower bound on the probability of misidentifying the best arm and a strategy whose upper bound matches the lower bound when the optimal target allocation ratio of arm draws is unknown. We address this problem when the gap between the expected rewards is small. First, we introduce a distribution-dependent lower bound. Then, we propose the ``RS-AIPW'' strategy, which consists of the random sampling (RS) rule using the estimated optimal target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed strategy is optimal in the sense that the upper bound achieves the lower bound when the budget goes to infinity and the gap goes to zero. In the course of the analysis, we present a novel large deviation bound for martingales.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v1 [cs.LG])
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min \left( H^{3/2} / N, \, H / \sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
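    To make the three rates quoted in the abstract above concrete, the sketch below evaluates them for given horizon H and dataset size N. It ignores constants and logarithmic factors, so the numbers only indicate scaling; the function names are hypothetical.

```python
import math


def behavioral_cloning_gap(h, n):
    """Scaling H^2 / N quoted for behavioral cloning."""
    return h ** 2 / n


def moment_matching_gap(h, n):
    """Scaling H / sqrt(N) quoted for online moment matching."""
    return h / math.sqrt(n)


def replay_estimation_gap(h, n):
    """Scaling min(H^(3/2) / N, H / sqrt(N)) quoted for replay estimation."""
    return min(h ** 1.5 / n, h / math.sqrt(n))


if __name__ == "__main__":
    for h, n in [(100, 50), (100, 10_000), (1_000, 10_000)]:
        print(
            f"H={h:5d} N={n:6d}  "
            f"BC={behavioral_cloning_gap(h, n):10.3f}  "
            f"MM={moment_matching_gap(h, n):10.3f}  "
            f"RE={replay_estimation_gap(h, n):10.3f}"
        )
```

    The first term of the minimum, H^(3/2) / N, is the smaller one exactly when N > H, so the improvement over moment matching appears once the expert dataset is larger than the horizon.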
    The CLRS Algorithmic Reasoning Benchmark. (arXiv:2205.15659v1 [cs.LG])
    Learning representations of algorithms is an emerging area of machine learning, seeking to bridge concepts from neural networks with classical algorithms. Several important works have investigated whether neural networks can effectively reason like algorithms, typically by learning to execute them. The common trend in the area, however, is to generate targeted kinds of algorithmic data to evaluate specific hypotheses, making results hard to transfer across publications, and increasing the barrier of entry. To consolidate progress and work towards unified evaluation, we propose the CLRS Algorithmic Reasoning Benchmark, covering classical algorithms from the Introduction to Algorithms textbook. Our benchmark spans a variety of algorithmic reasoning procedures, including sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms. We perform extensive experiments to demonstrate how several popular algorithmic reasoning baselines perform on these tasks, and consequently, highlight links to several open challenges. Our library is readily available at https://github.com/deepmind/clrs.  ( 2 min )
    Polynomial-time Sparse Deconvolution. (arXiv:2204.07879v2 [cs.LG] UPDATED)
    How can a probability measure be recovered with sparse support from its generalized moments? This problem, called sparse deconvolution, has been the focus of research in mathematics, theoretical computer science, and neural computing. However, there is no polynomial-time algorithm for the recovery. The best algorithm requires $O\left(\text{dimension}^{\text{poly}(1/\epsilon)}\right)$ for $\epsilon$-accurate recovery. We propose the first poly-time recovery method from carefully designed moments that requires $O\left(\text{dimension}^4\log(1/\epsilon)/\epsilon^2\right)$ computations for an $\epsilon$-accurate recovery. This method relies on learning a planted two-layer neural network with two-dimensional inputs, a finite width, and zero-one activation. For learning such networks, we establish the first poly-time complexity, and demonstrate its application in sparse deconvolution.  ( 2 min )
    PDE-based Group Equivariant Convolutional Neural Networks. (arXiv:2001.09046v6 [cs.LG] UPDATED)
    We present a PDE-based framework that generalizes Group equivariant Convolutional Neural Networks (G-CNNs). In this framework, a network layer is seen as a set of PDE-solvers where geometrically meaningful PDE-coefficients become the layer's trainable weights. Formulating our PDEs on homogeneous spaces allows these networks to be designed with built-in symmetries such as rotation in addition to the standard translation equivariance of CNNs. Having all the desired symmetries included in the design obviates the need to include them by means of costly techniques such as data augmentation. We will discuss our PDE-based G-CNNs (PDE-G-CNNs) in a general homogeneous space setting while also going into the specifics of our primary case of interest: roto-translation equivariance. We solve the PDE of interest by a combination of linear group convolutions and non-linear morphological group convolutions with analytic kernel approximations that we underpin with formal theorems. Our kernel approximations allow for fast GPU-implementation of the PDE-solvers. We release our implementation with this article in the form of the LieTorch extension to PyTorch, available at https://gitlab.com/bsmetsjr/lietorch . Just as for a linear convolution, a morphological convolution is specified by a kernel that we train in our PDE-G-CNNs. In PDE-G-CNNs we do not use non-linearities such as max/min-pooling and ReLUs as they are already subsumed by morphological convolutions. We present a set of experiments to demonstrate the strength of the proposed PDE-G-CNNs in increasing the performance of deep learning based imaging applications with far fewer parameters than traditional CNNs.  ( 3 min )
    Online Meta-Learning in Adversarial Multi-Armed Bandits. (arXiv:2205.15921v1 [cs.LG])
    We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.  ( 2 min )
    Will Bilevel Optimizers Benefit from Loops. (arXiv:2205.14224v2 [cs.LG] UPDATED)
    Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two currently popular bilevel optimizers, AID-BiO and ITD-BiO, naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish a unified convergence analysis for both AID-BiO and ITD-BiO that is applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enables an explicit comparison among them. Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.  ( 2 min )
    Variational inference via Wasserstein gradient flows. (arXiv:2205.15902v1 [stat.ML])
    Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior $\pi$, VI aims at producing a simple but effective approximation $\hat \pi$ to $\pi$ for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, VI is still poorly understood and dominated by heuristics. In this work, we propose principled methods for VI, in which $\hat \pi$ is taken to be a Gaussian or a mixture of Gaussians, which rest upon the theory of gradient flows on the Bures-Wasserstein space of Gaussian measures. Akin to MCMC, it comes with strong theoretical guarantees when $\pi$ is log-concave.  ( 2 min )
    Optimally adaptive Bayesian spectral density estimation for stationary and nonstationary processes. (arXiv:2003.02367v3 [stat.ME] UPDATED)
    This article improves on existing methods to estimate the spectral density of stationary and nonstationary time series assuming a Gaussian process prior. By optimising an appropriate eigendecomposition using a smoothing spline covariance structure, our method more appropriately models data with both simple and complex periodic structure. We further justify the utility of this optimal eigendecomposition by investigating the performance of alternative covariance functions other than smoothing splines. We show that the optimal eigendecomposition provides a material improvement, while the other covariance functions under examination do not, all performing comparably to the smoothing spline. During our computational investigation, we introduce new validation metrics for the spectral density estimate, inspired by the physical sciences. We validate our models in an extensive simulation study and demonstrate superior performance with real data.  ( 2 min )
    Unbalanced CO-Optimal Transport. (arXiv:2205.14923v2 [stat.ML] UPDATED)
    Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers that are omnipresent in real-world data. This prompts us to propose unbalanced COOT for which we provably show its robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness for the challenging tasks of heterogeneous domain adaptation with and without varying proportions of classes and simultaneous alignment of samples and features across single-cell measurements.  ( 2 min )
    Few-Shot Diffusion Models. (arXiv:2205.15463v1 [cs.CV])
    Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.  ( 2 min )
    QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning. (arXiv:2106.00797v3 [cs.LG] UPDATED)
    The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end, we propose a novel federated Markov Chain Monte Carlo algorithm, referred to as Quantised Langevin Stochastic Dynamics which may be seen as an extension to the FL setting of Stochastic Gradient Langevin Dynamics, which handles the communication bottleneck using gradient compression. To improve performance, we then introduce variance reduction techniques, which lead to two improved versions coined \texttt{QLSD}$^{\star}$ and \texttt{QLSD}$^{++}$. We give both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms. We illustrate their performances using various Bayesian Federated Learning benchmarks.  ( 2 min )
    Smoothed Online Learning is as Easy as Statistical Learning. (arXiv:2202.04690v3 [stat.ML] UPDATED)
    Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.  ( 2 min )
    Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms. (arXiv:2009.09538v2 [cs.LG] UPDATED)
    EXP-based algorithms are often used for exploration in non-stochastic bandit problems assuming rewards are bounded. We propose a new algorithm, namely EXP4.P, by modifying EXP4 and establish its upper bound of regret in both bounded and unbounded sub-Gaussian contextual bandit settings. The unbounded reward result also holds for a revised version of EXP3.P. Moreover, we provide a lower bound on regret that suggests no sublinear regret can be achieved given a short time horizon. Unlike classical analyses, none of our analyses requires bounded rewards. We also extend EXP4.P from contextual bandits to reinforcement learning to incentivize exploration by multiple agents given black-box rewards. The resulting algorithm has been tested on hard-to-explore games and it shows an improvement in exploration compared to the state of the art.  ( 2 min )
    Cross-view kernel transfer. (arXiv:1910.05964v2 [cs.LG] UPDATED)
    We consider the kernel completion problem in the presence of multiple views in the data. In this context the data samples can be fully missing in some views, creating missing columns and rows in the kernel matrices that are calculated individually for each view. We propose to solve the problem of completing the kernel matrices with the Cross-View Kernel Transfer (CVKT) procedure, in which the features of the other views are transformed to represent the view under consideration. The transformations are learned with kernel alignment to the known part of the kernel matrix, allowing for finding generalizable structures in the kernel matrix under completion. Its missing values can then be predicted with the data available in other views. We illustrate the benefits of our approach with simulated data, a multivariate digits dataset and a multi-view dataset on gesture classification, as well as with real biological datasets from studies of pattern formation in early \textit{Drosophila melanogaster} embryogenesis.  ( 2 min )
    Learning (Very) Simple Generative Models Is Hard. (arXiv:2205.16003v1 [cs.LG])
    Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $F:\mathbb{R}^d\to\mathbb{R}^{d'}$, let $D$ be the distribution over $\mathbb{R}^{d'}$ given by pushing the standard Gaussian $\mathcal{N}(0,\textrm{Id}_d)$ through $F$. Given i.i.d. samples from $D$, the goal is to output any distribution close to $D$ in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $F$ are one-hidden-layer ReLU networks with $\log(d)$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and $\mathrm{poly}(d)$ neurons [Daniely-Vardi '21, Chen-Gollakota-Klivans-Meka '22]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially-bounded slopes such that the pushforward of $\mathcal{N}(0,1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0,1)$.  ( 2 min )
    Improvements to Supervised EM Learning of Shared Kernel Models by Feature Space Partitioning. (arXiv:2205.15304v1 [cs.LG])
    Expectation maximisation (EM) is usually thought of as an unsupervised learning method for estimating the parameters of a mixture distribution; however, it can also be used for supervised learning when class labels are available. As such, EM has been applied to train neural nets including the probabilistic radial basis function (PRBF) network or shared kernel (SK) model. This paper addresses two major shortcomings of previous work in this area: the lack of rigour in the derivation of the EM training algorithm; and the computational complexity of the technique, which has limited it to low dimensional data sets. We first present a detailed derivation of EM for the Gaussian shared kernel model PRBF classifier, making use of data association theory to obtain the complete data likelihood, Baum's auxiliary function (the E-step) and its subsequent maximisation (M-step). To reduce complexity of the resulting SKEM algorithm, we partition the feature space into $R$ non-overlapping subsets of variables. The resulting product decomposition of the joint data likelihood, which is exact when the feature partitions are independent, allows the SKEM to be implemented in parallel and at $R^2$ times lower complexity. The operation of the partitioned SKEM algorithm is demonstrated on the MNIST data set and compared with its non-partitioned counterpart. The results show that improved performance at reduced complexity is achievable. Comparisons with standard classification algorithms are provided on a number of other benchmark data sets.  ( 2 min )
    Optimal Transport of Classifiers to Fairness. (arXiv:2202.03814v2 [cs.LG] UPDATED)
    In past work on fairness in machine learning, the focus has been on forcing the prediction of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.  ( 2 min )
    Ensemble methods for survival function estimation with time-varying covariates. (arXiv:2006.00567v6 [stat.AP] UPDATED)
    Survival data with time-varying covariates are common in practice. If relevant, they can improve the estimation of the survival function. However, the traditional survival forests - conditional inference forest, relative risk forest and random survival forest - have accommodated only time-invariant covariates. We generalize the conditional inference and relative risk forests to allow time-varying covariates. We also propose a general framework for estimation of a survival function in the presence of time-varying covariates. We compare their performance with that of the Cox model and transformation forest, adapted here to accommodate time-varying covariates, through a comprehensive simulation study in which the Kaplan-Meier estimate serves as a benchmark, and performance is compared using the integrated L2 difference between the true and estimated survival functions. In general, the performance of the two proposed forests substantially improves over the Kaplan-Meier estimate. Taking into account all other factors, under the proportional hazard (PH) setting, the best method is always one of the two proposed forests, while under the non-PH setting, it is the adapted transformation forest. K-fold cross-validation is used as an effective tool to choose between the methods in practice.  ( 2 min )
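    The Kaplan-Meier estimate used as the benchmark in the abstract above is the standard product-limit estimator, S(t) = prod over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number still at risk. The sketch below is a minimal covariate-free version for right-censored data; it is not the forest methods the paper proposes, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate for right-censored data.

    times:  observed times (event or censoring)
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (t, S(t)) at each distinct time with at least one event.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        n_at_risk -= removed
    return curve


if __name__ == "__main__":
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

    Censored subjects never reduce the survival estimate directly; they only shrink the risk set for later event times, which is why the drop at the third time point is computed out of two subjects rather than three.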
    Critic Sequential Monte Carlo. (arXiv:2205.15460v1 [stat.ML])
    We introduce CriticSMC, a new algorithm for planning as inference built from a novel composition of sequential Monte Carlo with learned soft-Q function heuristic factors. This algorithm is structured so as to allow using large numbers of putative particles leading to efficient utilization of computational resource and effective discovery of high reward trajectories even in environments with difficult reward surfaces such as those arising from hard constraints. Relative to prior art our approach is notably still compatible with model-free reinforcement learning in the sense that the implicit policy we produce can be used at test time in the absence of a world model. Our experiments on self-driving car collision avoidance in simulation demonstrate improvements against baselines in terms of infraction minimization relative to computational effort while maintaining diversity and realism of found trajectories.  ( 2 min )
    Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization. (arXiv:2205.15784v1 [stat.CO])
    Bayesian Likelihood-Free Inference methods yield posterior approximations for simulator models with intractable likelihood. Recently, many works trained neural networks to approximate either the intractable likelihood or the posterior directly. Most proposals use normalizing flows, namely neural networks parametrizing invertible maps used to transform samples from an underlying base measure; the probability density of the transformed samples is then accessible and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. A recent work [Ramesh et al., 2022] approximated instead the posterior with generative networks, which drop the invertibility requirement and are thus a more flexible class of distributions scaling to high-dimensional and structured data. However, generative networks only allow sampling from the parametrized distribution; for this reason, Ramesh et al. [2022] follows the common solution of adversarial training, where the generative network plays a min-max game against a "critic" network. This procedure is unstable and can lead to a learned distribution underestimating the uncertainty - in extreme cases collapsing to a single point. Here, we propose to approximate the posterior with generative networks trained by Scoring Rule minimization, an overlooked adversarial-free method enabling smooth training and better uncertainty quantification. In simulation studies, the Scoring Rule approach yields better performances with shorter training time with respect to the adversarial framework.  ( 2 min )
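    As an illustration of training by scoring rule minimization, the sketch below computes a sample-based estimate of one common strictly proper scoring rule, the energy score, ES(P, y) = 2 E||X - y|| - E||X - X'|| with X, X' drawn independently from P (lower is better). Which scoring rules the paper uses is not stated in the abstract, so the choice of the energy score here is an assumption, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def energy_score(samples, y):
    """Unbiased sample estimate of the energy score (exponent 1).

    samples: list of m >= 2 draws from the generative network
    y:       the observed target (for inference, the true parameter)
    """
    m = len(samples)
    fit = 2.0 * sum(euclidean(x, y) for x in samples) / m
    spread = sum(
        euclidean(samples[j], samples[k])
        for j in range(m)
        for k in range(m)
        if j != k
    ) / (m * (m - 1))
    return fit - spread


if __name__ == "__main__":
    target = [5.0]
    collapsed = [[1.0], [1.0]]    # no spread, wrong location
    dispersed = [[0.0], [2.0]]    # same mean, non-zero spread
    print("collapsed:", energy_score(collapsed, target))
    print("dispersed:", energy_score(dispersed, target))
```

    The subtracted spread term rewards diversity among samples, so a generator that collapses to a single wrong point scores worse than one that keeps some spread around the same location. The estimate needs only samples from the generator, so no critic network is involved.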
    Holistic Generalized Linear Models. (arXiv:2205.15447v1 [stat.ML])
    Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The $\textsf{R}$ package $\texttt{holiglm}$ provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art conic mixed-integer solvers, the package can reliably solve GLMs for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the $\texttt{stats::glm()}$ function.  ( 2 min )
  • Open

    Qualitative humanities research is crucial to AI
    “All research is qualitative; some is also quantitative” (Harvard social scientist and statistician Gary King). Suppose you wanted to find out whether a machine learning system being adopted - to recruit candidates, lend money, or predict future criminality - exhibited racial bias. You might calculate model performance across groups with different races. But how was race categorised: through a census record, a police officer’s guess, or by an annotator? Each possible answer raises another set of questions. Following the thread of any seemingly quantitative issue around AI ethics quickly leads to a host of qualitative questions. Throughout AI, qualitative decisions are made about what metrics to optimise for, which categories to use, how to define their bounds, who applies the labels. Simila…  ( 8 min )

  • Open

    How to Optimize your AI Models at Inference Time
    submitted by /u/aidev2040 [link] [comments]
    New AI Robot Hand Designer Creates Hyper Efficient Manipulators | Self Driving Autonomous Vehicle AI For Spacecraft | Coral Reef AI Tool
    submitted by /u/tohelpyou88 [link] [comments]  ( 1 min )
    Why are neural networks not multi-dimensional?
    Title says almost everything. I believe this might solve sequential deep learning once and for all (as opposed to current methods such as LSTM, the architecture of which I find arbitrary). Although I can envision multi-dimensional neural networks in multiple shapes, I will elaborate on what I think is the most straightforward shape. Input: Imagine the most widely used example of sequential data, the moving ball. Input data is 2d, structured in the following way: Axis 0: [x-coordinate, y-coordinate] (not to be confused with X as in input and y as in output); Axis 1: [timestep 1, ..., timestep k]. As such, one would end up with the following 2d array as input: [[x-coordinate, y-coordinate]_{t-k}, ..., [x-coordinate, y-coordinate]_{t-0}]. Hidden layers: Now imagine the following 3d structure used as hidden layers: Axis 0: a single hidden layer, i.e., [neuron, ..., neuron]; Axis 1: an entire network, i.e., [layer, ..., layer]; Axis 2: a web of identical networks, each connected to the input layer and network corresponding to its own timestep and the next. As such, one would end up with the following 3d structure as hidden layers: [[[neuron, ..., neuron], ..., [neuron, ..., neuron]]_{t-k}, ..., [[neuron, ..., neuron], ..., [neuron, ..., neuron]]_{t-0}]. But I can also imagine that the hidden layers are only a 2d structure. Output: In the case of this data, the output is 1d: Axis 0: [x-coordinate_{t+1}, y-coordinate_{t+1}]. Only the network corresponding to the last visible timestep will produce usable output. Schematic: I have created a schema, in which the arrows denote which element is connected to which: https://ibb.co/LzZM6w6 I can imagine one concern as to why this has not been realized yet: for every added dimension, computational complexity increases quadratically. Is there any other concern and/or is my thinking sound? submitted by /u/Thijs-vW [link] [comments]  ( 5 min )
    Data platform beta test: private beta of customizable schema to fit your dataset formats
    Hi everyone, my name is Taylor and I work at Graviti - We are a cloud data platform for ML practitioners to better and faster manage unstructured data at a large scale. The platform hands developers the ability to do data query, version control, visualization and workflow automation on all types of data based on our powerful compute engine. Now we are launching a private beta of Graviti data platform v3.0 with a new feature -custom schema, which allows you to manage heterogeneous data in a tabular data model and fit your own data formats. Our goal is to find more potential users and receive their honest feedback from the test as well as help us co-build a better data platform for AI and machine learning. We need a group of people from the community who work closely with data in direction of computer vision, NLP, etc, and will be eager to test our data platform, share feedback and help us make it the best fit for more machine learning teams. We appreciate your time and valuable contribution and offer rewards of 3 months of free usage of Graviti data platform(compute included) as well as an Amazon gift card. Interested? Here is our application form. (We will process the application in 48 hours and contact you with further details. ) Feel free to leave comments or any thoughts here. Thank you! submitted by /u/Strong_Bookkeeper_78 [link] [comments]  ( 1 min )
  • Open

    How to Optimize your AI Models at Inference Time
    submitted by /u/aidev2040 [link] [comments]
    AI MSc or Physics PhD?
    So I've got two options to choose from, and I was hoping that you could give your opinion about it. The first option is to do a (three year) PhD in physics. The other option is to do another master's in artificial intelligence. I'll also post this on some physics subreddit. But first a little bit about my background. I've got a BSc and (almost) MSc degree in physics. For my master's thesis, I'm developing machine learning techniques for better posterior estimation for astrophysical data. Obviously, I'm very interested in the further development of ML tools for astro/particle physics applications. The PhD is all about that, developing ML tools for astrophysics. However, I have some concerns about it. First, I am not sure if I want to stay in academia after my PhD. So would it even be a good idea to try to do a PhD? If I want to go into ML industry (my dream would be something like Google) after my PhD, how does a physics PhD heavily focused on ML compare to an AI PhD or MSc? The other option would be to do the AI master's after my physics master's. There are a few things that appeal to me about it. First, I can stay in the city I currently live in. For the PhD, I would have to move abroad. Moving abroad for three years kinda scares me. Secondly, the diversity of what I will learn will be a lot greater compared to the PhD. During the master's you are taught a lot of very interesting cutting-edge ML techniques at a high pace, while during the PhD I will probably focus on a few in more detail. If I chose to skip the physics PhD and go for the AI MSc, my plan would be to start looking for AI and/or physics PhDs shortly before graduating. Maybe important to know: this all takes place in north-west Europe, where one always does a master's before a PhD. Both universities are good (top ~100) but not top-notch. submitted by /u/elipeli54 [link] [comments]  ( 2 min )
    AI Robot Hand Designer Creates Hyper Efficient Manipulators | Self Driving Autonomous Vehicle AI For Spacecraft | Coral Reef AI Measure Ecosystem Health
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    In this series, we break down the basics of starting an NLP project from scratch. Check out the blog to learn more.
    submitted by /u/UBIAI [link] [comments]
    Fundamental ethical objection to seeking AGI?
    I came across a philosophical / ethical argument against seeking AGI the other day that I can't see a way past. It's extremely hypothetical with respect to our current progress with AI, but I was curious what others make of it. Basically it goes like this. As we make AI more and more sophisticated we gradually scale up the level of consciousness, say from something comparable to an insect (maybe we are close now?) to a cat, to a chimp, to a child, to a grown human, etc. Most people would say that the further along this scale you are, the more capable of suffering you are and the more rights you should have. So given the 'ease' with which computer programs are run and deleted, we could foresee that in the quest for AGI we could create and 'kill' billions of entities with consciousness comparable to that of a chimp or human child. So if it is possible to make an AGI, it will by definition require experimentation on many billions of near-AGIs, which by definition is morally equivalent to mass experimentation on / death of child-like beings. I see huge potential for all forms of AI for making the world better, but the above seems unconscionable to me. Obviously this is all in the realm of sci-fi now, but given most of us here would like to reach some form of AGI, and given we think it is possible at some point, how do we hypothetically get round this apparently fundamental issue? submitted by /u/bbbbbadtothe [link] [comments]  ( 3 min )
    AI Dream 53 - Cosmic Birth | Q: Grainy or Smooth?
    submitted by /u/LordPewPew777 [link] [comments]
    Why your AI chatbot doesn't get what you're asking it
    One typical reason is that the AI chatbot was trained badly, or that you or your conversational designer chose a bad strategy for training it. Another is that the language aspect wasn't taken into account when creating the NLP dataset. Suppose you wanted to create a multilingual AI chatbot that speaks languages such as German and Chinese. These languages have very different linguistic structures. But don't worry, I recently worked on a guide that explores the basic steps in chatbot training before actual development and the best practices for conversational AI after the chatbot launch. Please let me know if I didn't cover something. Read Tips on How to Train a Chatbot for Businesses submitted by /u/Avandegraund [link] [comments]  ( 1 min )
    Why are neural networks not multi-dimensional?
    submitted by /u/Thijs-vW [link] [comments]  ( 1 min )
    What Hugging Face and Microsoft’s collaboration means for applied AI
    submitted by /u/bendee983 [link] [comments]
    Pothole Detector based on YoloV4
    submitted by /u/Gloomy_Recognition_4 [link] [comments]  ( 1 min )
    Google Has Banned the Training of Deepfakes in Colab, See Why?
    submitted by /u/Dip14099 [link] [comments]
    can an ai write a dance review based on other reviews ?
    Hi, I had the idea to let an AI write a dance review based on other reviews and then make a piece out of that. What would I need for that, if it is even possible? If you could help me somehow, please leave a comment. submitted by /u/NolanDeC [link] [comments]  ( 1 min )
    Are there any AI powered tools that could act as my eyes in video games?
    So my friends usually have to help me get through games when I play with them since I am blind IRL, and I am excited for the AI industry to possibly make a tool in the future that would allow me to simply point it at a game window and have it direct me to places, explain what I'm looking at, read item tooltips, etc. Does something like this exist? For example, if I'm in a game where there is typically an exclamation point to indicate a quest object, could I ask the AI to be on the lookout, so that when it finds one it lets me know and then directs me to it? This would be super useful if this is a thing, and I'm eager to hear what you guys have to say! Thanks! submitted by /u/ChipsAhoiMcCoy [link] [comments]  ( 2 min )
    AI Dreams - Dall-E 2, Wumbo AI, and Gigapixel AI - Thoughts on completely AI-Generated original artwork?
    submitted by /u/Dear_Watson [link] [comments]  ( 1 min )
    The Shapes And Patterns You'll Find In Higher Dimensions - 4K Creative Neural-Art Exploration
    submitted by /u/MLInsights [link] [comments]
  • Open

    [R] Solving Bayesian Inverse Problems via Variational Autoencoders
    Paper: https://proceedings.mlr.press/v145/goh22a.html Code: https://github.com/hwangoh/uq-vae This mathematically justified framework offers a data-driven approach to uncertainty quantification for Bayesian inverse problems. When a neural network comes into play, the information contained within a training dataset is embedded into the network; the output of which quantifies the uncertainty in the underlying parameter estimation problem using this information. submitted by /u/hwangoh [link] [comments]  ( 1 min )
    [D] DALL-E 2 Has Its Own Secret Language
    I discovered this through the thread here: Twitter Thread The full paper is here: Paper It seems that the garbled text that Dall-E 2 generates can be run back through the model to produce consistent results. Weird associations like “Apoploe vesrreaitais” meaning a specific type of bird in a lot of different contexts. Really cool find IMO! I guess it makes sense that if Dall-E can’t distinguish between these tokens semantically when generating images, it won’t be able distinguish between them as prompts. Anyone know of similar results / other explanations for this phenomenon? submitted by /u/NMister_ [link] [comments]  ( 2 min )
    [D] How to choose loss function for classification problem?
    A loss function is a penalty function over predictions and true labels. The choice of loss can affect model performance. I wonder if there is a way to choose which loss function works better in the context of classification problems. submitted by /u/flaubart9 [link] [comments]  ( 1 min )
    [D] [R] Any benchmark for entity-specific sentiment analysis?
    Is there a benchmark for entity-specific sentiment analysis? E.g., like what online sentiment analysis tools such as Google Cloud's natural language API or Watson's NLP API do? (Example) E.g., if the sentence is "Google unveiled the new Android phone for $799", the sentiments would be about the entities Google, Android, etc. I'm trying to find out if there's a known benchmark for this task but have been coming up empty, despite this task being a staple of all sentiment analysis tools. I am familiar with the aspect-based sentiment analysis task (e.g., the SemEval-16 task 4.2). But it seemed to me like it's different from entity-specific sentiment analysis: the targets there are different aspects of the same entity. So that made me wonder whether there's a more specific dataset that deals with entity-specific sentiment rather than aspect-based sentiment. submitted by /u/RustBucket03 [link] [comments]  ( 1 min )
    [R] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    Paper: https://arxiv.org/abs/2205.14135 Twitter: https://twitter.com/tri_dao/status/1531437619791290369?t=UXOZXyk1p9CCrMJLlkDcDg&s=19 Abstract: " Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM…  ( 1 min )
    [D] Training audio for ASR in the face of extreme speaker/device imbalance
    Hi, hope all is well. So, basically, I've recently transitioned into an audio-related field and was wondering how people handle several issues regarding machine learning models for audio. The first issue is speaker imbalance. The datasets that I've been working with usually have several prominent speakers that make up the majority of the dataset, and many speakers that each contribute, say, less than 1% of the dataset. Assuming an ASR (speech recognition) task, should I keep the same proportions, or should I apply various techniques to balance out the dataset? I've heard that using zero-shot TTS (text to speech) to add synthetic samples for minority speakers is kind of valuable, but I'm not familiar with other modern approaches for balancing (specifically for audio, I suppose). I guess the same goes for devices. The various datasets are recorded with various devices, and, you guessed it, most recordings come from a small subset of them. Thanks for your responses in advance :) submitted by /u/Slowai [link] [comments]  ( 1 min )
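One common way to soften the speaker (or device) imbalance described above, short of synthesizing data, is to resample utterances with weights that shrink as a speaker's share grows. The following is a minimal sketch and not a recommendation from the post: the function name `utterance_weights`, the exponent `alpha`, and the toy speaker labels are all illustrative assumptions.

```python
import random
from collections import Counter


def utterance_weights(speakers, alpha=1.0):
    """Return one sampling probability per utterance.

    speakers: list with the speaker id of each utterance.
    alpha:    hypothetical knob; 0.0 keeps the original proportions,
              1.0 gives every speaker the same total probability mass.
    """
    counts = Counter(speakers)
    raw = [counts[s] ** (-alpha) for s in speakers]
    total = sum(raw)
    return [w / total for w in raw]


# Toy data (made up): speaker "a" has 8 utterances, speaker "b" has 2.
speakers = ["a"] * 8 + ["b"] * 2

balanced = utterance_weights(speakers, alpha=1.0)
original = utterance_weights(speakers, alpha=0.0)

# Draw a small batch of utterance indices using the balanced weights.
rng = random.Random(0)
batch = rng.choices(range(len(speakers)), weights=balanced, k=4)
```

Values of `alpha` between 0 and 1 interpolate between the two extremes, which may be preferable when fully balancing would repeat a handful of rare-speaker utterances too often. The same function applies unchanged if the labels are device ids instead of speaker ids.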
    [D] Machine learning "for good"?
    I am finishing a Ph.D. in NLP/ML and starting to think about where to go next. I will likely start with an internship / some work in a big company (FAANG or similar, I have already passed some interviews), but if I think of spending my life working to ultimately increase click rates on products/ads I really get depressed. I am extremely attracted to companies with a high technical level, but I also need to have the sensation of doing something meaningful (this does not necessarily mean working for an NGO, but there are many open problems worth solving, I believe). I just do not understand why FAANG can really make things work most of the time with the ultimate goal of making billionaires richer, while companies working on important problems are unorganized and ultimately often come up with solutions that are probably technically very far from the optimum. Not that I believe ML is gold and we will get AGI in 5 years (I see the whole marketing bubble), but if Uber can optimize their taxis, you can optimize streets to minimize traffic, or improve the 911 strategy. So (rant over) do you know of any company that deals with an "important"/real problem and has a high technical level? For example anything in health, education, public whatever, NGOs, poverty prevention, etc. submitted by /u/ombelicoInfinito [link] [comments]  ( 6 min )
    [D] mlflow vs kubeflow vs cortex vs Argo from a data scientist perspective?
    Hi I'm looking for a data scientist workflow solution - all the way from experimentation to deployment. But very oriented for a data scientist - which means a very dashboardy out-of-the-box experience. They are probably not interested in creating some custom flows using an SDK or something. Deployment and drift tracking would be nice to have...but not mandatory. I hear a lot about kubeflow...but then people seem to hate it as well. Mlflow is nice..but it seems you need to orchestrate it yourself (or deploy it on k8s anyway). Which one are you using in ur company for your work ? Not for personal projects...but a true org level experience in production? submitted by /u/sandys1 [link] [comments]  ( 1 min )
    [D] Is the KDD 2022 conference worth the $1100 price tag?
    The early bird cost is around~$1100. I work in industry so I'm not a researcher and I was looking to attend some conferences to expand my network. Conferences like ICML or NIPS seem way too theoretical and the people involved are on a completely different level compared to what I do day-to-day. I build xgboost models mostly and occasionally use some standard deep learning methods from blogposts for my work. I was hoping to attend conferences where I can connect with people more on this level instead of actual scientists who use game theory to solve eigenvectors (more power to them but I have little in common with these NASA types). Is the KDD the right conference for me? I'm open to any other suggestions as well. submitted by /u/sybar142857 [link] [comments]  ( 2 min )
    [D] Has anyone trained static word embeddings like fastText on a multilingual corpus, similar to XLM-R or mBERT?
    Training contextual (BERT-style) models on multilingual data seems pretty standard nowadays (XLM-R, mBERT, many more), but I could not find many resources on training static word embeddings like fastText on a multilingual corpus (simply monolingual data from many languages concatenated together). For static embeddings, I mostly see people aligning the embedding spaces of monolingual embeddings after they were trained. Has anyone tried this, or does anyone know some papers where it was tried? I'm curious if it would work or if these bigger types of models are necessary to pull it off. submitted by /u/optimized-adam [link] [comments]  ( 1 min )
    [D]:- NLG: Does it make sense to add a discriminator model on top of pre-trained language generation models like GPT, XLNet, etc?
    Background: I have a dataset of essays and some classes associated with them. The dataset is imbalanced, and collecting more data is not an option, so I was looking at text generation methods. I am trying to generate text samples that are almost identical to the original samples. So, provided I add keywords and classes as inputs to the generative model, is it necessary to add a discriminative model that predicts whether a sample is real or fake? submitted by /u/Creative_Jellyfish53 [link] [comments]  ( 1 min )
    [R] Multi-Game Decision Transformers
    Blog: https://sites.google.com/view/multi-game-transformers Paper: https://arxiv.org/pdf/2205.15241.pdf ​ https://preview.redd.it/mxritsjhjt291.png?width=1280&format=png&auto=webp&s=fe0a1a97483a0abdb553c849b46d527691fc658e Clarifies quite a lot of findings of GATO in a neat way. Scale helps (as always ;)), transfer learning capabilities are evident:- ... We hence devise our own evaluation setup by pretraining DT, CQL, CPC, BERT, and ACL on the full datasets of the 41 training games with 50M steps each, and fine-tuning one model per held-out game using 1% (500k steps) from each game... It also appears adding more data, whether expert or non-expert still allows DT to gain the edge over Behavioral cloning+expert data. It also achieves super human level performance across 41 games, so catastrophic forgetting seems less relevant and perhaps alleviated by scaling alone... I hope the next paper explores MoEs, they've been quite underappreciated lately. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [D] Do you need to grind leetcode for junior or mid level AI/ML Engineer interviews?
    Are AI/ML interviews centred on DSA problem solving found on the well-known websites without much emphasis on specific AI/ML theory or would they feature Pytorch/TensorFlow model creation/training problems? How much emphasis is placed on relevant experience and past projects? submitted by /u/redxammer [link] [comments]  ( 4 min )
    [D] How to measure metric consistency/stability
    Suppose I have some model, a metric and a test set. Now I get some metric value for the whole set. I'm interested in understanding the metric distribution across different "areas" of the test set. For example, suppose I have a sentiment analysis model (Negative, Neutral, Positive) and I use accuracy as the metric (suppose the classes are balanced). I also have a test set of 2 years of Twitter posts (2021-2022). Now I may get an overall accuracy of 80%, but if I look inside, it may be the case that I have 70% during 2021 and 90% during 2022. Or I may have 80% accuracy overall, but for "politics" posts I have an accuracy of 60%. So I want to measure the noise of my metric across some dimensions (not random). Is there any standard way to do so? Or should I just break my data into different dimensions and look at the mean and std? Thanks submitted by /u/sudo_su_ [link] [comments]  ( 1 min )
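The "break the data into dimensions and look at the mean and std" idea from the post can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (a dict with a slice field and a 0/1 `correct` flag), the helper names, and the toy numbers (chosen to reproduce the 70% / 90% example above) are all hypothetical. A bootstrap estimate is included because a small slice can deviate from the overall metric purely through sampling noise.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def accuracy_by_slice(records, key):
    """Group records by records[i][key] and return the accuracy of each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["correct"])
    return {name: mean(flags) for name, flags in groups.items()}


def bootstrap_std(flags, n_resamples=1000, seed=0):
    """Estimate the sampling noise of an accuracy by resampling with replacement."""
    rng = random.Random(seed)
    n = len(flags)
    estimates = [mean(rng.choices(flags, k=n)) for _ in range(n_resamples)]
    return pstdev(estimates)


# Toy test set (made up): 10 posts per year, 1 = correct prediction.
records = (
    [{"year": 2021, "correct": 1}] * 7 + [{"year": 2021, "correct": 0}] * 3
    + [{"year": 2022, "correct": 1}] * 9 + [{"year": 2022, "correct": 0}] * 1
)

overall = mean(rec["correct"] for rec in records)
per_year = accuracy_by_slice(records, "year")
spread = pstdev(list(per_year.values()))  # std of the per-slice accuracies

# Sampling noise of the 2021 slice alone, to compare against the spread.
noise_2021 = bootstrap_std([r["correct"] for r in records if r["year"] == 2021])
```

If the spread across slices is not clearly larger than the bootstrap noise within a slice, the apparent inconsistency may just reflect small slice sizes rather than a real difference in model behavior.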
    [R] BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis
    https://arxiv.org/abs/2205.14807 submitted by /u/scallion000 [link] [comments]  ( 1 min )
    [Project] Predicting customer purchase. One or multiple models?
    Hi all, I am looking for your inputs for a current project in my team within ML/DS. Would appreciate your answers! TLDR; Questions at the bottom. Trying to develop a model. Is it better to have one model for each segment, or one to rule them all? In my team, we are trying to predict whether a customer will buy a device in the next month. We call it the propensity to buy model. At first, we were passing all the available features that are from 1st of the month, as input to the model, with the label being a binary variable indicating whether someone made a purchase anytime between the 1st and 31st. Features include demographics (which I think are useless to be honest), tenure, current subscribed product, etc. Most of which are static, while a a few are dynamic each month. This mo…  ( 4 min )
    [D] Why arxiv-sanity isn't working?
    arxiv-sanity-lite.com (ASL) doesn't even show the paper I'm looking for on the first or second page. The 'Similar' option at ASL works well, but 'show similar' on arxiv-sanity is way better. Connectedpapers, zeta-alpha, and research rabbit are other options, but I still like arxiv-sanity. submitted by /u/SAbdusSamad [link] [comments]
    [D] How could one show that a modification to a model yields better convergence?
    Hi, I am thinking about showing that a modification I made to a GAN yields better convergence during training. How can I show this? Is there some way to show this mathematically or experimentally? What is your opinion about this? submitted by /u/SeucheAchat9115 [link] [comments]  ( 1 min )
    [D]How to limit the label value while avoiding introducing bias?
    I have a deep neural net model with an integer label to predict. The label is heavily skewed, so we cap the labels at some value (let's say the 90th percentile). When we build and run the model, it performs well in general. But an online experiment shows degradation in business metrics for the fraction of clients that have their value capped. If we don't cap the label, the business metrics get skewed for users with a low number of activities. What are my best options to deal with such an issue? Adding a new feature? Multi-tower learning? Any idea can be super helpful. Thanks. submitted by /u/Which-Distance1384 [link] [comments]  ( 1 min )
    [R] Detecting danger in gridworlds using Gromov's Link Condition
    submitted by /u/tfburns [link] [comments]  ( 2 min )
    [R] Question: Say, I have a sparse vector and I want to reduce the vector to a few descriptive values. What values would you choose?
    I have tried min, max, mean, standard deviation, and sum. What other descriptive statistics would you choose? I would be very grateful if your answer is detailed or backed by references/papers. Many thanks. submitted by /u/lal-mohan [link] [comments]  ( 1 min )
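Beyond the statistics listed in the question, a sparse vector is often summarized by how many entries are nonzero and how concentrated its mass is. The sketch below is one possible set of descriptors, not an answer taken from the thread: the function name `describe_sparse` and the choice of features are assumptions. The `hoyer` value follows Hoyer's sparseness measure, which is 1 for a one-hot vector and 0 when all entries have equal magnitude.

```python
import math


def describe_sparse(vec):
    """Summarize a sparse vector (given as a plain list of numbers)."""
    n = len(vec)
    nonzero = [v for v in vec if v != 0]
    nnz = len(nonzero)
    l1 = sum(abs(v) for v in nonzero)
    l2 = math.sqrt(sum(v * v for v in nonzero))

    # Entropy of the normalized magnitudes: low when mass sits on few entries.
    entropy = 0.0
    if l1 > 0:
        for v in nonzero:
            p = abs(v) / l1
            entropy -= p * math.log(p)

    # Hoyer sparseness: (sqrt(n) - L1/L2) / (sqrt(n) - 1), defined for n > 1.
    hoyer = None
    if n > 1 and l2 > 0:
        hoyer = (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

    return {
        "nnz": nnz,                                   # number of nonzero entries
        "density": nnz / n if n else 0.0,             # fraction of nonzero entries
        "l1": l1,
        "l2": l2,
        "mean_nonzero": sum(nonzero) / nnz if nnz else 0.0,
        "entropy": entropy,
        "hoyer": hoyer,
    }


summary = describe_sparse([0, 0, 3, 0, 4, 0, 0, 0])
one_hot = describe_sparse([0, 5, 0, 0])
```

Which of these are useful depends on what the downstream model needs; the count and density capture how sparse the vector is, while entropy and the Hoyer value capture how unevenly the nonzero mass is spread.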
    [D] Neural theorem prover and logical reasoning connection
    I have seen works based on neural theorem provers, like End-to-End Differentiable Proving. I am unable to grasp the idea behind using theorem proving for logical reasoning tasks. How can proving a theorem help us solve a problem like: given a situation and rules, what should be done to achieve the goal? This paper tried to address this type of logical reasoning by combining neural networks with symbolic reasoning. But I feel lost while reading the approach, which views the whole problem setup as theorem proving or proof generation. If you have an idea about theorem provers and their use case in logical reasoning, please share. submitted by /u/projekt_treadstone [link] [comments]  ( 1 min )
    [P] A newsletter for bite-size content about ML/NLP
    We have created a newsletter, MLnotes, that provides bite-size content about Machine Learning and NLP tips, interviews, and applications across various industries. The goal is not to create very long tutorial-like content, but to keep it fairly short. Therefore, it's good if you're trying to study or refresh your knowledge about ML/NLP. Even if you have no background, you'll get the clues to get started. I'd appreciate it if you have any comments or suggestions. submitted by /u/ma1ms [link] [comments]  ( 1 min )
  • Open

    "Towards Learning Universal Hyperparameter Optimizers with Transformers", Chen et al 2022 {G} (Decision Transformer?)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Multi-Agent Reinforcement Learning is a Sequence Modeling Problem", Wen et al 2022 (Decision Transformer for MARL: interleave agent choices)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do you stay up to date in Reinforcement Learning research?
    Besides following the right companies/people on Twitter and this subreddit, how do you people stay up to date on what is going on in Deep/Reinforcement Learning research? What journals to follow, what conferences to attend? I'll leave here a few options, but I would like to know more. - Twitter (for general news, not much for discussions): DeepMind, OpenAI, Hugging Face, Yann LeCun, Ian Goodfellow, François Chollet, Fei-Fei Li, Andrej Karpathy... - Conferences: ICLR, NeurIPS, ICML, IEEE SaTML, AAAI, AISTATS, AAMAS, COLT... - Eventually search your favorite researchers/topics on arXiv.org Any podcasts or anything else? submitted by /u/TheKeyZero [link] [comments]  ( 1 min )
    Is it possible to design a multiplicative/exponential reward function? A reward func that depends on current accumulated reward?
    Hey everyone, In the context of my problem, the "true" reward is not additive. Realistically, the more reward the agent has already accumulated, the easier it becomes to accumulate even more. That's to say, the real reward function is partially dependent on previously accumulated reward. Is there any way to implement this kind of dynamic successfully? I have tried to, but for some reason, the agent completely stops learning when I do this. I can implement a linear/additive reward function and the agent does learn good behaviors, but I feel that it's important for the agent to "understand" the true reward dynamic. Essentially, here is the reward function I have: reward = points_gained_this_step And here's the kind of reward that I want (because it actually fully represents the problem): reward = points_gained_this_step*(total_score_so_far) total_score_so_far = total_score_so_far + reward Does anyone have experience implementing something like this successfully? Any advice would be greatly appreciated. EDIT: Do vanishing/exploding gradients have anything to do with it? (Given the exponential growth/decay nature of the rewards in this case) EDIT 2: The "total_score_so_far" is already in my observation space submitted by /u/VladimirB-98 [link] [comments]  ( 2 min )
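One observation that may help with the reward described above: with `reward = points * total` and `total += reward`, the score evolves as `total_next = total * (1 + points)`, so the final score is the starting score times the product of all `(1 + points)` factors. Taking logarithms turns that product into a sum, which means the additive reward `log(1 + points)` has an undiscounted return equal to the log of the final score. The sketch below checks this numerically. It assumes that the objective is the final total score, that every `points` value is greater than -1, and that the starting score is positive; the function names and the sample values are made up for illustration.

```python
import math


def multiplicative_score(points, start=1.0):
    """Final score when each step's reward is points * total, as in the post."""
    total = start
    for p in points:
        reward = p * total  # grows with the score accumulated so far
        total += reward     # equivalent to total *= (1 + p)
    return total


def log_rewards(points):
    """Additive per-step rewards whose sum is log(final score / start score)."""
    return [math.log1p(p) for p in points]


# Hypothetical per-step point gains for one episode.
points = [0.1, 0.5, 0.2, 0.0, 0.3]

final_score = multiplicative_score(points)
log_return = sum(log_rewards(points))

# The two views agree: exp(sum of log rewards) equals the final score.
assert math.isclose(math.exp(log_return), final_score)
```

Because the log rewards stay bounded even when the raw multiplicative rewards grow exponentially, this formulation may also avoid the very large value targets that could be behind the stalled learning mentioned in the first edit, although that link is a guess rather than something the post establishes.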
    "Multi-Game Decision Transformers", Lee et al 2022 {G} (ALE Decision Transformer/Gato: near-human offline single-agent w/scaling & rapid transfer)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How to generate coordinates in 3D space for tracking problem in drones?
    Hi, I am working on combining deep RL with model predictive control for safe RL in quadrotors. I am using the DDPG method to train a quadrotor on a tracking problem. I have tried many types of reward functions, but with each one of them the drone behaves erratically at test time. I think it may be due to the random 3D points I generated while training the agent. Some research papers used proper trajectory data (like circular or spiral trajectories) for their tracking problem. I am really confused about how to approach this problem, because if I use only one trajectory, then the agent will not be able to track other trajectories and will also not be robust to disturbances. Thus, I wanted to know what you think would be a solution to this problem. How do I generate 3D points for different types of trajectories? (Python code for generating points would also be very helpful.) submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
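Since the post asks for Python code that generates 3D points for different trajectory types, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the three families (circle, helix, figure-eight), the parameter names, the default sizes and heights, and the idea of sampling a random family and scale per episode. It produces waypoints only and says nothing about timing, velocities, or whether a given quadrotor can actually follow them.

```python
import math
import random


def circle(n, radius=1.0, height=1.0):
    """n points on a horizontal circle at constant height."""
    return [
        (radius * math.cos(2 * math.pi * i / n),
         radius * math.sin(2 * math.pi * i / n),
         height)
        for i in range(n)
    ]


def helix(n, radius=1.0, turns=2.0, z_start=0.5, z_end=2.0):
    """n points (n >= 2) on a spiral that climbs from z_start to z_end."""
    points = []
    for i in range(n):
        s = i / (n - 1)  # progress along the path, from 0 to 1
        angle = 2 * math.pi * turns * s
        points.append((radius * math.cos(angle),
                       radius * math.sin(angle),
                       z_start + (z_end - z_start) * s))
    return points


def figure_eight(n, size=1.0, height=1.0):
    """n points on a horizontal figure-eight: x = sin t, y = sin t * cos t."""
    points = []
    for i in range(n):
        t = 2 * math.pi * i / n
        points.append((size * math.sin(t),
                       size * math.sin(t) * math.cos(t),
                       height))
    return points


def sample_trajectory(rng, n=200):
    """Pick a random family and scale so that training sees varied references."""
    scale = rng.uniform(0.5, 2.0)
    kind = rng.choice(["circle", "helix", "figure_eight"])
    if kind == "circle":
        return kind, circle(n, radius=scale)
    if kind == "helix":
        return kind, helix(n, radius=scale)
    return kind, figure_eight(n, size=scale)


rng = random.Random(0)
kind, reference = sample_trajectory(rng)  # one reference path for one episode
```

Calling `sample_trajectory` at the start of every training episode gives smooth, connected references instead of independent random points, while still exposing the agent to more than one shape.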
    PhD student looking to identify a research topic in RL for controls applications .
    Hello, I have been reading through quite a few papers/topics discussing model-free vs model-based RL etc. I have not been able to find something, maybe because I don't understand it well enough yet :). Just for the background: my experience is with diesel and SI engines, vehicles and controls. One of the topics/areas that seems interesting to me is learning using RL in uncertain scenarios, though this might seem too broad to most people. Another possible area would be RL for connected vehicles, self-driving etc. Any help/suggestion is welcome. submitted by /u/SeasonedLeo [link] [comments]  ( 2 min )
    Question about reward function evaluation
    Is the paper QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS enough to learn how to evaluate a reward function? Evaluating reward functions is kinda new to me. submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Best maze RL-Solver
    Hey guys, I am looking for the best performing RL algorithm for solving mazes. Off the top of my head, I'd go with Q-learning or SARSA(lambda). What do you guys think? I can't really find literature specifically on that. Maze: simple 2d deterministic grid world submitted by /u/Kartoffelman98 [link] [comments]  ( 1 min )
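For a small 2d deterministic grid world like the one described, tabular Q-learning is one of the two options named in the post, and a baseline fits in a few dozen lines. The sketch below is illustrative only: the maze layout, the step cost of -1, and the hyperparameters (`alpha`, `gamma`, `epsilon`, episode count) are assumptions, and it makes no claim about being the best performing choice.

```python
import random

# Hypothetical 4x4 maze: S = start, G = goal, # = wall, . = free cell.
MAZE = [
    "S.#.",
    ".##.",
    "....",
    ".#.G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def find(char):
    """Return the (row, col) of the first cell containing char."""
    for r, row in enumerate(MAZE):
        c = row.find(char)
        if c >= 0:
            return (r, c)
    raise ValueError(char)


START, GOAL = find("S"), find("G")


def step(state, action):
    """Deterministic transition: moving into a wall or off the grid stays put."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#":
        state = (nr, nc)
    return state, -1.0, state == GOAL  # every step costs 1 until the goal


def train(episodes=2000, alpha=0.5, gamma=0.95, epsilon=0.1,
          max_steps=200, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {}

    def q_row(s):
        return q.setdefault(s, [0.0] * len(ACTIONS))

    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            row = q_row(state)
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = row.index(max(row))
            nxt, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q_row(nxt))
            row[action] += alpha * (target - row[action])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, limit=50):
    """Follow the learned greedy policy from START for at most limit steps."""
    state, path = START, [START]
    while state != GOAL and len(path) <= limit:
        row = q.get(state, [0.0] * len(ACTIONS))
        state, _, _ = step(state, row.index(max(row)))
        path.append(state)
    return path


q_table = train()
path = greedy_path(q_table)
```

Because the environment is deterministic and the table is tiny, this kind of baseline is mostly useful as a reference point: SARSA(lambda) or any other candidate can be compared against it on the same maze using the number of episodes needed before `greedy_path` reaches the goal.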
    How Good is Udacity Deep Learning Nanodegree?
    submitted by /u/MlTut [link] [comments]  ( 1 min )
    SOTA of RL in precise motion control of robot
    Hi, when training an agent and evaluating the trained agent, I have realized that the agent tends to show slightly different behavior/performance even if the goal remains the same. I believe this is due to the stochastic nature of RL. But, how can this agent be then transferred to the reality, when the goal lies for example in the precise control of a robot? Are you aware of any RL work that deals with the real robot for precise motion controlling? (for instance, precisely placing the robot's tool at the goal position) submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
    Inverse Reinforcement Learning: Continuous Action-State Space
    I am pretty new to Reinforcement Learning. I am currently working on a project where I need to find the reward function of gamblers' behavior. I have a dataset of each player playing on a casino machine which stores each transaction performed by the user and related information. I can have one action as how much the user has bid, and for the state, I was thinking to have transaction time and the amount in the machine on each transaction. I couldn't find many implementations on the internet about a related problem. I was planning to use maxEntropy IRL or PL-IRL, but have no clue on how to implement these algorithms for my problem. Could someone help me with how I should approach this problem or suggest a way in which I can implement these models? submitted by /u/gjariwala9 [link] [comments]  ( 1 min )
  • Open

    Seamlessly connect Amazon Athena with Amazon Lookout for Metrics to detect anomalies
    Amazon Lookout for Metrics is an AWS service that uses machine learning (ML) to automatically monitor the metrics that are most important to businesses with greater speed and accuracy. The service also makes it easier to diagnose the root cause of anomalies, such as unexpected dips in revenue, high rates of abandoned shopping carts, spikes […]  ( 7 min )
  • Open

    Introducing the Data Product Development Canvas (Version 1.0)
    I think one of the most important data consumption developments in the age of Big Data and AI is the concept of a Data Product. Data Products are a category of domain-infused, AI/ML-powered apps designed to help non-technical users manage data and analytics-intensive operations to achieve specific, meaningful, and relevant business outcomes. Some key aspects… Read More »Introducing the Data Product Development Canvas (Version 1.0) The post Introducing the Data Product Development Canvas (Version 1.0) appeared first on Data Science Central.  ( 6 min )
  • Open

    The Closer: Machine Learning Helps Banks, Buyers Finalize Real Estate Transactions
    The home-buying process can feel like an obstacle course: finding the perfect place, putting together an offer and, the biggest hurdle of all, securing a mortgage. San Francisco-based real-estate technology company Doma is helping prospective homeowners clear that hurdle more quickly with the support of AI. Its machine learning models accelerate properties through the… The post The Closer: Machine Learning Helps Banks, Buyers Finalize Real Estate Transactions appeared first on NVIDIA Blog.  ( 3 min )
    Fantastical 3D Creatures Roar to Life ‘In the NVIDIA Studio’ With Artist Massimo Righi
    The year of the tiger comes into focus this week In the NVIDIA Studio, which welcomes 3D creature artist Massimo Righi. An award-winning 3D artist with two decades of experience in the film industry, Righi has received multiple artist-of-the-month accolades and features in top creative publications. The post Fantastical 3D Creatures Roar to Life ‘In the NVIDIA Studio’ With Artist Massimo Righi appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    DoWhy evolves to independent PyWhy model to help causal inference grow
    Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven evaluation of many technological systems we build today.  To help data scientists better understand and […] The post DoWhy evolves to independent PyWhy model to help causal inference grow appeared first on Microsoft Research.  ( 7 min )
  • Open

    How does AI Help Doctors, Physicians, and Healthcare Setups?
    Despite the more vocal impacts, Artificial Intelligence hasn’t had a seamless entrance across verticals. Yet, AI-based resources in…  ( 3 min )
    The Mind of Machine and Men
    How “Can’t Help Myself” helps us understand consciousness and empathy. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 6 min )
    The Many Implications of Human Cloning
    “Clones are organisms that are exact genetic copies of any living organism. Every single bit of their DNA is identical.”  ( 5 min )

  • Open

    How do I save ram usage using the Breakout dataset from gin
    I tried to import the episodes I needed (1,000,000 stacks of 4 84x84 grayscale images) using d4rl-atari, but that blows past the RAM I have (16GB). Has anyone worked with this dataset who knows how I can save on RAM? submitted by /u/PM_ME_FREE_GAMES [link] [comments]  ( 1 min )
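    A rough back-of-the-envelope sketch of why the load does not fit in 16GB, and of one common workaround. It assumes (this is not stated in the post) that consecutive stacks overlap by three frames, as in standard Atari frame stacking, so each frame can be stored once as `uint8` and stacks rebuilt on demand. The helper names `nbytes` and `stack_at` are hypothetical, and episode boundaries are ignored.

```python
import numpy as np

FRAME_SHAPE = (84, 84)   # one grayscale frame
STACK = 4                # frames per stacked observation
N_OBS = 1_000_000        # number of stacked observations in the post


def nbytes(n_items: int, item_shape: tuple, dtype) -> int:
    """Bytes needed to hold n_items arrays of the given shape and dtype."""
    return n_items * int(np.prod(item_shape)) * np.dtype(dtype).itemsize


# Materialising every stack separately:
stacked_u8 = nbytes(N_OBS, (STACK, *FRAME_SHAPE), np.uint8)     # 28_224_000_000 bytes
stacked_f32 = nbytes(N_OBS, (STACK, *FRAME_SHAPE), np.float32)  # 112_896_000_000 bytes

# Storing each frame once (assumes overlapping stacks, see lead-in):
frames_u8 = nbytes(N_OBS + STACK - 1, FRAME_SHAPE, np.uint8)    # 7_056_021_168 bytes


def stack_at(frames: np.ndarray, t: int, stack: int = STACK) -> np.ndarray:
    """Rebuild the observation that ends at frame index t from single frames.

    Returns a view of shape (stack, 84, 84); no data is copied.
    """
    if t < stack - 1:
        raise ValueError("not enough preceding frames for a full stack")
    return frames[t - stack + 1 : t + 1]


# Tiny demonstration on dummy data.
demo_frames = np.zeros((10, *FRAME_SHAPE), dtype=np.uint8)
demo_obs = stack_at(demo_frames, t=5)

print(f"stacked uint8:   {stacked_u8 / 1e9:.1f} GB")
print(f"stacked float32: {stacked_f32 / 1e9:.1f} GB")
print(f"single frames:   {frames_u8 / 1e9:.1f} GB")
```

    Under these assumptions the stacked `uint8` copy alone needs about 28 GB, while the single-frame layout needs about 7 GB. If even that is too much, a disk-backed array such as `np.memmap` is another option to consider.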
    "Multitasking Inhibits Semantic Drift", Jacob et al 2021
    submitted by /u/gwern [link] [comments]
    Why do I love Reinforcement Learning?
    Just a generic read, please leave your thoughts/feedback. https://medium.com/@vishalgarg652/why-do-i-love-reinforcement-learning-5c0de2abf7e4 submitted by /u/vishalgarg652 [link] [comments]  ( 1 min )
  • Open

    Last Week in AI: Royal Mail will deliver with drones, backdoors in deep learning, AI for weather forecasts, and more!
    submitted by /u/regalalgorithm [link] [comments]
    The Ultimate Disco Diffusion Tutorial ... EVERY Feature ... EXPLAINED
    submitted by /u/JoshGrambo [link] [comments]
    AI Dream 53 - Cosmic Birth | MASTERPIECE FINAL RAW
    submitted by /u/LordPewPew777 [link] [comments]
    The Limits of Automation
    submitted by /u/glaringconstraint [link] [comments]
    Can self learning ai software be created as a hardware emulator that automatically replicates any detected computer hardware?
    I was wondering whether it's possible to have a server with self-learning AI software that functions like a hardware emulator. The AI would automatically detect any hardware, replicate it, save the information, and store it for later use. submitted by /u/PlankOfWoood [link] [comments]  ( 1 min )
    No, GPT-3 Cannot Click On Links
    submitted by /u/drcopus [link] [comments]  ( 1 min )
    Fractal - 4K AI Art Visualization
    submitted by /u/MLInsights [link] [comments]
    Learning path to become an AI engineer?
    From my understanding the learning path should be: Data Engineer (good knowledge of SQL, Understanding of ETL, Big Data Analytics Tools such as Cassandra, Hive, Apache Spark) Data Analyst (Exploratory data Analysis - PowerBI) Data Scientist / AI engineer (ML algos , DL algos, business use cases for NLP, CV). Where I stand today: I have a bachelor's degree in comp. science. and decent knowledge of Web Development (HTML, CSS, JS), Cloud computing (AWS) and DevOps (Git, Docker, Kubernetes). Pertaining to skills required for Data Science & AI - Math: I am good at it SQL: I have decent enough understanding of SQL. MSSQL and MySQL, but I haven't tried Big Data tools and also lack interest in those PowerBI: I am good at PowerBI - DAX, nice layouts, good design sense but not ve…  ( 2 min )
    Hugging Face Endpoints on Azure | Rubik's Code
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Will you let artificial intelligence take your job?
    Direct answers only. View Poll submitted by /u/trillswan [link] [comments]  ( 1 min )
    Great video...The AI revolution is on its way...new industrial revolution coming soon...
    submitted by /u/the_anonymizer [link] [comments]
    Tesseract - 4K Neural Art Visualization
    submitted by /u/MLInsights [link] [comments]
  • Open

    A step forward in CIKM 2022 review process [D]
    As the title says, the email to reviewers explicitly states that there are no preconceived limits on the "acceptance rate", please do a good job reviewing. How this statement stands up for the final decisions, who knows, but it is a good step forward. submitted by /u/not_novel_enough [link] [comments]  ( 1 min )
    [D] Experience with Vertex AI?
    Have you used Google's Vertex AI for model deployment and management in production? If so, what is your experience with it for maintaining ML pipelines? We are considering it at my work and was curious to know about the difficulties and challenges. Feel free to share your thoughts on other frameworks/tools as well. Thanks in advance. submitted by /u/zxqkv [link] [comments]  ( 1 min )
    "[Discussion]", "[D]" New Data Science Interview Reddit Community
    Hi All, I started a new Reddit community for people to share data science interview experiences and tips at https://www.reddit.com/r/DataScienceInterview/. submitted by /u/PythonDataScientist [link] [comments]
    [D] Datasets and Models for Structured Information Extraction on HTML
    I like the idea of basically summarizing an HTML to the markup schema one is interested in. So I recently stumbled upon the WebFormer paper (https://arxiv.org/pdf/2202.00217v1.pdf) and wanted to try it out for different categories of structured information. As often, neither the weights nor the code is available. When it comes to datasets, it seems like there are not many options. The SWDE dataset used in the paper is not available anymore. I wasn't able to find any other appropriate dataset with HTML to Markup. Web Data Commons provides a subset of Common Crawl with sites that have Markup data, but I still need to investigate this further. When it comes to models, I would love to start with some pre-trained models as a baseline as I want to investigate how little data I can start with to see any reasonable results with fine-tuning. Now the question is, what model would be best for HTML based tokens? I would love to hear your suggestions for models and datasets to approach this problem. submitted by /u/theamaru [link] [comments]  ( 1 min )
    [D] Finding an optimal threshold in multi-class classification problem
    For binary class problems, finding a probability threshold would include trying different threshold values and using the threshold that yields the highest accuracy/precision/recall depending on the most relevant metric for the problem. Considering multi-class classification algorithms would output N probabilities does it make sense to have a common threshold for all classes or to treat each class individually and get N different thresholds? If so how does this scale with a high number of classes? Are there better techniques than these? Few pointers about the use case: Number of classes >10000. Cost of misclassification of all classes is equal. I want to also be able to output "unidentified" if no class probability crosses the respective threshold. submitted by /u/BuddhaSadhu666 [link] [comments]  ( 2 min )
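    One way to make the question concrete is the sketch below, which is an illustration rather than a recommended method. It takes the top-scoring class and rejects the prediction as "unidentified" when its probability falls below a threshold that is either shared by all classes or given per class. The function names, the `UNIDENTIFIED = -1` sentinel, and the selection rule (smallest common threshold whose accepted predictions reach a target accuracy) are all assumptions made for illustration.

```python
import numpy as np

UNIDENTIFIED = -1  # hypothetical sentinel for "no class crossed its threshold"


def predict_with_reject(probs, thresholds):
    """Top-1 prediction with a reject option.

    probs:      array of shape (n_samples, n_classes)
    thresholds: a scalar (common threshold) or an array of shape (n_classes,)
    """
    probs = np.asarray(probs, dtype=float)
    top = probs.argmax(axis=1)                      # most likely class per sample
    top_p = probs[np.arange(len(probs)), top]       # its probability
    thr = np.broadcast_to(np.asarray(thresholds, dtype=float), (probs.shape[1],))
    return np.where(top_p >= thr[top], top, UNIDENTIFIED)


def pick_common_threshold(probs, labels, candidates, min_accuracy=0.95):
    """Smallest candidate threshold whose accepted predictions reach min_accuracy.

    Returns None if no candidate qualifies.
    """
    labels = np.asarray(labels)
    for t in sorted(candidates):
        pred = predict_with_reject(probs, t)
        accepted = pred != UNIDENTIFIED
        if accepted.any() and (pred[accepted] == labels[accepted]).mean() >= min_accuracy:
            return t
    return None


probs = [[0.70, 0.20, 0.10],
         [0.40, 0.35, 0.25],
         [0.10, 0.10, 0.80]]
labels = [0, 1, 2]

print(predict_with_reject(probs, 0.5))              # [ 0 -1  2]
print(predict_with_reject(probs, [0.5, 0.5, 0.9]))  # [ 0 -1 -1]
print(pick_common_threshold(probs, labels, [0.3, 0.5, 0.7], min_accuracy=1.0))  # 0.5
```

    Both variants cost one argmax and one comparison per sample, so prediction scales to more than 10,000 classes. Fitting a separate threshold per class, however, needs enough validation examples of every class, which is the harder constraint in practice.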
    [D] Comparison of TPU vs GPU for fast & cheap inference in 2022?
    I'm trying to understand how TPUs and GPUs compare for inference (not training!), in terms of (a) financial cost and (b) speed. Does anyone know the answer, or could anyone point me towards some blog post with the answer? Many of the resources I've found are sadly 2-4 years out of date, and I'd ideally like a more recent, authoritative answer. submitted by /u/RSchaeffer [link] [comments]  ( 1 min )
    [P] Dataset for online news discussions summarization
    I'm currently working on a project on abstractive summarization of online news discussions. I have found two relevant datasets for this task in the literature: a subset of the conversation benchmark dataset from ConvoSumm (2021), and SENSEI. Both include comments relating to a news article, as well as a human-curated summary of the form "Some commenters noted that.. One commenter argued..". Since both contain a limited number of examples (500 and 18 respectively), I was curious whether you are aware of a similar available dataset in the literature. Moreover, I'm tackling this task from a multi-document summarization perspective, so any datasets from that area that could serve as an auxiliary learning task would be welcome. Thank you in advance! submitted by /u/Kounelly94 [link] [comments]  ( 1 min )
    [P] Generate cryptographic encodings using GANs
    Hi guys, I was working on a cryptographic problem for which I need to make a GAN learn how to generate RSA encodings. Example input: b'GxHqH5c3fB/TEQispLvYByl5iPmiYLFq2ZyDQqfNbOpt4UOUWOOI4ZyAd0dWHSdTBhGmf8Psa9Ivo6tEbtAu1BZKIE1m1FwGL38FO6HgmTQnpeoJTqPaadq9wdax7TF3XZJFBeqNRChIkePcEt3yComXEKA8gOY2FnlSFo8jTRY=' Example output: b'CRMV1iuaCpuQgxctqob1CIoUbm3twa85Ahi8HKrm7gZjXS3NA5rW60/XNn5uWZ1OAUBXQAIn3e7m3s1Bs6kOV7MjQyqcf5M8KPx+TtifUbNgO8ENUboCqOXOv3/aLwaAbNvRaCucOox6sh70rWXKUzVD6jKZZ7sBWC2VDnklO7Y=' My main goal is to make the model overfit the training data as well as it should be able to generate new encodings for unseen data. Is this thing even possible? If yes what kind of GANs should I use. Also, what would be a good generator and discriminator network in this case? Any help would be appreciated. Thanks. submitted by /u/NoAct7818 [link] [comments]  ( 2 min )
    [R] LayoutLM Word-patch alignment pre-training
    Hi, so LayoutLMv3 was pretrained to predict, for each unmasked text token, whether the corresponding visual tokens were masked or not. But the masked text tokens are excluded from this task. Does anyone understand why? Thanks in advance. submitted by /u/Meddhouib10 [link] [comments]  ( 1 min )
    [D] What do you value in a paper replication?
    Context: Recently read a paper from a few years back that I thought was pretty cool. Ended up replicating the implementation on github, because (1) I believe the idea should be made more accessible, and (2) as good old fashioned practice. Throughout the time spent working on it, replicating training results was dead last in priority, and I nearly forgot about it before considering the exercise complete. Thus my curiosity: r/MachineLearning, what do you value in a paper replication? P.S.: Might as well link the repo while I'm here. Happy to hear any feedback! submitted by /u/coffee869 [link] [comments]  ( 3 min )
    [D] Build, train and track your ML models with a few lines of code
    Today, Layer goes open-source to make machine learning more accessible and contribute to ML's growth and evolution. Machine Learning is becoming the default way to build technology. It's how you make your apps smarter, your systems more reliable and your businesses smarter. This is made possible mostly by open-science efforts, from open-source ML frameworks to open datasets. We will open-source more, including our roadmap. Meanwhile, check out our repo, and don't forget to give us a star! https://github.com/layerai/sdk submitted by /u/mwitiderrick [link] [comments]  ( 1 min )
    [R] Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
    [D] Maintaining documentation with live results from experiments
    Machine Learning projects are heavily data-driven, so documentation needs to contain the outcome of the experiments as well. While I stumbled upon many tools to keep track of experiments in ML (e.g., MLFlow, W&B, etc.), I couldn't find anything to keep documentation of all the experiments with results without having to manually copy-paste. Ideally, I should be able to maintain a live page with results that the entire team can see. What do you all use for documentation purposes? submitted by /u/mighty-dude [link] [comments]  ( 1 min )
    [D] using git or other tools to manage models
    Trying to understand what's the best way to keep track of all the models I'm generating. Playing with git lfs but it seems complicated and not always a perfect match for the problem. I suspect at larger companies there's database support for this, probably associated with an S3 type data store. Looking for ideas about how to do this as an individual developer without too much trouble. submitted by /u/danbmil99 [link] [comments]  ( 1 min )
  • Open

    Conspicuously missing data
    I was working on a report for a client this afternoon when I remembered this comic from Spiked Math. I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also […] Conspicuously missing data first appeared on John D. Cook.  ( 1 min )
  • Open

    How Big Data Can Transform Talent Management
    The world is facing a reskilling emergency. World Economic Forum highlights that over 54% of people will need to reskill and upskill themselves in the next 3 years. Technological advances are upending how work is done. As the world of work changes, the role of the human resource department will no longer be limited to… The post How Big Data Can Transform Talent Management appeared first on Data Science Central.  ( 4 min )
    How IoT Technology Is Improving the Future of Transportation
    IoT stands for ‘Internet of things’ and can be explained as the network of interconnected devices that share data over the communication networks resembling the spider’s web. Everyday objects like kitchen appliances, cars, fitness devices, and smartwatches are connected to the internet by embedded devices that provide smooth communication between people, things, the environment, and… The post How IoT Technology Is Improving the Future of Transportation appeared first on Data Science Central.  ( 4 min )
    Countering Data Tech’s Cheap Speech
    Law professor Eugene Volokh was apparently the first person to popularize the term cheap speech, in an article for the Yale Law Journal in 1995. Recently, law professor Rick Hasen has been promoting a new book of his own titled Cheap Speech. His definition of cheap speech expands on Volokh’s definition. Quoting directly here from a… The post Countering Data Tech’s Cheap Speech appeared first on Data Science Central.  ( 4 min )
    Could ABBAtars be the business model for the metaverse and 5G?
    Last week, the 80s pop group ABBA performed a ‘hologram concert’ based on what they called ‘ABBAtars’. By all measures in the media, it was very successful. From a technological perspective, could it offer a ‘killer app’ for 5G and the Metaverse? Firstly, a hologram concert is not a hologram as we know it… The post Could ABBAtars be the business model for the metaverse and 5G? appeared first on Data Science Central.  ( 2 min )
    Does Your Business Require AI For Dedicated Internet Access?
    Internet connectivity has few available options. So, the choice comes down to a dedicated internet connection or broadband. The choice depends on security, cost, and performance. The speed and quality of broadband depend on the traffic that goes through the network at a given time. So, dedicated internet is the solution for businesses that need… The post Does Your Business Require AI For Dedicated Internet Access? appeared first on Data Science Central.  ( 5 min )
    How to Protect Your Cloud from Cyberattacks During and After Migration
    To reap all the benefits of cloud computing technology, it’s important to secure the cloud during and after migration. The post How to Protect Your Cloud from Cyberattacks During and After Migration appeared first on Data Science Central.  ( 5 min )
    How 3 Key Ecommerce Metrics Can Inform Your Data Analysis
    Ecommerce is a cutthroat industry, and it’s only getting more competitive. The post How 3 Key Ecommerce Metrics Can Inform Your Data Analysis appeared first on Data Science Central.  ( 5 min )
    The Technological Arms Race of Software Licensing
    Software licensing exists in the space between digital services and the real world. Often, a license will denote the number of active users permitted to use the software, as well as their ability to manipulate the source code given. As such, it is a contract often enforced by the software itself, with checks to ensure… The post The Technological Arms Race of Software Licensing appeared first on Data Science Central.  ( 4 min )
  • Open

    NVIDIA Accelerates AI, Digital Twins, Quantum Computing and Edge HPC at ISC 2022
    Researchers grappling with today’s grand challenges are getting traction with accelerated computing, as showcased at ISC, Europe’s annual gathering of supercomputing experts. Some are building digital twins to simulate new energy sources. Some use AI+HPC to peer deep into the human brain. Others are taking HPC to the edge with highly sensitive instruments or accelerating… The post NVIDIA Accelerates AI, Digital Twins, Quantum Computing and Edge HPC at ISC 2022 appeared first on NVIDIA Blog.  ( 4 min )
    The Man With 100,000 Brains: AI’s Big Donation to Science
    Jorge Cardoso wears many hats, and that’s appropriate given he has so many brains. A hundred thousand of them to be exact. Cardoso is a teacher, a CTO, an entrepreneur, a founding member of the MONAI open source consortium and a researcher in AI for medical imaging. In that last role, Cardoso and his team… The post The Man With 100,000 Brains: AI’s Big Donation to Science appeared first on NVIDIA Blog.  ( 3 min )
    The Road to the Hybrid Quantum-HPC Data Center Starts Here
    It’s time to start building tomorrow’s hybrid quantum computers. The motivation is compelling, the path is clear and key components for the job are available today. Quantum computing has the potential to bust through some of today’s toughest challenges, advancing everything from drug discovery to weather forecasting. In short, quantum computing will play a huge… The post The Road to the Hybrid Quantum-HPC Data Center Starts Here appeared first on NVIDIA Blog.  ( 4 min )
    Scientists Building Digital Twins in NVIDIA Omniverse to Accelerate Clean Energy Research
    As global climate change accelerates, finding and securing clean energy is a crucial challenge for many researchers, organizations and governments. The U.K.’s Atomic Energy Authority (UKAEA), through an evaluation project at the University of Manchester, has been testing the NVIDIA Omniverse simulation platform to accelerate the design and development of a full-scale fusion powerplant that… The post Scientists Building Digital Twins in NVIDIA Omniverse to Accelerate Clean Energy Research appeared first on NVIDIA Blog.  ( 4 min )
    HPC Researchers Seed the Future of In-Network Computing With NVIDIA BlueField DPUs
    Across Europe and the U.S., HPC developers are supercharging supercomputers with the power of Arm cores and accelerators inside NVIDIA BlueField-2 DPUs. At Los Alamos National Laboratory (LANL) that work is one part of a broad, multiyear collaboration with NVIDIA that targets 30x speedups in computational multi-physics applications. LANL researchers foresee significant performance gains using… The post HPC Researchers Seed the Future of In-Network Computing With NVIDIA BlueField DPUs appeared first on NVIDIA Blog.  ( 4 min )
    Hyperscale Digital Twins to Give Us “Amazing Superpowers,” NVIDIA Exec Says at ISC 2022
    Highly accurate digital representations of physical objects or systems, or “digital twins,” will enable the next era of industrial virtualization and AI, executives from NVIDIA and BMW said Tuesday. Kicking off the ISC 2022 conference in Hamburg, Germany, NVIDIA’s Rev Lebaredian, vice president for Omniverse and simulation technology, was joined by Michele Melchiorre, senior… The post Hyperscale Digital Twins to Give Us “Amazing Superpowers,” NVIDIA Exec Says at ISC 2022 appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    A.I. Plays Would You Rather
    submitted by /u/BasicallyJustASpider [link] [comments]
    Hugging Face Endpoints on Azure
    submitted by /u/RubiksCodeNMZ [link] [comments]
  • Open

    Approximating the Manifold Structure of Attributed Incentive Salience from Large Scale Behavioural Data. A Representation Learning Approach Based on Artificial Neural Networks. (arXiv:2108.01724v2 [cs.LG] UPDATED)
    Incentive salience attribution can be understood as a psychobiological mechanism ascribing relevance to potentially rewarding objects and actions. Despite being an important component of the motivational process guiding our everyday behaviour, its study in naturalistic contexts is not straightforward. Here we propose a methodology based on artificial neural networks (ANNs) for approximating latent states produced by this process in situations where large volumes of behavioural data are available but no experimental control is possible. Leveraging knowledge derived from theoretical and computational accounts of incentive salience attribution, we designed an ANN for estimating duration and intensity of future interactions between individuals and a series of video games in a large-scale ($N> 3 \times 10^6$) longitudinal dataset. We found video games to be the ideal context for developing such methodology due to their reliance on reward mechanics and their ability to provide ecologically robust behavioural measures at scale. When compared to competing approaches, our methodology produces representations that are better suited for predicting the intensity of future behaviour and approximating some functional properties of attributed incentive salience. We discuss our findings with reference to the adopted theoretical and computational frameworks and suggest how our methodology could be an initial step for estimating attributed incentive salience in large scale behavioural studies.  ( 2 min )
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v2 [stat.ML] UPDATED)
    We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.  ( 2 min )
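    In symbols, the weight-decay objective described in this abstract can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: $K$ is the network width, $\theta = \{(v_k, w_k, b_k)\}_{k=1}^{K}$ the parameters, $(x_i, y_i)$ the data, and $\lambda$ the regularization weight. The paper may parameterize the network or the penalty differently, for example by leaving the biases unregularized.

```latex
\[
  f_\theta(x) = \sum_{k=1}^{K} v_k \,\max\{0,\; w_k^\top x + b_k\},
  \qquad
  \hat{\theta} \in \arg\min_{\theta}\;
  \sum_{i=1}^{n} \bigl(y_i - f_\theta(x_i)\bigr)^2
  + \lambda \,\|\theta\|_2^2 .
\]
```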
    Characterizing the robustness of Bayesian adaptive experimental designs to active learning bias. (arXiv:2205.13698v1 [stat.ME])
    Bayesian adaptive experimental design is a form of active learning, which chooses samples to maximize the information they give about uncertain parameters. Prior work has shown that other forms of active learning can suffer from active learning bias, where unrepresentative sampling leads to inconsistent parameter estimates. We show that active learning bias can also afflict Bayesian adaptive experimental design, depending on model misspecification. We develop an information-theoretic measure of misspecification, and show that worse misspecification implies more severe active learning bias. At the same time, model classes incorporating more "noise" - i.e., specifying higher inherent variance in observations - suffer less from active learning bias, because their predictive distributions are likely to overlap more with the true distribution. Finally, we show how these insights apply to a (simulated) preference learning experiment.  ( 2 min )
    Towards a Unified Framework for Uncertainty-aware Nonlinear Variable Selection with Theoretical Guarantees. (arXiv:2204.07293v2 [stat.ML] UPDATED)
    We develop a simple and unified framework for nonlinear variable selection that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, neural networks, etc). In particular, for a learned nonlinear model $f(\mathbf{x})$, we consider quantifying the importance of an input variable $\mathbf{x}^j$ using the integrated partial derivative $\Psi_j = \Vert \frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\Vert^2_{P_\mathcal{X}}$. We then (1) provide a principled approach for quantifying variable selection uncertainty by deriving its posterior distribution, and (2) show that the approach is generalizable even to non-differentiable models such as tree ensembles. Rigorous Bayesian nonparametric theorems are derived to guarantee the posterior consistency and asymptotic uncertainty of the proposed approach. Extensive simulations and experiments on healthcare benchmark datasets confirm that the proposed algorithm outperforms existing classic and recent variable selection methods.  ( 2 min )
    Minimax Regret for Cascading Bandits. (arXiv:2203.12577v2 [cs.LG] UPDATED)
    Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.  ( 2 min )
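    The step "small mean implies small variance for Bernoullis" can be spelled out in two lines. The confidence radii below are the textbook one-sided Hoeffding and Bernstein forms for $n$ i.i.d. samples in $[0,1]$ with variance $\sigma^2$ and confidence level $1-\delta$. They are given only to show the scaling and are not the confidence sets constructed in the paper.

```latex
\[
  X \sim \mathrm{Bernoulli}(p)
  \;\Longrightarrow\;
  \operatorname{Var}(X) = p(1-p) \le p .
\]
\[
  r_{\mathrm{Hoeffding}} = \sqrt{\frac{\ln(1/\delta)}{2n}},
  \qquad
  r_{\mathrm{Bernstein}} = \sqrt{\frac{2\sigma^2 \ln(1/\delta)}{n}} + \frac{2\ln(1/\delta)}{3n},
  \qquad
  \sigma^2 = p(1-p) .
\]
```

    When $p$ is small, the leading Bernstein term scales like $\sqrt{p\,\ln(1/\delta)/n}$, whereas the Hoeffding radius does not shrink with $p$ at all.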
    Benign Overparameterization in Membership Inference with Early Stopping. (arXiv:2205.14055v1 [cs.LG])
    Does a neural network's privacy have to be at odds with its accuracy? In this work, we study the effects the number of training epochs and parameters have on a neural network's vulnerability to membership inference (MI) attacks, which aim to extract potentially private information about the training data. We first demonstrate how the number of training epochs and parameters individually induce a privacy-utility trade-off: more of either improves generalization performance at the expense of lower privacy. However, remarkably, we also show that jointly tuning both can eliminate this privacy-utility trade-off. Specifically, with careful tuning of the number of training epochs, more overparameterization can increase model privacy for fixed generalization error. To better understand these phenomena theoretically, we develop a powerful new leave-one-out analysis tool to study the asymptotic behavior of linear classifiers and apply it to characterize the sample-specific loss threshold MI attack in high-dimensional logistic regression. For practitioners, we introduce a low-overhead procedure to estimate MI risk and tune the number of training epochs to guard against MI attacks.  ( 2 min )
    How Tempering Fixes Data Augmentation in Bayesian Neural Networks. (arXiv:2205.13900v1 [cs.LG])
    While Bayesian neural networks (BNNs) provide a sound and principled alternative to standard neural networks, an artificial sharpening of the posterior usually needs to be applied to reach comparable performance. This is in stark contrast to theory, dictating that given an adequate prior and a well-specified model, the untempered Bayesian posterior should achieve optimal performance. Despite the community's extensive efforts, the observed gains in performance still remain disputed with several plausible causes pointing at its origin. While data augmentation has been empirically recognized as one of the main drivers of this effect, a theoretical account of its role, on the other hand, is largely missing. In this work we identify two interlaced factors concurrently influencing the strength of the cold posterior effect, namely the correlated nature of augmentations and the degree of invariance of the employed model to such transformations. By theoretically analyzing simplified settings, we prove that tempering implicitly reduces the misspecification arising from modeling augmentations as i.i.d. data. The temperature mimics the role of the effective sample size, reflecting the gain in information provided by the augmentations. We corroborate our theoretical findings with extensive empirical evaluations, scaling to realistic BNNs. By relying on the framework of group convolutions, we experiment with models of varying inherent degree of invariance, confirming its hypothesized relationship with the optimal temperature.  ( 2 min )
    Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power. (arXiv:2205.13863v1 [cs.LG])
    It is well-known that modern neural networks are vulnerable to adversarial examples. To mitigate this problem, a series of robust learning algorithms have been proposed. However, although the robust training error can be near zero via some methods, all existing algorithms lead to a high robust generalization error. In this paper, we provide a theoretical understanding of this puzzling phenomenon from the perspective of expressive power for deep neural networks. Specifically, for binary classification problems with well-separated data, we show that, for ReLU networks, while mild over-parameterization is sufficient for high robust training accuracy, there exists a constant robust generalization gap unless the size of the neural network is exponential in the data dimension $d$. Even if the data is linearly separable, which means achieving low clean generalization error is easy, we can still prove an $\exp({\Omega}(d))$ lower bound for robust generalization. Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$). Nonetheless, we also have a lower bound that grows exponentially with respect to $k$ -- the curse of dimensionality is inevitable. By demonstrating an exponential separation between the network size for achieving low robust training and generalization error, our results reveal that the hardness of robust generalization may stem from the expressive power of practical models.
    Inference and Sampling for Archimax Copulas. (arXiv:2205.14025v1 [stat.ME])
    Understanding multivariate dependencies in both the bulk and the tails of a distribution is an important problem for many applications, such as ensuring algorithms are robust to observations that are infrequent but have devastating effects. Archimax copulas are a family of distributions endowed with a precise representation that allows simultaneous modeling of the bulk and the tails of a distribution. Rather than separating the two as is typically done in practice, incorporating additional information from the bulk may improve inference of the tails, where observations are limited. Building on the stochastic representation of Archimax copulas, we develop a non-parametric inference method and sampling algorithm. Our proposed methods, to the best of our knowledge, are the first that allow for highly flexible and scalable inference and sampling algorithms, enabling the increased use of Archimax copulas in practical settings. We experimentally compare to state-of-the-art density modeling techniques, and the results suggest that the proposed method effectively extrapolates to the tails while scaling to higher dimensional data. Our findings suggest that the proposed algorithms can be used in a variety of applications where understanding the interplay between the bulk and the tails of a distribution is necessary, such as healthcare and safety.
    Dual Convexified Convolutional Neural Networks. (arXiv:2205.14056v1 [cs.LG])
    We propose the framework of dual convexified convolutional neural networks (DCCNNs). In this framework, we first introduce a primal learning problem motivated from convexified convolutional neural networks (CCNNs), and then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates. Our approach reduces the memory overhead of constructing a large kernel matrix and eliminates the ambiguity of factorizing the matrix. Due to the low-rank structure in CCNNs and the related subdifferential of nuclear norms, there is no closed-form expression to recover the primal solution from the dual solution. To overcome this, we propose a highly novel weight recovery algorithm, which takes the dual solution and the kernel information as the input, and recovers the linear and convolutional weights of a CCNN. Furthermore, our recovery algorithm exploits the low-rank structure and imposes a small number of filters indirectly, which reduces the parameter size. As a result, DCCNNs inherit all the statistical benefits of CCNNs, while enjoying a more formal and efficient workflow.
    Global Convergence of Over-parameterized Deep Equilibrium Models. (arXiv:2205.13814v1 [cs.LG])
    A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection. Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation. The training dynamics of over-parameterized DEQs are investigated in this study. By supposing a condition on the initial equilibrium point, we show that the unique equilibrium point always exists during the training process, and the gradient descent is proved to converge to a globally optimal solution at a linear convergence rate for the quadratic loss function. In order to show that the required initial condition is satisfied via mild over-parameterization, we perform a fine-grained analysis on random DEQs. We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.  ( 2 min )
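    The equilibrium-point idea described above can be illustrated with a minimal sketch. It assumes a layer of the form z = tanh(W z + U x) and uses naive fixed-point iteration as a stand-in for the root-finding the abstract mentions; the activation, the names `deq_forward`, `W`, `U`, and the rescaling of `W` to make the map a contraction are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

def deq_forward(W, U, x, tol=1e-10, max_iter=1000):
    """Approximate the equilibrium z* = tanh(W z* + U x) by fixed-point iteration.

    Hypothetical helper: a real DEQ would typically use a faster root-finder.
    """
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x)  # one weight-tied layer with input injection
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
n, d = 8, 3
W = rng.standard_normal((n, n))
W *= 0.5 / np.linalg.norm(W, 2)  # assumption: spectral norm 0.5, so the map is a contraction
U = rng.standard_normal((n, d))
x = rng.standard_normal(d)

z_star = deq_forward(W, U, x)
# How far z_star is from satisfying the equilibrium equation.
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ x))
```

    Because tanh is 1-Lipschitz and `W` has spectral norm below one here, the iteration converges to a unique equilibrium; without such a condition, existence and uniqueness are not guaranteed by this sketch.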
    Privacy of Noisy Stochastic Gradient Descent: More Iterations without More Privacy Loss. (arXiv:2205.13710v1 [cs.LG])
    A central issue in machine learning is how to train models on sensitive user data. Industry has widely adopted a simple algorithm: Stochastic Gradient Descent with noise (a.k.a. Stochastic Gradient Langevin Dynamics). However, foundational theoretical questions about this algorithm's privacy loss remain open -- even in the seemingly simple setting of smooth convex losses over a bounded domain. Our main result resolves these questions: for a large range of parameters, we characterize the differential privacy up to a constant factor. This result reveals that all previous analyses for this setting have the wrong qualitative behavior. Specifically, while previous privacy analyses increase ad infinitum in the number of iterations, we show that after a small burn-in period, running SGD longer leaks no further privacy. Our analysis departs completely from previous approaches based on fast mixing, instead using techniques based on optimal transport (namely, Privacy Amplification by Iteration) and the Sampled Gaussian Mechanism (namely, Privacy Amplification by Sampling). Our techniques readily extend to other settings, e.g., strongly convex losses, non-uniform stepsizes, arbitrary batch sizes, and random or cyclic choice of batches.
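    For readers unfamiliar with the algorithm under study, here is a rough sketch of noisy projected SGD on a smooth convex loss over a bounded domain. The least-squares loss, the l2-ball domain, and every name and parameter (`noisy_projected_sgd`, `radius`, `sigma`, and so on) are assumptions made for illustration; in particular `sigma` is not calibrated to any privacy guarantee.

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def noisy_projected_sgd(A, b, radius, steps, batch_size, lr, sigma, rng):
    """Minibatch SGD on 0.5 * ||A x - b||^2 / n with Gaussian noise and projection."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # random batch
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        noise = sigma * rng.standard_normal(d)  # illustrative noise scale only
        x = project_l2_ball(x - lr * (grad + noise), radius)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
b = A @ np.ones(5) + 0.1 * rng.standard_normal(200)
x_hat = noisy_projected_sgd(A, b, radius=1.0, steps=300,
                            batch_size=20, lr=0.05, sigma=0.1, rng=rng)
```

    The projection step keeps every iterate inside the bounded domain, which is the setting the abstract refers to.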
    A Sea of Words: An In-Depth Analysis of Anchors for Text Data. (arXiv:2205.13789v1 [stat.ML])
    Anchors [Ribeiro et al. (2018)] is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to explain has similar outputs when they are present in a document. In this paper, we present the first theoretical analysis of Anchors, considering that the search for the best anchor is exhaustive. We leverage this analysis to gain insights on the behavior of Anchors on simple models, including elementary if-then rules and linear classifiers.
    Auditing Differential Privacy in High Dimensions with the Kernel Quantum R\'enyi Divergence. (arXiv:2205.13941v1 [cs.LG])
    Differential privacy (DP) is the de facto standard for private data release and private machine learning. Auditing black-box DP algorithms and mechanisms to certify whether they satisfy a certain DP guarantee is challenging, especially in high dimension. We propose relaxations of differential privacy based on new divergences on probability distributions: the kernel R\'enyi divergence and its regularized version. We show that the regularized kernel R\'enyi divergence can be estimated from samples even in high dimensions, giving rise to auditing procedures for $\varepsilon$-DP, $(\varepsilon,\delta)$-DP and $(\alpha,\varepsilon)$-R\'enyi DP.
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets. (arXiv:2204.07615v2 [cs.LG] UPDATED)
    The best neural architecture for a given machine learning problem depends on many factors: not only on the complexity and structure of the dataset, but also on resource constraints such as latency, compute, and energy consumption. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.  ( 2 min )
    Estimation of Optimal Dynamic Treatment Assignment Rules under Policy Constraints. (arXiv:2106.05031v3 [econ.EM] UPDATED)
    This paper studies statistical decisions for dynamic treatment assignment problems. Many policies involve dynamics in their treatment assignments where treatments are sequentially assigned to individuals across multiple stages and the effect of treatment at each stage is usually heterogeneous with respect to the prior treatments, past outcomes, and observed covariates. We consider estimating an optimal dynamic treatment rule that guides the optimal treatment assignment for each individual at each stage based on the individual's history. This paper proposes an empirical welfare maximization approach in a dynamic framework. The approach estimates the optimal dynamic treatment rule from panel data taken from an experimental or quasi-experimental study. The paper proposes two estimation methods: one solves the treatment assignment problem at each stage through backward induction, and the other solves the whole dynamic treatment assignment problem simultaneously across all stages. We derive finite-sample upper bounds on the worst-case average welfare-regrets for the proposed methods and show $n^{-1/2}$-minimax convergence rates. We also modify the simultaneous estimation method to incorporate intertemporal budget/capacity constraints.  ( 2 min )
    Learning with Stochastic Orders. (arXiv:2205.13684v1 [stat.ML])
    Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, we introduce the Choquet-Toland distance between probability measures, that can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, that encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities and show that they suffer from the curse of dimensionality and propose surrogates via input convex maxout networks (ICMNs), that enjoy parametric rates. Finally, we provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. The code is available at https://github.com/yair-schiff/stochastic-orders-ICMN  ( 2 min )
    Generative Archimedean Copulas. (arXiv:2102.11351v3 [cs.LG] CROSS LISTED)
    We propose a new generative modeling technique for learning multidimensional cumulative distribution functions (CDFs) in the form of copulas. Specifically, we consider certain classes of copulas known as Archimedean and hierarchical Archimedean copulas, popular for their parsimonious representation and ability to model different tail dependencies. We consider their representation as mixture models with Laplace transforms of latent random variables from generative neural networks. This alternative representation allows for computational efficiencies and easy sampling, especially in high dimensions. We describe multiple methods for optimizing the network parameters. Finally, we present empirical results that demonstrate the efficacy of our proposed method in learning multidimensional CDFs and its computational efficiency compared to existing methods.  ( 2 min )
    Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures. (arXiv:2205.13647v1 [cs.LG])
    This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.  ( 2 min )
    Hazard Gradient Penalty for Survival Analysis. (arXiv:2205.13717v1 [cs.LG])
    Survival analysis appears in various fields such as medicine, economics, engineering, and business. Recent studies showed that the Ordinary Differential Equation (ODE) modeling framework unifies many existing survival models while remaining flexible and widely applicable. However, naively applying the ODE framework to survival analysis problems may yield a rapidly changing density function, which may worsen the model's performance. Though we can apply L1 or L2 regularizers to the ODE model, their effect on the ODE modeling framework is barely known. In this paper, we propose the hazard gradient penalty (HGP) to enhance the performance of a survival analysis model. Our method imposes constraints on local data points by regularizing the gradient of the hazard function with respect to the data point. Our method applies to any survival analysis model, including the ODE modeling framework, and is easy to implement. We theoretically show that our method is related to minimizing the KL divergence between the density function at a data point and that of the neighborhood points. Experimental results on three public benchmarks show that our approach outperforms other regularization methods.  ( 2 min )
    Fast variable selection makes scalable Gaussian process BSS-ANOVA a speedy and accurate choice for tabular and time series regression. (arXiv:2205.13676v1 [cs.LG])
    Gaussian processes (GPs) are non-parametric regression engines with a long history. They are often overlooked in modern machine learning contexts because of scalability issues: regression for traditional GP kernels is $\mathcal{O}(N^3)$, where $N$ is the size of the dataset. One of a number of scalable GP approaches is the Karhunen-Lo\`eve (KL) decomposed kernel BSS-ANOVA, developed in 2009. It is $\mathcal{O}(NP)$ in training and $\mathcal{O}(P)$ per point in prediction, where $P$ is the number of terms in the ANOVA / KL expansion. A new method of forward variable selection quickly and effectively limits the number of terms, yielding a method with competitive accuracies, training and inference times for large tabular datasets. The new algorithm balances model fidelity with model complexity using Bayesian and Akaike information criteria (BIC/AIC). The inference speed and accuracy make the method especially useful for modeling dynamic systems in a model-free manner, by modeling the derivative in a dynamic system as a static problem, then integrating the learned dynamics using a high-order scheme. The methods are demonstrated on a 'Susceptible, Infected, Recovered' (SIR) toy problem, with the transmissibility used as forcing function, along with the 'Cascaded Tanks' benchmark dataset. Comparisons on the static prediction of derivatives are made with a Random Forest and Residual Neural Network, while for the time series prediction comparisons are made with LSTM and GRU recurrent neural networks. The GP outperforms the other methods in all modeling tasks on accuracy, while (in the case of the neural networks) performing many orders of magnitude fewer calculations. For the SIR test, which involved prediction for a set of forcing functions qualitatively different from those appearing in the training set, the GP captured the correct dynamics while the neural networks failed to do so.  ( 3 min )
    Learning to Control Linear Systems can be Hard. (arXiv:2205.14035v1 [cs.LG])
    In this paper, we study the statistical difficulty of learning to control linear systems. We focus on two standard benchmarks, the sample complexity of stabilization, and the regret of the online learning of the Linear Quadratic Regulator (LQR). Prior results state that the statistical difficulty for both benchmarks scales polynomially with the system state dimension up to system-theoretic quantities. However, this does not reveal the whole picture. By utilizing minimax lower bounds for both benchmarks, we prove that there exist non-trivial classes of systems for which learning complexity scales dramatically, i.e. exponentially, with the system dimension. This situation arises in the case of underactuated systems, i.e. systems with fewer inputs than states. Such systems are structurally difficult to control and their system theoretic quantities can scale exponentially with the system dimension dominating learning complexity. Under some additional structural assumptions (bounding systems away from uncontrollability), we provide qualitatively matching upper bounds. We prove that learning complexity can be at most exponential with the controllability index of the system, that is the degree of underactuation.  ( 2 min )
    Unequal Covariance Awareness for Fisher Discriminant Analysis and Its Variants in Classification. (arXiv:2205.13565v1 [cs.LG])
    Fisher Discriminant Analysis (FDA) is one of the essential tools for feature extraction and classification. In addition, it motivates the development of many improved techniques based on the FDA to adapt to different problems or data types. However, none of these approaches make use of the fact that the assumption of equal covariance matrices in FDA is usually not satisfied in practical situations. Therefore, we propose a novel classification rule for the FDA that accounts for this fact, mitigating the effect of unequal covariance matrices in the FDA. Furthermore, since we only modify the classification rule, the same can be applied to many FDA variants, improving these algorithms further. Theoretical analysis reveals that the new classification rule allows the implicit use of the class covariance matrices while increasing the number of parameters to be estimated by a small amount compared to going from FDA to Quadratic Discriminant Analysis. We illustrate our idea via experiments, which show the superior performance of the modified algorithms based on our new classification rule compared to the original ones.  ( 2 min )
    A gradient estimator via L1-randomization for online zero-order optimization with two point feedback. (arXiv:2205.13910v1 [math.ST])
    This work studies online zero-order optimization of convex and Lipschitz functions. We present a novel gradient estimator based on two function evaluations and randomization on the $\ell_1$-sphere. Considering different geometries of feasible sets and Lipschitz assumptions, we analyse the online mirror descent algorithm with our estimator in place of the usual gradient. We consider two types of assumptions on the noise of the zero-order oracle: canceling noise and adversarial noise. We provide an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem. In the case of canceling noise that was previously studied in the literature, our guarantees are either comparable to or better than state-of-the-art bounds obtained by Duchi et al. (2015) and Shamir (2017) for non-adaptive algorithms. Our analysis is based on deriving a new Poincar\'e type inequality for the uniform measure on the $\ell_1$-sphere with explicit constants, which may be of independent interest.  ( 2 min )
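    A two-point estimator with $\ell_1$-sphere randomization can be sketched as follows. The form used here, $\hat g = \frac{d}{2h}\,(f(x+h\zeta) - f(x-h\zeta))\,\mathrm{sign}(\zeta)$ with $\zeta$ uniform on the unit $\ell_1$-sphere, is one natural instance and is an assumption on our part; the paper's exact estimator and constants may differ. The helper names `sample_l1_sphere` and `l1_two_point_gradient` are hypothetical.

```python
import numpy as np

def sample_l1_sphere(d, rng):
    """Draw a point uniformly from the unit l1-sphere by normalizing Laplace variables."""
    z = rng.laplace(size=d)
    return z / np.abs(z).sum()

def l1_two_point_gradient(f, x, h, rng):
    """Gradient estimate from two evaluations of f along a random l1-sphere direction."""
    d = x.shape[0]
    zeta = sample_l1_sphere(d, rng)
    return d / (2.0 * h) * (f(x + h * zeta) - f(x - h * zeta)) * np.sign(zeta)

# Sanity check on a linear function, where this form of the estimator is unbiased.
rng = np.random.default_rng(0)
a = np.array([1.0, -2.0, 0.5])
f = lambda v: a @ v
x0 = np.zeros(3)
estimates = np.array([l1_two_point_gradient(f, x0, 0.1, rng) for _ in range(50000)])
mean_estimate = estimates.mean(axis=0)  # should be close to a
```

    For a linear $f(v) = a^\top v$, the expectation of the estimate equals $a$ because $\mathbb{E}|\zeta_i| = 1/d$ on the unit $\ell_1$-sphere, so averaging many draws recovers the gradient.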
    HOUDINI: Escaping from Moderately Constrained Saddles. (arXiv:2205.13753v1 [cs.LG])
    We give the first polynomial time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f \colon \mathbb R^d \to \mathbb R$ we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. This constitutes the first tangible progress (without reliance on NP-oracles or altering the definitions to only account for certain constraints) on the main open question of the breakthrough work of Ge et al. who showed an analogous result for unconstrained and equality-constrained problems. Our results hold for both regular and stochastic gradient descent.  ( 2 min )
    Efficient Approximation of Gromov-Wasserstein Distance using Importance Sparsification. (arXiv:2205.13573v1 [cs.LG])
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for the matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to its high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called Spar-GW, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. We demonstrate that the proposed Spar-GW method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(n^{2+\delta})$ for an arbitrarily small $\delta>0$. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our Spar-GW to state-of-the-art methods in both synthetic and real-world tasks.  ( 2 min )
    Combining observational datasets from multiple environments to detect hidden confounding. (arXiv:2205.13935v1 [stat.ME])
    A common assumption in causal inference from observational data is the assumption of no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. However, under the assumption of independent causal mechanisms underlying the data generative process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only violated during hidden confounding and examine cases where we break its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies.  ( 2 min )
    Understanding new tasks through the lens of training data via exponential tilting. (arXiv:2205.13577v1 [cs.LG])
    Deploying machine learning models to new tasks is a major challenge despite the large size of modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighting the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn train data importance weights minimizing the KL divergence between labeled train and unlabeled target datasets. The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on Waterbirds and Breeds benchmarks.  ( 2 min )
    Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization. (arXiv:2112.08217v2 [stat.ML] UPDATED)
    Generative networks are often trained to minimize a statistical divergence between the reference distribution and the generative one in an adversarial setting. Some works instead trained generative networks to minimize Scoring Rules, functions assessing how well the generative distribution matches each training sample individually. We show how the Scoring Rule formulation easily extends to the so-called prequential (predictive-sequential) score, whose minimization allows performing probabilistic forecasting with generative networks. This objective leads to adversarial-free training, therefore easily avoiding uncertainty underestimation due to mode collapse, which is a common issue in the adversarial setting and undesirable for probabilistic forecasting. We provide consistency guarantees for the minimizer of the prequential score and employ that to perform probabilistic forecasting for two chaotic dynamical models and a benchmark dataset of global weather observations. For this last example, we define scoring rules for spatial data by drawing from the relevant literature, with which we obtain better uncertainty quantification with little hyperparameter tuning compared to adversarial training.  ( 2 min )
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v2 [cs.LG] UPDATED)
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet only a limited number of works are dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem through the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite-context assumptions, leaving behind the main question: Can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.  ( 2 min )
    Meta-Learning Adversarial Bandits. (arXiv:2205.14128v1 [cs.LG])
    We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a unified meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-algorithm tunes the initialization, step-size, and entropy parameter of the Tsallis-entropy generalization of the well-known Exp3 method, with the task-averaged regret provably improving if the entropy of the distribution over estimated optima-in-hindsight is small. For BLO, we learn the initialization, step-size, and boundary-offset of online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with a measure induced by these functions on the interior of the action space. Our adaptive guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.  ( 2 min )
    Comparing two samples through stochastic dominance: a graphical approach. (arXiv:2203.07889v2 [stat.ML] UPDATED)
    Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples in which unpredictable outcomes are common. These measures can be modeled as random variables and compared among each other via their expected values or more sophisticated tools such as null hypothesis statistical tests. In this paper, we propose an alternative framework to visually compare two samples according to their estimated cumulative distribution functions. First, we introduce a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other one. Then, we present a graphical method that decomposes in quantiles i) the proposed dominance measure and ii) the probability that one of the random variables takes lower values than the other. For illustrative purposes, we re-evaluate the experimentation of an already published work with the proposed methodology and we show that additional conclusions (missed by the rest of the methods) can be inferred. Additionally, the software package RVCompare was created as a convenient way of applying and experimenting with the proposed framework.  ( 2 min )
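    Two of the basic quantities mentioned above, the estimated cumulative distribution function and the probability that one random variable takes lower values than the other, can be estimated from samples as in the rough sketch below. This is not the RVCompare package and does not reproduce the paper's dominance measure; the function names `ecdf` and `prob_lower`, and the choice to count ties as one half, are assumptions made for illustration.

```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at t: fraction of observations <= t."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, t, side="right") / sample.size

def prob_lower(x, y):
    """Estimate P(X < Y) over all sample pairs, counting ties as 1/2 (an assumption)."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return np.mean(x < y) + 0.5 * np.mean(x == y)

# Hypothetical scores from two stochastic algorithms (lower is better).
scores_a = [1.0, 2.0, 3.0]
scores_b = [4.0, 5.0, 6.0]
p_a_lower = prob_lower(scores_a, scores_b)
```

    Comparing `ecdf(scores_a, t)` with `ecdf(scores_b, t)` over a grid of `t` values gives a pointwise view of where one estimated distribution lies above the other.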
    Surrogate modeling for Bayesian optimization beyond a single Gaussian process. (arXiv:2205.14090v1 [stat.ML])
    Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics. BO hinges on a Bayesian surrogate model to sequentially select query points so as to balance exploration with exploitation of the search space. Most existing works rely on a single Gaussian process (GP) based surrogate model, where the kernel function form is typically preselected using domain knowledge. To bypass such a design process, this paper leverages an ensemble (E) of GPs to adaptively select the surrogate model fit on-the-fly, yielding a GP mixture posterior with enhanced expressiveness for the sought function. Acquisition of the next evaluation input using this EGP-based function posterior is then enabled by Thompson sampling (TS) that requires no additional design parameters. To endow function sampling with scalability, random feature-based kernel approximation is leveraged per GP model. The novel EGP-TS readily accommodates parallel operation. To further establish convergence of the proposed EGP-TS to the global optimum, analysis is conducted based on the notion of Bayesian regret for both sequential and parallel settings. Tests on synthetic functions and real-world applications showcase the merits of the proposed method.  ( 2 min )
    A Unified Analysis of Federated Learning with Arbitrary Client Participation. (arXiv:2205.13648v1 [cs.LG])
    Federated learning (FL) faces challenges of intermittent client availability and computation/communication efficiency. As a result, only a small subset of clients can participate in FL at a given time. It is important to understand how partial client participation affects convergence, but most existing works have either considered idealized participation patterns or obtained results with non-zero optimality error for generic patterns. In this paper, we provide a unified convergence analysis for FL with arbitrary client participation. We first introduce a generalized version of federated averaging (FedAvg) that amplifies parameter updates at an interval of multiple FL rounds. Then, we present a novel analysis that captures the effect of client participation in a single term. By analyzing this term, we obtain convergence upper bounds for a wide range of participation patterns, including both non-stochastic and stochastic cases, which match either the lower bound of stochastic gradient descent (SGD) or the state-of-the-art results in specific settings. We also discuss various insights, recommendations, and experimental results.  ( 2 min )
    DP-PCA: Statistically Optimal and Differentially Private PCA. (arXiv:2205.13709v1 [cs.LG])
    We study the canonical statistical task of computing the principal component from $n$ i.i.d. data points in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in the literature, existing solutions fall short on two key aspects: ($i$) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n=\Omega(d^{3/2})$, to obtain non-trivial results while non-private PCA requires only $n=O(d)$, and ($ii$) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, which is a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on private mean estimation, which adds the minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n=\tilde O(d)$. Furthermore, we provide a lower bound showing that a sub-Gaussian-style assumption is necessary for obtaining the optimal error rate.
    Evolution of beliefs in social networks. (arXiv:2205.13587v1 [cs.LG])
    The evolution of beliefs in a society is a product of interactions between people (horizontal transmission) in the society over generations (vertical transmission). Researchers have studied both horizontal and vertical transmission separately. Extending prior work, we propose a new theoretical framework which allows application of tools from Markov chain theory to the analysis of belief evolution via horizontal and vertical transmission. We analyze three cases: static network, randomly changing network, and homophily-based dynamic network. Whereas the former two assume network structure is independent of beliefs, the latter assumes that people tend to communicate with those who have similar beliefs. We prove under general conditions that both static and randomly changing networks converge to a single set of beliefs among all individuals, and we establish the rate of convergence. We prove that homophily-based network structures do not in general converge to a single set of beliefs shared by all and prove lower bounds on the number of different limiting beliefs as a function of initial beliefs. We conclude by discussing implications for prior theories and directions for future work.
    Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap. (arXiv:2203.13457v2 [cs.LG] UPDATED)
    Recently, contrastive learning has risen to be a promising approach for large-scale self-supervised learning. However, theoretical understanding of how it works is still unclear. In this paper, we propose a new guarantee on the downstream performance without resorting to the conditional independence assumption that is widely adopted in previous work but hardly holds in practice. Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations, thus simply aligning the positive samples (augmented views of the same sample) could make contrastive learning cluster intra-class samples together. Based on this augmentation overlap perspective, theoretically, we obtain asymptotically closed bounds for downstream performance under weaker assumptions, and empirically, we propose an unsupervised model selection metric ARC that aligns well with downstream accuracy. Our theory suggests an alternative understanding of contrastive learning: the role of aligning positive samples is more like a surrogate task than an ultimate goal, and the overlapped augmented views (i.e., the chaos) create a ladder for contrastive learning to gradually learn class-separated representations. The code for computing ARC is available at https://github.com/zhangq327/ARC.
    Topological Hidden Markov Models. (arXiv:2205.13608v1 [stat.ME])
    The hidden Markov model (HMM) is a classic modeling tool with a wide swath of applications. Its inception considered observations restricted to a finite alphabet, but it was quickly extended to multivariate continuous distributions. In this article, we further extend the HMM from mixtures of normal distributions in $d$-dimensional Euclidean space to general Gaussian measure mixtures in locally convex topological spaces. The main innovation is the use of the Onsager-Machlup functional as a proxy for the probability density function in infinite dimensional spaces. This allows for choice of a Cameron-Martin space suitable for a given application. We demonstrate the versatility of this methodology by applying it to simulated diffusion processes such as Brownian and fractional Brownian sample paths as well as the Ornstein-Uhlenbeck process. Our methodology is applied to the identification of sleep states from overnight polysomnography time series data with the aim of diagnosing Obstructive Sleep Apnea in pediatric patients. It is also applied to a series of annual cumulative snowfall curves from 1940 to 1990 in the city of Edmonton, Alberta.
    Finite mixture of skewed sub-Gaussian stable distributions. (arXiv:2205.14067v1 [stat.ME])
    We propose the finite mixture of skewed sub-Gaussian stable distributions. The maximum likelihood estimator for the parameters of the proposed finite mixture model is computed through the expectation-maximization algorithm. The proposed model contains the finite mixture of normal and skewed normal distributions. Since the tails of the proposed model are heavier than even those of the Student's t distribution, it can be used as a powerful model for robust model-based clustering. Performance of the proposed model is demonstrated by clustering simulated data and two sets of real data.
    An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification. (arXiv:2112.13236v3 [cs.CR] UPDATED)
    Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features by machine and deep learning models for malware classification as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, the transformer-based models process sequences as a whole and learn relationships between API calls due to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the transformer model with one transformer block layer surpassed the widely used base architecture, LSTM. Moreover, BERT or CANINE, pre-trained transformer models, outperformed in classifying highly imbalanced malware families according to evaluation metrics, F1-score, and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE, has reached the state-of-the-art evaluation scores on three out of four datasets, particularly a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets.
    Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. (arXiv:2205.13571v1 [cs.LG])
    Neural networks have achieved tremendous success in a large variety of applications. However, their memory footprint and computational demand can render them impractical in application settings with limited hardware or energy resources. In this work, we propose a novel algorithm to find efficient low-rank subnetworks. Remarkably, these subnetworks are determined and adapted already during the training phase and the overall time and memory resources required by both training and evaluating them is significantly reduced. The main idea is to restrict the weight matrices to a low-rank manifold and to update the low-rank factors rather than the full matrix during training. To derive training updates that are restricted to the prescribed manifold, we employ techniques from dynamic model order reduction for matrix differential equations. Moreover, our method automatically and dynamically adapts the ranks during training to achieve a desired approximation accuracy. The efficiency of the proposed method is demonstrated through a variety of numerical experiments on fully-connected and convolutional networks.
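    The core idea above, storing and updating low-rank factors instead of a full weight matrix, can be illustrated with a small sketch. It assumes a factorization of the form $W \approx U S V^\top$ and only shows the forward pass and the parameter savings; it is not the paper's training algorithm or rank-adaptation scheme, and the function names are hypothetical.

```python
import numpy as np

def low_rank_forward(u, s, v, x):
    """Apply W = U S V^T to x without ever forming the dense matrix W."""
    return u @ (s @ (v.T @ x))

def parameter_counts(n_out, n_in, rank):
    """Return (dense, factored) parameter counts for an n_out x n_in layer."""
    dense = n_out * n_in
    factored = n_out * rank + rank * rank + n_in * rank
    return dense, factored

# Hypothetical layer sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_out, n_in, rank = 64, 128, 4
u = rng.standard_normal((n_out, rank))
s = rng.standard_normal((rank, rank))
v = rng.standard_normal((n_in, rank))
x = rng.standard_normal(n_in)

dense_w = u @ s @ v.T  # formed here only to check the factored product
y_factored = low_rank_forward(u, s, v, x)
dense_params, factored_params = parameter_counts(n_out, n_in, rank)
```

    For this toy layer the factored form stores 784 numbers instead of 8192, and the matrix-vector product costs a number of multiplications proportional to the factored count rather than the dense one.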
    MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models. (arXiv:2205.13869v1 [cs.LG])
    State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm is unaware of the causal discovery step. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph prior as an inductive bias. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
    Average Adjusted Association: Efficient Estimation with High Dimensional Confounders. (arXiv:2205.14048v1 [stat.ME])
    The log odds ratio is a common parameter to measure association between (binary) outcome and exposure variables. Much attention has been paid to its parametric but robust estimation, or its nonparametric estimation as a function of confounders. However, discussion on how to use a summary statistic by averaging the log odds ratio function is surprisingly difficult to find despite the popularity and importance of averaging in other contexts such as estimating the average treatment effect. We propose a couple of efficient double/debiased machine learning (DML) estimators of the average log odds ratio, where the odds ratios are adjusted for observed (potentially high dimensional) confounders and are averaged over them. The estimators are built from two equivalent forms of the efficient influence function. The first estimator uses a prospective probability of the outcome conditional on the exposure and confounders; the second one employs a retrospective probability of the exposure conditional on the outcome and confounders. Our framework encompasses random sampling as well as outcome-based or exposure-based sampling. Finally, we illustrate how to apply the proposed estimators using real data.
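    As a concrete illustration of the quantities named above, the sketch below computes the log odds ratio within discrete confounder strata in both the prospective form (outcome given exposure) and the retrospective form (exposure given outcome), then averages over strata. This is a plain plug-in calculation on hypothetical count tables, assumed here for illustration; it is not the DML estimator proposed in the paper.

```python
import math

def prospective_log_odds_ratio(counts):
    """counts[t][y] = number of units with exposure t and outcome y (each 0 or 1)."""
    odds_exposed = counts[1][1] / counts[1][0]    # P(Y=1|T=1) / P(Y=0|T=1)
    odds_unexposed = counts[0][1] / counts[0][0]  # P(Y=1|T=0) / P(Y=0|T=0)
    return math.log(odds_exposed / odds_unexposed)

def retrospective_log_odds_ratio(counts):
    """Same quantity computed from the odds of exposure given the outcome."""
    odds_cases = counts[1][1] / counts[0][1]      # P(T=1|Y=1) / P(T=0|Y=1)
    odds_controls = counts[1][0] / counts[0][0]   # P(T=1|Y=0) / P(T=0|Y=0)
    return math.log(odds_cases / odds_controls)

def average_log_odds_ratio(strata):
    """Average the stratum-specific log odds ratios, weighted by stratum size."""
    sizes = [sum(sum(row) for row in counts) for counts in strata]
    total = sum(sizes)
    return sum(n / total * prospective_log_odds_ratio(c) for n, c in zip(sizes, strata))

# Two hypothetical confounder strata (made-up counts).
strata = [
    [[40, 10], [20, 30]],
    [[30, 20], [10, 40]],
]
avg_lor = average_log_odds_ratio(strata)
```

    Both forms reduce to the same cross-product ratio of the 2x2 table, which is why the odds ratio can be estimated under outcome-based as well as exposure-based sampling.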
    Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms. (arXiv:2107.03863v3 [stat.ML] UPDATED)
    Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure for such models is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel Snakemake workflow, called Benchpress, for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON-file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BDgraph, BiDAG, bnlearn, gCastle, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the workflow also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking study. We demonstrate the applicability of this workflow for learning Bayesian networks in five typical data scenarios. The source code and documentation are publicly available from this http URL
    Explaining Preferences with Shapley Values. (arXiv:2205.13662v1 [stat.ML])
    While preference modelling is becoming one of the pillars of machine learning, the problem of preference explanation remains challenging and underexplored. In this paper, we propose Pref-SHAP, a Shapley value-based model explanation framework for pairwise comparison data. We derive the appropriate value functions for preference models and further extend the framework to model and explain context-specific information, such as the surface type in a tennis game. To demonstrate the utility of Pref-SHAP, we apply our method to a variety of synthetic and real-world datasets and show that richer and more insightful explanations can be obtained over the baseline.
    A Multilabel Classification Framework for Approximate Nearest Neighbor Search. (arXiv:1910.08322v4 [cs.LG] UPDATED)
    Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.
    Asymptotic Convergence Rate and Statistical Inference for Stochastic Sequential Quadratic Programming. (arXiv:2205.13687v1 [math.OC])
    We apply a stochastic sequential quadratic programming (StoSQP) algorithm to solve constrained nonlinear optimization problems, where the objective is stochastic and the constraints are deterministic. We study a fully stochastic setup, where only a single sample is available in each iteration for estimating the gradient and Hessian of the objective. We allow StoSQP to select a random stepsize $\bar{\alpha}_t$ adaptively, such that $\beta_t\leq \bar{\alpha}_t \leq \beta_t+\chi_t$, where $\beta_t$, $\chi_t=o(\beta_t)$ are prespecified deterministic sequences. We also allow StoSQP to solve Newton system inexactly via randomized iterative solvers, e.g., with the sketch-and-project method; and we do not require the approximation error of inexact Newton direction to vanish. For this general StoSQP framework, we establish the asymptotic convergence rate for its last iterate, with the worst-case iteration complexity as a byproduct; and we perform statistical inference. In particular, with proper decaying $\beta_t,\chi_t$, we show that: (i) the StoSQP scheme can take at most $O(1/\epsilon^4)$ iterations to achieve $\epsilon$-stationarity; (ii) asymptotically and almost surely, $\|(x_t -x^\star, \lambda_t - \lambda^\star)\| = O(\sqrt{\beta_t\log(1/\beta_t)})+O(\chi_t/\beta_t)$, where $(x_t,\lambda_t)$ is the primal-dual StoSQP iterate; (iii) the sequence $1/\sqrt{\beta_t}\cdot (x_t -x^\star, \lambda_t - \lambda^\star)$ converges to a mean zero Gaussian distribution with a nontrivial covariance matrix. Moreover, we establish the Berry-Esseen bound for $(x_t, \lambda_t)$ to measure quantitatively the convergence of its distribution function. We also provide a practical estimator for the covariance matrix, from which the confidence intervals of $(x^\star, \lambda^\star)$ can be constructed using iterates $\{(x_t,\lambda_t)\}_t$. Our theorems are validated using nonlinear problems in CUTEst test set.
    Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes. (arXiv:2205.13589v1 [cs.LG])
    We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
    Error Bound of Empirical $\ell_2$ Risk Minimization for Noisy Standard and Generalized Phase Retrieval Problems. (arXiv:2205.13827v1 [stat.ML])
    A noisy generalized phase retrieval (NGPR) problem refers to a problem of estimating $x_0 \in \mathbb{C}^d$ from noisy quadratic samples $\big\{x_0^*A_kx_0+\eta_k\big\}_{k=1}^n$ where $A_k$ is a Hermitian matrix and $\eta_k$ is a noise scalar. When $A_k=\alpha_k\alpha_k^*$ for some $\alpha_k\in\mathbb{C}^d$, it reduces to a standard noisy phase retrieval (NPR) problem. The main aim of this paper is to study the estimation performance of empirical $\ell_2$ risk minimization in both problems when $A_k$ in NGPR, or $\alpha_k$ in NPR, is drawn from a sub-Gaussian distribution. Under different kinds of noise patterns, we establish error bounds that can imply approximate reconstruction and these results are new in the literature. In NGPR, we show the bounds are of $O\big(\frac{\|\eta\|}{\sqrt{n}}\big)$ and $O\big(\|\eta\|_\infty \sqrt{\frac{d}{n}}\big)$ for general noise, and of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for random noise with sub-Gaussian and sub-exponential tail respectively, where $\|\eta\|$ and $\|\eta\|_{\infty}$ are the 2-norm and sup-norm of the noise vector of $\eta_k$. Under heavy-tailed noise, by truncating response outliers we propose a robust estimator that possesses an error bound with slower convergence rate. On the other hand, in NPR we obtain bounds of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for sub-Gaussian and sub-exponential noise respectively, which are essentially tighter than the existing bound $O\big(\frac{\|\eta\|_2}{\sqrt{n}}\big)$. Although NGPR involving measurement matrix $A_k$ is more computationally demanding than NPR involving measurement vector $\alpha_k$, our results reveal that NGPR exhibits stronger robustness than NPR under biased and deterministic noise. Experimental results are presented to confirm and demonstrate our theoretical findings.
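    The measurement model and the empirical $\ell_2$ risk described above can be written down directly. The sketch below generates samples $y_k = x_0^* A_k x_0 + \eta_k$ with Hermitian Gaussian matrices (one convenient sub-Gaussian choice, assumed here) and evaluates the empirical risk of a candidate $x$. It does not implement a minimizer, and the function names and problem sizes are illustrative assumptions.

```python
import numpy as np

def random_hermitian(d, rng):
    """Draw a d x d Hermitian matrix with Gaussian entries (an assumed sub-Gaussian choice)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (g + g.conj().T) / 2.0

def quadratic_measurements(x, mats):
    """Return the real numbers x^* A_k x for each Hermitian A_k."""
    return np.array([np.real(np.vdot(x, a @ x)) for a in mats])

def empirical_l2_risk(x, mats, y):
    """Mean squared residual between x^* A_k x and the observed samples y_k."""
    return float(np.mean((quadratic_measurements(x, mats) - y) ** 2))

# Hypothetical problem sizes and noise level, for illustration only.
rng = np.random.default_rng(0)
d, n = 4, 50
x0 = rng.standard_normal(d) + 1j * rng.standard_normal(d)
mats = [random_hermitian(d, rng) for _ in range(n)]
eta = 0.1 * rng.standard_normal(n)
y_clean = quadratic_measurements(x0, mats)
y_noisy = y_clean + eta

risk_at_truth = empirical_l2_risk(x0, mats, y_noisy)        # equals mean(eta**2)
risk_rotated = empirical_l2_risk(np.exp(0.7j) * x0, mats, y_noisy)
```

    The last two values agree because the quadratic form is invariant to a global phase, which is why recovery in phase retrieval is only meaningful up to a unit-modulus factor.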
  • Open

    Comparing two samples through stochastic dominance: a graphical approach. (arXiv:2203.07889v2 [stat.ML] UPDATED)
    Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples in which unpredictable outcomes are common. These measures can be modeled as random variables and compared among each other via their expected values or more sophisticated tools such as null hypothesis statistical tests. In this paper, we propose an alternative framework to visually compare two samples according to their estimated cumulative distribution functions. First, we introduce a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other one. Then, we present a graphical method that decomposes in quantiles i) the proposed dominance measure and ii) the probability that one of the random variables takes lower values than the other. For illustrative purposes, we re-evaluate the experimentation of an already published work with the proposed methodology and we show that additional conclusions (missed by the rest of the methods) can be inferred. Additionally, the software package RVCompare was created as a convenient way of applying and experimenting with the proposed framework.
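    To make the two compared quantities more tangible, here is a rough sketch based on empirical cumulative distribution functions. It estimates the probability that one sample takes lower values than the other and, as a crude stand-in, the fraction of pooled observation points at which one empirical CDF lies on or above the other. The second quantity is an assumption made for illustration and is not the paper's dominance measure; none of these function names come from RVCompare.

```python
import numpy as np

def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(ordered, grid, side="right") / len(ordered)

def prob_lower(x, y):
    """Estimate P(X < Y) + 0.5 * P(X = Y) from all pairs of observations."""
    xs = np.asarray(x, dtype=float)[:, None]
    ys = np.asarray(y, dtype=float)[None, :]
    return float(np.mean(xs < ys) + 0.5 * np.mean(xs == ys))

def fraction_cdf_above(x, y):
    """Fraction of pooled observation points where the ECDF of x is >= the ECDF of y."""
    grid = np.union1d(x, y)
    return float(np.mean(ecdf(x, grid) >= ecdf(y, grid)))

# Two hypothetical samples where x is clearly the lower one.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
p_xy = prob_lower(x, y)
frac_xy = fraction_cdf_above(x, y)
frac_yx = fraction_cdf_above(y, x)
```

    A sample whose empirical CDF is everywhere on or above the other's takes lower values in the first-order stochastic dominance sense, which is the kind of relationship the graphical method is designed to expose quantile by quantile.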
    Seeing Differently, Acting Similarly: Heterogeneously Observable Imitation Learning. (arXiv:2106.09256v3 [cs.LG] UPDATED)
    In many real-world imitation learning tasks, the demonstrator and the learner have to act under totally different observation spaces. This situation brings significant obstacles to existing imitation learning approaches, since most of them learn policies under homogeneous observation spaces. On the other hand, previous studies under different observation spaces have strong assumptions that these two observation spaces coexist during the entire learning process. However, in reality, the observation coexistence will be limited due to the high cost of acquiring expert observations. In this work, we study this challenging problem with limited observation coexistence under heterogeneous observations: Heterogeneously Observable Imitation Learning (HOIL). We identify two underlying issues in HOIL, i.e., the dynamics mismatch and the support mismatch, and further propose the Importance Weighting with REjection (IWRE) algorithm based on importance-weighting and learning with rejection to solve HOIL problems. Experimental results show that IWRE can successfully solve various HOIL tasks, including the challenging tasks of transforming the vision-based demonstrations to random access memory (RAM)-based policies in the Atari domain, even with limited visual observations.
    Faster Optimization on Sparse Graphs via Neural Reparametrization. (arXiv:2205.13624v1 [cs.LG])
    In mathematical optimization, second-order Newton's methods generally converge faster than first-order methods, but they require the inverse of the Hessian, hence are computationally expensive. However, we discover that on sparse graphs, graph neural networks (GNN) can implement an efficient Quasi-Newton method that can speed up optimization by a factor of 10-100x. Our method, neural reparametrization, modifies the optimization parameters as the output of a GNN to reshape the optimization landscape. Using a precomputed Hessian as the propagation rule, the GNN can effectively utilize the second-order information, reaching a similar effect as adaptive gradient methods. As our method solves optimization through architecture design, it can be used in conjunction with any optimizer, such as Adam and RMSProp. We show the application of our method on scientifically relevant problems including heat diffusion, synchronization and persistent homology.
    Notes on Generalizing the Maximum Entropy Principle to Uncertain Data. (arXiv:2109.04530v2 [cs.IT] UPDATED)
    The principle of maximum entropy is a broadly applicable technique for computing a distribution with the least amount of information possible constrained to match empirical data, for instance, feature expectations. We seek to generalize this principle to scenarios where the empirical feature expectations cannot be computed because the model variables are only partially observed, which introduces a dependency on the learned model. Generalizing the principle of latent maximum entropy, we introduce uncertain maximum entropy and describe an expectation-maximization based solution to approximately solve these problems. We show that our technique additionally generalizes the principle of maximum entropy. We additionally discuss the use of black box classifiers with our technique, which simplifies the process of utilizing sparse, large data sets.
    Characterizing Parametric and Convergence Stability in Nonconvex and Nonsmooth Optimizations: A Geometric Approach. (arXiv:2204.01643v2 [cs.GT] UPDATED)
    We consider stability issues in minimizing a continuous (probably parameterized, nonconvex and nonsmooth) real-valued function $f$. We call a point stationary if all its possible directional derivatives are nonnegative. In this work, we focus on two notions of stability on stationary points of $f$: parametric stability and convergence stability. Parametric considerations are widely studied in various fields, including smoothed analysis, numerical stability, condition numbers and sensitivity analysis for linear programming. Parametric stability asks whether minor perturbations on parameters lead to dramatic changes in the position and $f$ value of a stationary point. Meanwhile, convergence stability indicates a non-escapable solution: Any point sequence iteratively produced by an optimization algorithm cannot escape from a neighborhood of a stationary point but gets close to it in the sense that such stationary points are stable to the precision parameter and algorithmic numerical errors. It turns out that these notions have deep connections to geometry theory. We show that parametric stability is linked to deformations of graphs of functions. On the other hand, convergence stability is concerned with area partitioning of the function domain. Utilizing these connections, we prove quite tight conditions of these two stability notions for a wide range of functions and optimization algorithms with small enough step sizes and precision parameters. These conditions are subtle in the sense that a slightly weaker function requirement goes to the opposite of primitive intuitions and leads to wrong conclusions. We present three applications of this theory. These applications reveal some understanding on Nash equilibrium computation, nonconvex and nonsmooth optimization, as well as the new optimization methodology of deep neural networks.
    Waiting but not Aging: Optimizing Information Freshness Under the Pull Model. (arXiv:1912.08722v4 [cs.NI] UPDATED)
    The Age-of-Information is an important metric for investigating the timeliness performance in information-update systems. In this paper, we study the AoI minimization problem under a new Pull model with replication schemes, where a user proactively sends a replicated request to multiple servers to "pull" the information of interest. Interestingly, we find that under this new Pull model, replication schemes capture a novel tradeoff between different values of the AoI across the servers (due to the random updating processes) and different response times across the servers, which can be exploited to minimize the expected AoI at the user's side. Specifically, assuming Poisson updating process for the servers and exponentially distributed response time, we derive a closed-form formula for computing the expected AoI and obtain the optimal number of responses to wait for to minimize the expected AoI. Then, we extend our analysis to the setting where the user aims to maximize the AoI-based utility, which represents the user's satisfaction level with respect to the freshness of the received information. Furthermore, we consider a more realistic scenario where the user has no prior knowledge of the system. In this case, we reformulate the utility maximization problem as a stochastic Multi-Armed Bandit problem with side observations and leverage a special linear structure of side observations to design learning algorithms with improved performance guarantees. Finally, we conduct extensive simulations to elucidate our theoretical results and compare the performance of different algorithms. Our findings reveal that under the Pull model, waiting does not necessarily lead to aging; waiting for more than one response can often significantly reduce the AoI and improve the AoI-based utility in most scenarios.
    C$^2$SP-Net: Joint Compression and Classification Network for Epilepsy Seizure Prediction. (arXiv:2110.13674v2 [cs.LG] UPDATED)
    Recent developments in brain-machine interface technology have made seizure prediction possible. However, the communication of large volumes of electrophysiological signals between sensors and processing apparatus and the related computation become two major bottlenecks for seizure prediction systems due to the constrained bandwidth and limited computation resources, especially for wearable and implantable medical devices. Although compressive sensing (CS) can be adopted to compress the signals to reduce the communication bandwidth requirement, it needs a complex reconstruction procedure before the signal can be used for seizure prediction. In this paper, we propose C$^2$SP-Net to jointly solve compression, prediction, and reconstruction with a single neural network. A plug-and-play in-sensor compression matrix is constructed to reduce the transmission bandwidth requirement. The compressed signal can be used for seizure prediction without additional reconstruction steps. Reconstruction of the original signal can also be carried out in high fidelity. Prediction accuracy, sensitivity, false prediction rate, and reconstruction quality of the proposed framework are evaluated under various compression ratios. The experimental results illustrate that our model outperforms the competitive state-of-the-art baselines by a large margin in prediction accuracy. In particular, our proposed method produces an average loss of 0.35% in prediction accuracy with a compression ratio ranging from 1/2 to 1/16.
    Evaluating the Robustness of Deep Reinforcement Learning for Autonomous and Adversarial Policies in a Multi-agent Urban Driving Environment. (arXiv:2112.11947v2 [cs.AI] UPDATED)
    Deep reinforcement learning is actively used for training autonomous and adversarial car policies in a simulated driving environment. Because many reinforcement learning algorithms are available and they have not been systematically compared across different driving scenarios, it is unclear which ones are more effective for training and testing autonomous car software in single-agent as well as multi-agent driving environments. A benchmarking framework for comparing deep reinforcement learning algorithms in vision-based autonomous driving will open up possibilities for training better autonomous car driving policies. Furthermore, autonomous cars trained with deep reinforcement learning-based algorithms are known to be vulnerable to adversarial attacks. To guard against adversarial attacks, we can train autonomous cars on adversarial driving policies. However, it is not known which deep reinforcement learning algorithms would act as good adversarial agents able to effectively test autonomous cars. To address these challenges, we provide an open and reusable benchmarking framework for systematic evaluation and comparative analysis of deep reinforcement learning algorithms for autonomous and adversarial driving in single- and multi-agent environments. Using the framework, we perform a comparative study of five discrete and two continuous action space deep reinforcement learning algorithms. We run the experiments in a vision-only, high-fidelity simulated urban driving environment. The results indicate that only some of the deep reinforcement learning algorithms perform consistently better across single- and multi-agent scenarios when trained in a multi-agent-only setting.
    PSL is Dead. Long Live PSL. (arXiv:2205.14136v1 [cs.LG])
    Property Specification Language (PSL) is a form of temporal logic that has been mainly used in discrete domains (e.g. formal hardware verification). In this paper, we show that by merging machine learning techniques with PSL monitors, we can extend PSL to work on continuous domains. We apply this technique in machine learning-based anomaly detection to analyze scenarios of real-time streaming events from continuous variables in order to detect abnormal behaviors of a system. By using machine learning with formal models, we leverage the strengths of both machine learning methods and formal semantics of time. On one hand, machine learning techniques can produce distributions on continuous variables, where abnormalities can be captured as deviations from the distributions. On the other hand, formal methods can characterize discrete temporal behaviors and relations that cannot be easily learned by machine learning techniques. Interestingly, the anomalies detected by machine learning and the underlying time representation used are discrete events. We implemented a temporal monitoring package (TEF) that operates in conjunction with normal data science packages for anomaly detection machine learning systems, and we show that TEF can be used to perform accurate interpretation of temporal correlation between events.  ( 2 min )
    A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning. (arXiv:2112.15400v2 [cs.LG] UPDATED)
    Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures wherein the outer-loop meta-learner guides the inner-loop gradient-based reinforcement learner to achieve fast adaptations. In this paper, we develop a unified framework that describes variations of GMRL algorithms and points out that existing stochastic meta-gradient estimators adopted by GMRL are actually biased. Such meta-gradient bias comes from two sources: 1) the compositional bias incurred by the two-level problem structure, which has an upper bound of $\mathcal{O}\big(K\alpha^{K}\hat{\sigma}_{\text{In}}|\tau|^{-0.5}\big)$ w.r.t. inner-loop update step $K$, learning rate $\alpha$, estimate variance $\hat{\sigma}^{2}_{\text{In}}$ and sample size $|\tau|$, and 2) the multi-step Hessian estimation bias $\hat{\Delta}_{H}$ due to the use of autodiff, which has a polynomial impact $\mathcal{O}\big((K-1)(\hat{\Delta}_{H})^{K-1}\big)$ on the meta-gradient bias. We study tabular MDPs empirically and offer quantitative evidence that supports our theoretical findings on existing stochastic meta-gradient estimators. Furthermore, we conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods such as off-policy learning and low-bias estimators can help fix the gradient bias for GMRL algorithms in general.
    Spelunking the Deep: Guaranteed Queries on General Neural Implicit Surfaces via Range Analysis. (arXiv:2202.02444v2 [cs.CV] UPDATED)
    Neural implicit representations, which encode a surface as the level set of a neural network applied to spatial coordinates, have proven to be remarkably effective for optimizing, compressing, and generating 3D geometry. Although these representations are easy to fit, it is not clear how to best evaluate geometric queries on the shape, such as intersecting against a ray or finding a closest point. The predominant approach is to encourage the network to have a signed distance property. However, this property typically holds only approximately, leading to robustness issues, and holds only at the conclusion of training, inhibiting the use of queries in loss functions. Instead, this work presents a new approach to perform queries directly on general neural implicit functions for a wide range of existing architectures. Our key tool is the application of range analysis to neural networks, using automatic arithmetic rules to bound the output of a network over a region; we conduct a study of range analysis on neural networks, and identify variants of affine arithmetic which are highly effective. We use the resulting bounds to develop geometric queries including ray casting, intersection testing, constructing spatial hierarchies, fast mesh extraction, closest-point evaluation, evaluating bulk properties, and more. Our queries can be efficiently evaluated on GPUs, and offer concrete accuracy guarantees even on randomly-initialized networks, enabling their use in training objectives and beyond. We also show a preliminary application to inverse rendering.  ( 2 min )
    A Two-Stage Federated Transfer Learning Framework in Medical Images Classification on Limited Data: A COVID-19 Case Study. (arXiv:2203.12803v2 [eess.IV] UPDATED)
    The COVID-19 pandemic has spread rapidly and caused a shortage of global medical resources. The efficiency of COVID-19 diagnosis has become highly significant. As deep learning and convolutional neural networks (CNNs) have been widely utilized and verified in analyzing medical images, they have become a powerful tool for computer-assisted diagnosis. However, there are two significant challenges in medical image classification with deep learning and neural networks. One is the difficulty of acquiring enough samples, which may lead to model overfitting. The other arises mainly from privacy concerns, since medical records are often deemed patients' private information and are protected by laws such as GDPR and HIPAA. Federated learning can ensure that model training is decentralized across different devices and that no data is shared among them, which guarantees privacy. However, with data located on different devices, the data accessible to each device could be limited. Since transfer learning has been shown to perform well with limited data, in this paper we implement federated learning and transfer learning techniques using CNNs to classify COVID-19 from lung CT scans. We also explore the impact of the dataset distribution at the client side in federated learning and of the number of training epochs. Finally, we obtain very high performance with federated learning, demonstrating our success in leveraging accuracy and privacy.
    A Comprehensive Survey on Radio Frequency (RF) Fingerprinting: Traditional Approaches, Deep Learning, and Open Challenges. (arXiv:2201.00680v2 [cs.LG] UPDATED)
    Fifth generation (5G) networks and beyond envision a massive Internet of Things (IoT) rollout to support disruptive applications such as extended reality (XR), augmented/virtual reality (AR/VR), industrial automation, autonomous driving, and smart everything, which brings together massive and diverse IoT devices occupying the radio frequency (RF) spectrum. Along with spectrum crunch and throughput challenges, such a massive scale of wireless devices exposes unprecedented threat surfaces. RF fingerprinting is heralded as a candidate technology that can be combined with cryptographic and zero-trust security measures to ensure data privacy, confidentiality, and integrity in wireless networks. Motivated by the relevance of this subject in future communication networks, in this work we present a comprehensive survey of RF fingerprinting approaches ranging from a traditional view to the most recent deep learning (DL) based algorithms. Existing surveys have mostly focused on a constrained presentation of the wireless fingerprinting approaches, and many aspects remain untold. In this work, we mitigate this by addressing every aspect - background on signal intelligence (SIGINT), applications, relevant DL algorithms, systematic literature review of RF fingerprinting techniques spanning the past two decades, discussion on datasets, and potential research avenues - necessary to elucidate this topic to the reader in an encyclopedic manner.
    Average Adjusted Association: Efficient Estimation with High Dimensional Confounders. (arXiv:2205.14048v1 [stat.ME])
    The log odds ratio is a common parameter to measure association between (binary) outcome and exposure variables. Much attention has been paid to its parametric but robust estimation, or its nonparametric estimation as a function of confounders. However, discussion on how to use a summary statistic by averaging the log odds ratio function is surprisingly difficult to find despite the popularity and importance of averaging in other contexts such as estimating the average treatment effect. We propose a couple of efficient double/debiased machine learning (DML) estimators of the average log odds ratio, where the odds ratios are adjusted for observed (potentially high dimensional) confounders and are averaged over them. The estimators are built from two equivalent forms of the efficient influence function. The first estimator uses a prospective probability of the outcome conditional on the exposure and confounders; the second one employs a retrospective probability of the exposure conditional on the outcome and confounders. Our framework encompasses random sampling as well as outcome-based or exposure-based sampling. Finally, we illustrate how to apply the proposed estimators using real data.
    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. (arXiv:2204.00598v2 [cs.CV] UPDATED)
    Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.
    Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. (arXiv:2205.14065v1 [cs.CV])
    Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations and thereby promises to resolve many critical limitations of traditional single-vector representations such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only with simple and synthetic scenes but not with complex and naturalistic images or videos. In this paper, we propose STEVE, an unsupervised model for object-centric learning in videos. Our proposed model makes a significant advancement by demonstrating its effectiveness on various complex and naturalistic videos unprecedented in this line of research. Interestingly, this is achieved by neither adding complexity to the model architecture nor introducing a new objective or weak supervision. Rather, it is achieved by a surprisingly simple architecture that uses a transformer-based image decoder conditioned on slots and the learning objective is simply to reconstruct the observation. Our experiment results on various complex and naturalistic videos show significant improvements compared to the previous state-of-the-art.
    FairCanary: Rapid Continuous Explainable Fairness. (arXiv:2106.07057v2 [cs.LG] UPDATED)
    Systems that offer continuous model monitoring have emerged in response to (1) well-documented failures of deployed Machine Learning (ML) and Artificial Intelligence (AI) models and (2) new regulatory requirements impacting these models. Existing monitoring systems continuously track the performance of deployed ML models and compute feature importance (a.k.a. explanations) for each prediction to help developers identify the root causes of emergent model performance problems. We present Quantile Demographic Drift (QDD), a novel model bias quantification metric that uses quantile binning to measure differences in the overall prediction distributions over subgroups. QDD is ideal for continuous monitoring scenarios, does not suffer from the statistical limitations of conventional threshold-based bias metrics, and does not require outcome labels (which may not be available at runtime). We incorporate QDD into a continuous model monitoring system, called FairCanary, that reuses existing explanations computed for each individual prediction to quickly compute explanations for the QDD bias metrics. This optimization makes FairCanary an order of magnitude faster than previous work that has tried to generate feature-level bias explanations.
    Supervised Training of Siamese Spiking Neural Networks with Earth Mover's Distance. (arXiv:2203.13207v2 [cs.NE] UPDATED)
    This study adapts the highly-versatile siamese neural network model to the event data domain. We introduce a supervised training framework for optimizing Earth Mover's Distance (EMD) between spike trains with spiking neural networks (SNN). We train this model on images of the MNIST dataset converted into spiking domain with novel conversion schemes. The quality of the siamese embeddings of input images was evaluated by measuring the classifier performance for different dataset coding types. The models achieved performance similar to existing SNN-based approaches (F1-score of up to 0.9386) while using only about 15% of hidden layer neurons to classify each example. Furthermore, models which did not employ a sparse neural code were about 45% slower than their sparse counterparts. These properties make the model suitable for low energy consumption and low prediction latency applications.
    VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. (arXiv:2111.02358v2 [cs.CV] UPDATED)
    We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at https://aka.ms/vlmo.
    ES-GNN: Generalizing Graph Neural Networks Beyond Homophily with Edge Splitting. (arXiv:2205.13700v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved enormous success in tackling analytical problems on graph data. Most GNNs interpret nearly all the node connections as inductive bias with feature smoothness, and implicitly assume strong homophily on the observed graph. However, real-world networks are not always homophilic, but sometimes exhibit heterophilic patterns where adjacent nodes share dissimilar attributes and distinct labels. Therefore, GNNs smoothing the node proximity holistically may aggregate inconsistent information arising from both task-relevant and irrelevant connections. In this paper, we propose a novel edge splitting GNN (ES-GNN) framework, which generalizes GNNs beyond homophily by jointly partitioning network topology and disentangling node features. Specifically, the proposed framework employs an interpretable operation to adaptively split the set of edges of the original graph into two exclusive sets indicating respectively the task-relevant and irrelevant relations among nodes. The node features are then aggregated separately on these two partial edge sets to produce disentangled representations, based on which a more accurate edge splitting can be attained later. Theoretically, we show that our ES-GNN can be regarded as a solution to a graph denoising problem with a disentangled smoothness assumption, which further illustrates our motivations and interprets the improved generalization. Extensive experiments over 8 benchmark datasets and 1 synthetic dataset demonstrate that ES-GNN not only outperforms the state of the art (including 8 GNN baselines), but also can be more robust to adversarial graphs and alleviate the over-smoothing problem.
    On the Sample Complexity of Decentralized Linear Quadratic Regulator with Partially Nested Information Structure. (arXiv:2110.07112v2 [math.OC] UPDATED)
    We study the problem of control policy design for decentralized state-feedback linear quadratic control with a partially nested information structure, when the system model is unknown. We propose a model-based learning solution, which consists of two steps. First, we estimate the unknown system model from a single system trajectory of finite length, using least squares estimation. Next, based on the estimated system model, we design a control policy that satisfies the desired information structure. We show that the suboptimality gap between our control policy and the optimal decentralized control policy (designed using accurate knowledge of the system model) scales linearly with the estimation error of the system model. Using this result, we provide an end-to-end sample complexity result for learning decentralized controllers for a linear quadratic control problem with a partially nested information structure.
    Meta-Learning Adversarial Bandits. (arXiv:2205.14128v1 [cs.LG])
    We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a unified meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-algorithm tunes the initialization, step-size, and entropy parameter of the Tsallis-entropy generalization of the well-known Exp3 method, with the task-averaged regret provably improving if the entropy of the distribution over estimated optima-in-hindsight is small. For BLO, we learn the initialization, step-size, and boundary-offset of online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with a measure induced by these functions on the interior of the action space. Our adaptive guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.  ( 2 min )
    Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications. (arXiv:2205.13963v1 [cs.DC])
    This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.  ( 2 min )
    Surrogate modeling for Bayesian optimization beyond a single Gaussian process. (arXiv:2205.14090v1 [stat.ML])
    Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics. BO hinges on a Bayesian surrogate model to sequentially select query points so as to balance exploration with exploitation of the search space. Most existing works rely on a single Gaussian process (GP) based surrogate model, where the kernel function form is typically preselected using domain knowledge. To bypass such a design process, this paper leverages an ensemble (E) of GPs to adaptively select the surrogate model fit on-the-fly, yielding a GP mixture posterior with enhanced expressiveness for the sought function. Acquisition of the next evaluation input using this EGP-based function posterior is then enabled by Thompson sampling (TS) that requires no additional design parameters. To endow function sampling with scalability, random feature-based kernel approximation is leveraged per GP model. The novel EGP-TS readily accommodates parallel operation. To further establish convergence of the proposed EGP-TS to the global optimum, analysis is conducted based on the notion of Bayesian regret for both sequential and parallel settings. Tests on synthetic functions and real-world applications showcase the merits of the proposed method.
    What Dense Graph Do You Need for Self-Attention?. (arXiv:2205.14014v1 [cs.LG])
    Transformers have made progress in miscellaneous tasks, but suffer from quadratic computational and memory complexities. Recent works propose sparse Transformers with attention on sparse graphs to reduce complexity while retaining strong performance. While these are effective, how dense a graph needs to be to perform well has not been fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function measuring information transfer on a graph, which provides an analysis tool for trade-offs between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results than the vanilla Transformer while yielding $O(N\log N)$ complexity with sequence length $N$. Experiments on tasks requiring various sequence lengths validate our graph function.
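    The $O(N\log N)$ figure in the abstract above can be illustrated with a generic hypercube graph. The sketch below is not the paper's construction: it assumes (hypothetically) that the sequence length is a power of two and that token `i` attends to the tokens whose index differs from `i` in exactly one bit. The name `hypercube_neighbors` is made up for this illustration.

```python
def hypercube_neighbors(n_tokens):
    """Map each token index to its neighbours on a hypercube.

    Assumption for this sketch: n_tokens is a power of two, so every index
    is a corner of a log2(n_tokens)-dimensional hypercube.
    """
    if n_tokens < 1 or n_tokens & (n_tokens - 1):
        raise ValueError("n_tokens must be a power of two")
    dims = n_tokens.bit_length() - 1  # log2(n_tokens)
    # Flipping bit k of index i gives the neighbour along dimension k.
    return {i: [i ^ (1 << k) for k in range(dims)] for i in range(n_tokens)}


graph = hypercube_neighbors(8)
# Each token has log2(N) neighbours, so directed edges total N * log2(N).
edge_count = sum(len(neighbors) for neighbors in graph.values())
print(graph[0], edge_count)
```

    With 8 tokens each index has 3 neighbours, so the attention pattern has 24 directed edges instead of the 64 entries of dense attention.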
    Does Momentum Change the Implicit Regularization on Separable Data?. (arXiv:2110.03891v2 [cs.LG] UPDATED)
    The momentum acceleration technique is widely adopted in many optimization algorithms. However, there is no theoretical answer on how the momentum affects the generalization performance of the optimization algorithms. This paper studies this problem by analyzing the implicit regularization of momentum-based optimization. We prove that on the linear classification problem with separable data and exponential-tailed loss, gradient descent with momentum (GDM) converges to the L2 max-margin solution, which is the same as vanilla gradient descent. That means gradient descent with momentum acceleration still converges to a low-complexity model, which guarantees their generalization. We then analyze the stochastic and adaptive variants of GDM (i.e., SGDM and deterministic Adam) and show they also converge to the L2 max-margin solution. Technically, to overcome the difficulty of the error accumulation in analyzing the momentum, we construct new potential functions to analyze the gap between the model parameter and the max-margin solution. Numerical experiments are conducted and support our theoretical results.  ( 2 min )
    Dynamic Domain Generalization. (arXiv:2205.13913v1 [cs.LG])
    Domain generalization (DG) is a fundamental yet very challenging research topic in machine learning. Existing methods mainly focus on learning domain-invariant features with limited source domains in a static model. Unfortunately, there is no training-free mechanism to adjust the model when it is generalized to agnostic target domains. To tackle this problem, we develop a brand-new DG variant, namely Dynamic Domain Generalization (DDG), in which the model learns to twist the network parameters to adapt to data from different domains. Specifically, we leverage a meta-adjuster to twist the network parameters based on the static model with respect to different data from different domains. In this way, the static model is optimized to learn domain-shared features, while the meta-adjuster is designed to learn domain-specific features. To enable this process, DomainMix is exploited to simulate data from diverse domains while teaching the meta-adjuster to adapt to the upcoming agnostic target domains. This learning mechanism urges the model to generalize to different agnostic target domains by adjusting the model without training. Extensive experiments demonstrate the effectiveness of our proposed method. Code is available at: https://github.com/MetaVisionLab/DDG
    Representing Polymers as Periodic Graphs with Learned Descriptors for Accurate Polymer Property Predictions. (arXiv:2205.13757v1 [cond-mat.mtrl-sci])
    One of the grand challenges of utilizing machine learning for the discovery of innovative new polymers lies in the difficulty of accurately representing the complex structures of polymeric materials. Although a wide array of hand-designed polymer representations have been explored, there has yet to be an ideal solution for how to capture the periodicity of polymer structures, and how to develop polymer descriptors without the need for human feature design. In this work, we tackle these problems through the development of our periodic polymer graph representation. Our pipeline for polymer property predictions is comprised of our polymer graph representation that naturally accounts for the periodicity of polymers, followed by a message-passing neural network (MPNN) that leverages the power of graph deep learning to automatically learn chemically-relevant polymer descriptors. Across a diverse dataset of 10 polymer properties, we find that this polymer graph representation consistently outperforms hand-designed representations with a 20% average reduction in prediction error. Our results illustrate how the incorporation of chemical intuition through directly encoding periodicity into our polymer graph representation leads to a considerable improvement in the accuracy and reliability of polymer property predictions. We also demonstrate how combining polymer graph representations with message-passing neural network architectures can automatically extract meaningful polymer features that are consistent with human intuition, while outperforming human-derived features. This work highlights the advancement in predictive capability that is possible if using chemical descriptors that are specifically optimized for capturing the unique chemical structure of polymers.
    Prototype Based Classification from Hierarchy to Fairness. (arXiv:2205.13997v1 [cs.LG])
    Artificial neural nets can represent and classify many types of data but are often tailored to particular applications -- e.g., for "fair" or "hierarchical" classification. Once an architecture has been selected, it is often difficult for humans to adjust models for a new task; for example, a hierarchical classifier cannot be easily transformed into a fair classifier that shields a protected field. Our contribution in this work is a new neural network architecture, the concept subspace network (CSN), which generalizes existing specialized classifiers to produce a unified model capable of learning a spectrum of multi-concept relationships. We demonstrate that CSNs reproduce state-of-the-art results in fair classification when enforcing concept independence, may be transformed into hierarchical classifiers, or even reconcile fairness and hierarchy within a single classifier. The CSN is inspired by existing prototype-based classifiers that promote interpretability.  ( 2 min )
    Double Deep Q Networks for Sensor Management in Space Situational Awareness. (arXiv:2205.14041v1 [cs.LG])
    We present a novel Double Deep Q Network (DDQN) application to a sensor management problem in space situational awareness (SSA). Frequent launches of satellites into Earth orbit pose a significant sensor management challenge, whereby a limited number of sensors are required to detect and track an increasing number of objects. In this paper, we demonstrate the use of reinforcement learning to develop a sensor management policy for SSA. We simulate a controllable Earth-based telescope, which is trained to maximise the number of satellites tracked using an extended Kalman filter. The estimated state covariance matrices for satellites observed under the DDQN policy are greatly reduced compared to those generated by an alternate (random) policy. This work provides the basis for further advancements and motivates the use of reinforcement learning for SSA.
    Robust Counterfactual Explanations for Random Forests. (arXiv:2205.14116v1 [cs.LG])
    Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Several heuristic and optimal methods have been proposed to generate these explanations. However, the robustness of counterfactual explanations when the classifier is re-trained has yet to be studied. Our goal is to obtain counterfactual explanations for random forests that are robust to algorithmic uncertainty. We study the link between the robustness of ensemble models and the robustness of base learners and frame the generation of robust counterfactual explanations as a chance-constrained optimization problem. We develop a practical method with good empirical performance and provide finite-sample and asymptotic guarantees for simple random forests of stumps. We show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below $50\%$ on most data sets and can fall to $20\%$ on large problem instances with many features. Even with high plausibility, counterfactual explanations often exhibit low robustness to algorithmic uncertainty. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations. Furthermore, we highlight the connection between the robustness of counterfactual explanations and the predictive importance of features.  ( 2 min )
    Towards a Unified Framework for Uncertainty-aware Nonlinear Variable Selection with Theoretical Guarantees. (arXiv:2204.07293v2 [stat.ML] UPDATED)
    We develop a simple and unified framework for nonlinear variable selection that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, and neural networks). In particular, for a learned nonlinear model $f(\mathbf{x})$, we consider quantifying the importance of an input variable $\mathbf{x}^j$ using the integrated partial derivative $\Psi_j = \Vert \frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\Vert^2_{P_\mathcal{X}}$. We then (1) provide a principled approach for quantifying variable selection uncertainty by deriving its posterior distribution, and (2) show that the approach is generalizable even to non-differentiable models such as tree ensembles. Rigorous Bayesian nonparametric theorems are derived to guarantee the posterior consistency and asymptotic uncertainty of the proposed approach. Extensive simulations and experiments on healthcare benchmark datasets confirm that the proposed algorithm outperforms existing classic and recent variable selection methods.
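    The importance score $\Psi_j$ in the abstract above is the squared partial derivative of the model averaged over the input distribution. A minimal sketch of a plug-in point estimate follows, assuming a hypothetical toy model `f` and central finite differences in place of exact gradients; it does not reproduce the paper's posterior-based uncertainty quantification.

```python
import random

def f(x):
    # Hypothetical toy model (an assumption): uses x[0] and x[1], ignores x[2].
    return 3.0 * x[0] + x[1] ** 2

def partial_derivative(func, x, j, h=1e-5):
    # Central finite-difference approximation of d func / d x[j] at x.
    x_plus, x_minus = list(x), list(x)
    x_plus[j] += h
    x_minus[j] -= h
    return (func(x_plus) - func(x_minus)) / (2.0 * h)

def integrated_importance(func, samples, j):
    # Empirical estimate of Psi_j: mean squared partial derivative over samples.
    return sum(partial_derivative(func, x, j) ** 2 for x in samples) / len(samples)

random.seed(0)
# Samples standing in for draws from the input distribution P_X (assumed uniform here).
samples = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2000)]
psi = [integrated_importance(f, samples, j) for j in range(3)]
print(psi)
```

    In this toy setting the unused variable receives a score of zero, which is the behavior a selection rule based on $\Psi_j$ would rely on.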
    Semantic Exploration from Language Abstractions and Pretrained Representations. (arXiv:2204.05080v2 [cs.LG] UPDATED)
    Effective exploration is a challenge in reinforcement learning (RL). Novelty-based exploration methods can suffer in high-dimensional state spaces, such as continuous partially-observable 3D environments. We address this challenge by defining novelty using semantically meaningful state abstractions, which can be found in learned representations shaped by natural language. In particular, we evaluate vision-language representations, pretrained on natural image captioning datasets. We show that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments. We also characterize why and how language provides useful abstractions for exploration by considering the impacts of using representations from a pretrained model, a language oracle, and several ablations. We demonstrate the benefits of our approach in two very different task domains -- one that stresses the identification and manipulation of everyday objects, and one that requires navigational exploration in an expansive world -- as well as two popular deep RL algorithms: Impala and R2D2. Our results suggest that using language-shaped representations could improve exploration for various algorithms and agents in challenging environments.
    Big-means: Less is More for K-means Clustering. (arXiv:2204.07485v2 [cs.LG] UPDATED)
    K-means clustering plays a vital role in data mining. However, its performance drastically drops when applied to huge amounts of data. We propose a new heuristic that is built on the basis of regular K-means for faster and more accurate big data clustering using the "less is more" and decomposition approaches. The main advantage of the proposed algorithm is that it naturally turns the K-means local search into a global one through the process of decomposition of the minimum sum-of-squares clustering (MSSC) problem. On one hand, decomposition of the MSSC problem into smaller subproblems reduces the computational complexity and allows for their parallel processing. On the other hand, the MSSC decomposition provides a new method for the natural data-driven shaking of the incumbent solution while introducing a new neighborhood structure for the solution of the MSSC problem. The proposed algorithm is scalable, fast, and accurate. The scalability of the algorithm can be easily adjusted by choosing the appropriate number of subproblems and their size. In our experiments it outperforms all recent state-of-the-art algorithms for the MSSC in both time and solution quality.
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets. (arXiv:2204.07615v2 [cs.LG] UPDATED)
    The best neural architecture for a given machine learning problem depends on many factors: not only on the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
    Fairness and Welfare Quantification for Regret in Multi-Armed Bandits. (arXiv:2205.13930v1 [cs.LG])
    We extend the notion of regret with a welfarist perspective. Focussing on the classic multi-armed bandit (MAB) framework, the current work quantifies the performance of bandit algorithms by applying a fundamental welfare function, namely the Nash social welfare (NSW) function. This corresponds to equating an algorithm's performance to the geometric mean of its expected rewards and leads us to the study of Nash regret, defined as the difference between the -- a priori unknown -- optimal mean (among the arms) and the algorithm's performance. Since NSW is known to satisfy fairness axioms, our approach complements the utilitarian considerations of average (cumulative) regret, wherein the algorithm is evaluated via the arithmetic mean of its expected rewards. This work develops an algorithm that, given the horizon of play $T$, achieves a Nash regret of $O \left( \sqrt{\frac{{k \log T}}{T}} \right)$, where $k$ denotes the number of arms in the MAB instance. Since, for any algorithm, the Nash regret is at least as much as its average regret (the AM-GM inequality), the known lower bound on average regret holds for Nash regret as well. Therefore, our Nash regret guarantee is essentially tight. In addition, we develop an anytime algorithm with a Nash regret guarantee of $O \left( \sqrt{\frac{{k\log T}}{T}} \log T \right)$.
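    The contrast between average regret (arithmetic mean) and Nash regret (geometric mean) described above can be illustrated numerically. The sketch below assumes a hypothetical sequence of per-round expected rewards; the function names are illustrative and not taken from the paper.

```python
import math

def average_regret(mu_star, expected_rewards):
    # Optimal mean minus the arithmetic mean of the per-round expected rewards.
    return mu_star - sum(expected_rewards) / len(expected_rewards)

def nash_regret(mu_star, expected_rewards):
    # Optimal mean minus the geometric mean of the per-round expected rewards.
    if any(r <= 0.0 for r in expected_rewards):
        return mu_star  # geometric mean is zero if any round has zero expected reward
    log_gm = sum(math.log(r) for r in expected_rewards) / len(expected_rewards)
    return mu_star - math.exp(log_gm)

# Hypothetical expected reward of the arm pulled in each round (assumed values).
expected_rewards = [0.2, 0.9, 0.9, 0.5, 0.9, 0.9]
mu_star = 0.9  # mean of the best arm in this made-up instance

print(average_regret(mu_star, expected_rewards))
print(nash_regret(mu_star, expected_rewards))
```

    By the AM-GM inequality the second value is never smaller than the first, and a single round with a very low expected reward penalizes the Nash regret far more than the average regret.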
    ProtoFSSL: Federated Semi-Supervised Learning with Prototype-based Consistency Regularization. (arXiv:2205.13921v1 [cs.LG])
    With the increasing computing power of edge devices, Federated Learning (FL) emerges to enable model training without privacy concerns. The majority of existing studies assume the data are fully labeled on the client side. In practice, however, the amount of labeled data is often limited. Recently, federated semi-supervised learning (FSSL) has been explored as a way to effectively utilize unlabeled data during training. In this work, we propose ProtoFSSL, a novel FSSL approach based on prototypical networks. In ProtoFSSL, clients share knowledge with each other via lightweight prototypes, which prevents the local models from diverging. For computing loss on unlabeled data, each client creates accurate pseudo-labels based on shared prototypes. Jointly with labeled data, the pseudo-labels provide training signals for local prototypes. Compared to an FSSL approach based on weight sharing, the prototype-based inter-client knowledge sharing significantly reduces both communication and computation costs, enabling more frequent knowledge sharing between more clients for better accuracy. On multiple datasets, ProtoFSSL results in higher accuracy compared to the recent FSSL methods with and without knowledge sharing, such as FixMatch, FedRGD, and FedMatch. On the SVHN dataset, ProtoFSSL performs comparably to fully supervised FL methods.
    GALAXY: Graph-based Active Learning at the Extreme. (arXiv:2202.01402v2 [cs.LG] UPDATED)
    Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training. In "open world" settings, the classes of interest can make up a small fraction of the overall dataset -- most of the data may be viewed as an out-of-distribution or irrelevant class. This leads to extreme class-imbalance, and our theory and methods focus on this core issue. We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning. GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning. Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling. Experimentally, we demonstrate GALAXY's superiority over existing state-of-the-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets.
    Rethinking ValueDice: Does It Really Improve Performance?. (arXiv:2202.02468v2 [cs.LG] UPDATED)
    Since the introduction of GAIL, adversarial imitation learning (AIL) methods have attracted considerable research interest. Among these methods, ValueDice has achieved significant improvements: it beats the classical approach Behavioral Cloning (BC) under the offline setting, and it requires fewer interactions than GAIL under the online setting. Do these improvements come from more advanced algorithm designs? We answer this question with the following conclusions. First, we show that ValueDice could reduce to BC under the offline setting. Second, we verify that overfitting exists and regularization matters in the low-data regime. Specifically, we demonstrate that with weight decay, BC also nearly matches the expert performance as ValueDice does. The first two claims explain the superior offline performance of ValueDice. Third, we establish that ValueDice does not work when the expert trajectory is subsampled. Instead, the mentioned success of ValueDice holds when the expert trajectory is complete, in which case ValueDice is closely related to BC, which performs well as mentioned. Finally, we discuss the implications of our research for imitation learning studies beyond ValueDice.
    Minimax Regret for Cascading Bandits. (arXiv:2203.12577v2 [cs.LG] UPDATED)
    Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.
    Intelligent Transportation Systems' Orchestration: Lessons Learned & Potential Opportunities. (arXiv:2205.14040v1 [cs.NI])
    The growing deployment efforts of 5G networks globally have led to the acceleration of the businesses/services' digital transformation. This growth has led to the need for new communication technologies that will promote this transformation. 6G is being proposed as the set of technologies and architectures that will achieve this target. Among the main use cases that have emerged for 5G networks and will continue to play a pivotal role in 6G networks is that of Intelligent Transportation Systems (ITSs). With all the projected benefits of developing and deploying efficient and effective ITSs comes a group of unique challenges that need to be addressed. One prominent challenge is ITS orchestration due to the various supporting technologies and heterogeneous networks used to offer the desired ITS applications/services. To that end, this paper focuses on the ITS orchestration challenge in detail by highlighting the related previous works from the literature and listing the lessons learned from current ITS deployment orchestration efforts. It also presents multiple potential data-driven research opportunities in which paradigms such as reinforcement learning and federated learning can be deployed to offer effective and efficient ITS orchestration.
    TimeREISE: Time-series Randomized Evolving Input Sample Explanation. (arXiv:2202.07952v2 [cs.LG] UPDATED)
    Deep neural networks are one of the most successful classifiers across different domains. However, due to their limitations concerning interpretability, their use is limited in safety-critical contexts. The research field of explainable artificial intelligence addresses this problem. However, most of the interpretability methods are aligned to the image modality by design. The paper introduces TimeREISE, a model-agnostic attribution method specifically aligned to success in the context of time series classification. The method shows superior performance compared to existing approaches concerning different well-established measurements. TimeREISE is applicable to any time series classification network; its runtime does not scale in a linear manner with the input shape, and it does not rely on prior data knowledge.
    Lifting the Information Ratio: An Information-Theoretic Analysis of Thompson Sampling for Contextual Bandits. (arXiv:2205.13924v1 [cs.LG])
    We study the Bayesian regret of the renowned Thompson Sampling algorithm in contextual bandits with binary losses and adversarially-selected contexts. We adapt the information-theoretic perspective of Russo and Van Roy [2016] to the contextual setting by introducing a new concept of information ratio based on the mutual information between the unknown model parameter and the observed loss. This allows us to bound the regret in terms of the entropy of the prior distribution through a remarkably simple proof, and with no structural assumptions on the likelihood or the prior. The extension to priors with infinite entropy only requires a Lipschitz assumption on the log-likelihood. An interesting special case is that of logistic bandits with d-dimensional parameters, K actions, and Lipschitz logits, for which we provide a $\widetilde{O}(\sqrt{dKT})$ regret upper-bound that does not depend on the smallest slope of the sigmoid link function.
    AANG: Automating Auxiliary Learning. (arXiv:2205.14082v1 [cs.LG])
    When faced with data-starved or highly complex end-tasks, it is commonplace for machine learning practitioners to introduce auxiliary objectives as supplementary learning signals. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuitions about how and when these objectives improve end-task performance have also had limited theoretical backing. In this work, we present an approach for automatically generating a suite of auxiliary objectives. We achieve this by deconstructing existing objectives within a novel unified taxonomy, identifying connections between them, and generating new ones based on the uncovered structure. Next, we theoretically formalize widely-held intuitions about how auxiliary learning improves generalization of the end-task. This leads us to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. With natural language processing (NLP) as our domain of study, we empirically verify that our automated auxiliary learning pipeline leads to strong improvements over competitive baselines across continued training experiments on a pre-trained model on 5 NLP end-tasks.
    MyoSuite -- A contact-rich simulation suite for musculoskeletal motor control. (arXiv:2205.13600v1 [cs.RO])
    Embodied agents in continuous control domains have had limited exposure to tasks that allow exploring the musculoskeletal properties that enable agile and nimble behaviors in biological beings. The sophistication behind neuro-musculoskeletal control can pose new challenges for the motor learning community. At the same time, agents solving complex neural control problems can have impact in fields such as neuro-rehabilitation, as well as collaborative robotics. Human biomechanics underlies complex multi-joint-multi-actuator musculoskeletal systems. The sensory-motor system relies on a range of sensory-contact rich and proprioceptive inputs that define and condition muscle actuation required to exhibit intelligent behaviors in the physical world. Current frameworks for musculoskeletal control do not support physiological sophistication of the musculoskeletal systems along with physical world interaction capabilities. In addition, they are neither embedded in complex and skillful motor tasks nor computationally effective and scalable enough to study large-scale learning paradigms. Here, we present MyoSuite -- a suite of physiologically accurate biomechanical models of elbow, wrist, and hand, with physical contact capabilities, which allow learning of complex and skillful contact-rich real-world tasks. We provide diverse motor-control challenges: from simple postural control to skilled hand-object interactions such as turning a key, twirling a pen, rotating two balls in one hand, etc. By supporting physiological alterations in musculoskeletal geometry (tendon transfer), assistive devices (exoskeleton assistance), and muscle contraction dynamics (muscle fatigue, sarcopenia), we present real-life tasks with temporal changes, thereby exposing realistic non-stationary conditions in our tasks which most continuous control benchmarks lack.
    Fast variable selection makes scalable Gaussian process BSS-ANOVA a speedy and accurate choice for tabular and time series regression. (arXiv:2205.13676v1 [cs.LG])
    Gaussian processes (GPs) are non-parametric regression engines with a long history. They are often overlooked in modern machine learning contexts because of scalability issues: regression for traditional GP kernels is $\mathcal{O}(N^3)$ where $N$ is the size of the dataset. One of a number of scalable GP approaches is the Karhunen-Loève (KL) decomposed kernel BSS-ANOVA, developed in 2009. It is $\mathcal{O}(NP)$ in training and $\mathcal{O}(P)$ per point in prediction, where $P$ is the number of terms in the ANOVA / KL expansion. A new method of forward variable selection quickly and effectively limits the number of terms, yielding a method with competitive accuracies, training and inference times for large tabular datasets. The new algorithm balances model fidelity with model complexity using Bayesian and Akaike information criteria (BIC/AIC). The inference speed and accuracy make the method especially useful for modeling dynamic systems in a model-free manner, by modeling the derivative in a dynamic system as a static problem, then integrating the learned dynamics using a high-order scheme. The methods are demonstrated on a `Susceptible, Infected, Recovered' (SIR) toy problem, with the transmissibility used as forcing function, along with the `Cascaded Tanks' benchmark dataset. Comparisons on the static prediction of derivatives are made with a Random Forest and Residual Neural Network, while for the timeseries prediction comparisons are made with LSTM and GRU recurrent neural networks. The GP outperforms the other methods in all modeling tasks on accuracy, while (in the case of the neural networks) performing many orders of magnitude fewer calculations. For the SIR test, which involved prediction for a set of forcing functions qualitatively different from those appearing in the training set, the GP captured the correct dynamics while the neural networks failed to do so.
    Learning Dense Reward with Temporal Variant Self-Supervision. (arXiv:2205.10431v2 [cs.LG] UPDATED)
    Rewards play an essential role in reinforcement learning. In contrast to rule-based game environments with well-defined reward functions, complex real-world robotic applications, such as contact-rich manipulation, lack explicit and informative descriptions that can directly be used as a reward. Previous effort has shown that it is possible to algorithmically extract dense rewards directly from multimodal observations. In this paper, we aim to extend this effort by proposing a more efficient and robust way of sampling and learning. In particular, our sampling approach utilizes temporal variance to simulate the fluctuating state and action distribution of a manipulation task. We then propose a network architecture for self-supervised learning to better incorporate temporal information in latent representations. We test our approach in two experimental setups, namely joint-assembly and door-opening. Preliminary results show that our approach is effective and efficient in learning dense rewards, and the learned rewards lead to faster convergence than baselines.
    Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks. (arXiv:2201.02143v2 [cs.LG] UPDATED)
    Classification of long sequential data is an important Machine Learning task and appears in many application scenarios. Recurrent Neural Networks, Transformers, and Convolutional Neural Networks are three major techniques for learning from sequential data. Among these methods, Temporal Convolutional Networks (TCNs), which are scalable to very long sequences, have achieved remarkable progress in time series regression. However, the performance of TCNs for sequence classification is not satisfactory because they use a skewed connection protocol and output classes at the last position. Such asymmetry restricts their performance for classification, which depends on the whole sequence. In this work, we propose a symmetric multi-scale architecture called Circular Dilated Convolutional Neural Network (CDIL-CNN), where every position has an equal chance to receive information from other positions at the previous layers. Our model gives classification logits in all positions, and we can apply simple ensemble learning to achieve a better decision. We have tested CDIL-CNN on various long sequential datasets. The experimental results show that our method has superior performance over many state-of-the-art approaches.
    Quark: Controllable Text Generation with Reinforced Unlearning. (arXiv:2205.13636v1 [cs.CL])
    Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining close to the original language model via a KL-divergence penalty. By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO (Schulman et al. 2017), while relying only on standard language modeling primitives.
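    Step (ii) of the loop described above, sorting samples into reward quantiles and tagging each with a reward token, can be sketched as follows. The token format `<rw_q>`, the function name, and the example data are hypothetical choices for illustration, not Quark's actual implementation.

```python
def assign_reward_tokens(samples, rewards, num_quantiles):
    # Rank samples by reward (lowest first), split the ranking into equally
    # sized quantiles, and prepend a token naming each sample's quantile.
    order = sorted(range(len(samples)), key=lambda i: rewards[i])
    tagged = [None] * len(samples)
    for rank, i in enumerate(order):
        quantile = rank * num_quantiles // len(samples)  # 0 = lowest-reward bucket
        tagged[i] = f"<rw_{quantile}> {samples[i]}"
    return tagged

# Made-up generations and reward scores (assumptions for illustration).
samples = ["a", "b", "c", "d"]
rewards = [0.1, 0.9, 0.5, 0.3]
tagged = assign_reward_tokens(samples, rewards, num_quantiles=2)
print(tagged)
```

    Under this scheme, generation-time conditioning would correspond to prepending the token of the highest quantile to the prompt.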
    U-NO: U-shaped Neural Operators. (arXiv:2204.11127v2 [cs.LG] UPDATED)
    Neural operators generalize classical neural networks to maps between infinite-dimensional spaces, e.g. function spaces. Prior works on neural operators proposed a series of novel architectures to learn such maps and demonstrated unprecedented success in learning solution operators of partial differential equations. Due to their close proximity to fully connected architectures, these models mainly suffer from high memory usage and are generally limited to shallow deep learning models. In this paper, we propose U-shaped Neural Operator (U-NO), a U-shaped memory enhanced architecture that allows for deeper neural operators. U-NOs exploit the problem structures in function predictions and demonstrate fast training, data efficiency, and robustness with respect to hyperparameter choices. We study the performance of U-NO on PDE benchmarks, namely, Darcy's flow law and the Navier-Stokes equations. We show that U-NO results in an average of 14% and 34% prediction improvement on Darcy's flow and turbulent Navier-Stokes equations, respectively, over the state of the art. On the Navier-Stokes 3D spatio-temporal operator learning task, we show U-NO provides a 40% improvement over state-of-the-art methods.
    Auditing Differential Privacy in High Dimensions with the Kernel Quantum Rényi Divergence. (arXiv:2205.13941v1 [cs.LG])
    Differential privacy (DP) is the de facto standard for private data release and private machine learning. Auditing black-box DP algorithms and mechanisms to certify whether they satisfy a certain DP guarantee is challenging, especially in high dimension. We propose relaxations of differential privacy based on new divergences on probability distributions: the kernel Rényi divergence and its regularized version. We show that the regularized kernel Rényi divergence can be estimated from samples even in high dimensions, giving rise to auditing procedures for $\varepsilon$-DP, $(\varepsilon,\delta)$-DP and $(\alpha,\varepsilon)$-Rényi DP.
    Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power. (arXiv:2205.13863v1 [cs.LG])
    It is well-known that modern neural networks are vulnerable to adversarial examples. To mitigate this problem, a series of robust learning algorithms have been proposed. However, although the robust training error can be near zero via some methods, all existing algorithms lead to a high robust generalization error. In this paper, we provide a theoretical understanding of this puzzling phenomenon from the perspective of expressive power for deep neural networks. Specifically, for binary classification problems with well-separated data, we show that, for ReLU networks, while mild over-parameterization is sufficient for high robust training accuracy, there exists a constant robust generalization gap unless the size of the neural network is exponential in the data dimension $d$. Even if the data is linearly separable, which means achieving low clean generalization error is easy, we can still prove an $\exp({\Omega}(d))$ lower bound for robust generalization. Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$). Nonetheless, we also have a lower bound that grows exponentially with respect to $k$ -- the curse of dimensionality is inevitable. By demonstrating an exponential separation between the network size for achieving low robust training and generalization error, our results reveal that the hardness of robust generalization may stem from the expressive power of practical models.
    Automated Dynamic Algorithm Configuration. (arXiv:2205.13881v1 [cs.AI])
    The performance of an algorithm often critically depends on its parameter configuration. While a variety of automated algorithm configuration methods have been proposed to relieve users from the tedious and error-prone task of manually tuning parameters, there is still a lot of untapped potential as the learned configuration is static, i.e., parameter settings remain fixed throughout the run. However, it has been shown that some algorithm parameters are best adjusted dynamically during execution, e.g., to adapt to the current part of the optimization landscape. Thus far, this is most commonly achieved through hand-crafted heuristics. A promising recent alternative is to automatically learn such dynamic parameter adaptation policies from data. In this article, we give the first comprehensive account of this new field of automated dynamic algorithm configuration (DAC), present a series of recent advances, and provide a solid foundation for future research in this field. Specifically, we (i) situate DAC in the broader historical context of AI research; (ii) formalize DAC as a computational problem; (iii) identify the methods used in prior-art to tackle this problem; (iv) conduct empirical case studies for using DAC in evolutionary optimization, AI planning, and machine learning.
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v2 [stat.ML] UPDATED)
    We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.
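    The estimator described above can be written out explicitly. The following is a sketch under assumed notation (the symbols $K$, $v_k$, $\mathbf{w}_k$, $b_k$, and $\lambda$ are not given in the abstract, and the choice to leave the biases unregularized is an assumption): a shallow ReLU network with $K$ neurons is fit to data $(\mathbf{x}_i, y_i)_{i=1}^N$ by minimizing squared error plus a weight-decay penalty.

```latex
f_\theta(\mathbf{x}) = \sum_{k=1}^{K} v_k \,\max\{0,\ \mathbf{w}_k^\top \mathbf{x} + b_k\},
\qquad
\hat{\theta} \in \arg\min_{\theta}\;
\sum_{i=1}^{N} \bigl(y_i - f_\theta(\mathbf{x}_i)\bigr)^2
+ \lambda \sum_{k=1}^{K} \bigl( |v_k|^2 + \|\mathbf{w}_k\|_2^2 \bigr)
```

    Here the first sum is the data-fitting term and the second is the squared Euclidean norm of the network weights scaled by the regularization parameter $\lambda > 0$.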
    Neural Basis Models for Interpretability. (arXiv:2205.14120v1 [cs.LG])
    Due to the widespread use of complex machine learning models in real-world applications, it is becoming critical to explain model predictions. However, these models are typically black-box deep neural networks, explained post-hoc via methods with known faithfulness limitations. Generalized Additive Models (GAMs) are an inherently interpretable class of models that address this limitation by learning a non-linear shape function for each feature separately, followed by a linear model on top. However, these models are typically difficult to train, require numerous parameters, and are difficult to scale. We propose an entirely new subfamily of GAMs that utilizes basis decomposition of shape functions. A small number of basis functions are shared among all features, and are learned jointly for a given task, thus making our model scale much better to large-scale data with high-dimensional features, especially when features are sparse. We propose an architecture denoted as the Neural Basis Model (NBM) which uses a single neural network to learn these bases. On a variety of tabular and image datasets, we demonstrate that for interpretable machine learning, NBMs are the state-of-the-art in accuracy, model size, and throughput and can easily model all higher-order feature interactions.
    Hybrid training of optical neural networks. (arXiv:2203.11207v2 [cs.LG] UPDATED)
    Optical neural networks are emerging as a promising type of machine learning hardware capable of energy-efficient, parallel computation. Today's optical neural networks are mainly developed to perform optical inference after in silico training on digital simulators. However, various physical imperfections that cannot be accurately modelled may lead to the notorious reality gap between the digital simulator and the physical system. To address this challenge, we demonstrate hybrid training of optical neural networks where the weight matrix is trained with neuron activation functions computed optically via forward propagation through the network. We examine the efficacy of hybrid training with three different networks: an optical linear classifier, a hybrid opto-electronic network, and a complex-valued optical network. We perform a comparative study to in silico training, and our results show that hybrid training is robust against different kinds of static noise. Our platform-agnostic hybrid training scheme can be applied to a wide variety of optical neural networks, and this work paves the way towards advanced all-optical training in machine intelligence.
    Generalization Bounds for Gradient Methods via Discrete and Continuous Prior. (arXiv:2205.13799v1 [cs.LG])
    Proving algorithm-dependent generalization error bounds for gradient-type optimization methods has attracted significant attention recently in learning theory. However, most existing trajectory-based analyses require either restrictive assumptions on the learning rate (e.g., fast decreasing learning rate), or continuous injected noise (such as the Gaussian noise in Langevin dynamics). In this paper, we introduce a new discrete data-dependent prior to the PAC-Bayesian framework, and prove a high probability generalization bound of order $O(\frac{1}{n}\cdot \sum_{t=1}^T(\gamma_t/\varepsilon_t)^2\left\|{\mathbf{g}_t}\right\|^2)$ for Floored GD (i.e. a version of gradient descent with precision level $\varepsilon_t$), where $n$ is the number of training samples, $\gamma_t$ is the learning rate at step $t$, and $\mathbf{g}_t$ is roughly the difference of the gradient computed using all samples and that using only prior samples. $\left\|{\mathbf{g}_t}\right\|$ is upper bounded by, and typically much smaller than, the gradient norm $\left\|{\nabla f(W_t)}\right\|$. We remark that our bound holds for nonconvex and nonsmooth scenarios. Moreover, our theoretical results provide numerically favorable upper bounds of testing errors (e.g., $0.037$ on MNIST). Using a similar technique, we can also obtain new generalization bounds for certain variants of SGD. Furthermore, we study the generalization bounds for gradient Langevin Dynamics (GLD). Using the same framework with a carefully constructed continuous prior, we show a new high probability generalization bound of order $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ for GLD. The new $1/n^2$ rate is due to the concentration of the difference between the gradient of training samples and that of the prior.
    Reinforcement Learning Approach for Mapping Applications to Dataflow-Based Coarse-Grained Reconfigurable Array. (arXiv:2205.13675v1 [cs.AR])
    The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the correct execution of the program. This creates an optimization problem with a vast and sparse search space for which finding a mapping manually is impractical because it requires expertise and knowledge of the SE micro-architecture. In this work we propose a Reinforcement Learning framework with Global Graph Attention (GGA) module and output masking of invalid placements to find and optimize instruction schedules. We use Proximal Policy Optimization in order to train a model which places operations into the SE tiles based on a reward function that models the SE device and its constraints. The GGA module consists of a graph neural network and an attention module. The graph neural network creates embeddings of the SDFs and the attention block is used to model sequential operation placement. We show results on how certain workloads are mapped to the SE and the factors affecting mapping quality. We find that the addition of GGA, on average, finds 10% better instruction schedules in terms of total clock cycles taken and masking improves reward obtained by 20%.
    EvenNet: Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks. (arXiv:2205.13892v1 [cs.LG])
    Graph Neural Networks (GNNs) have received extensive research attention for their promising performance in graph machine learning. Despite their extraordinary predictive accuracy, existing approaches, such as GCN and GPRGNN, are not robust in the face of homophily changes on test graphs, rendering these models vulnerable to graph structural attacks and with limited capacity in generalizing to graphs of varied homophily levels. Although many methods have been proposed to improve the robustness of GNN models, most of these techniques are restricted to the spatial domain and employ complicated defense mechanisms, such as learning new graph structures or calculating edge attentions. In this paper, we study the problem of designing simple and robust GNN models in the spectral domain. We propose EvenNet, a spectral GNN corresponding to an even-polynomial graph filter. Based on our theoretical analysis in both spatial and spectral domains, we demonstrate that EvenNet outperforms full-order models in generalizing across homophilic and heterophilic graphs, implying that ignoring odd-hop neighbors improves the robustness of GNNs. We conduct experiments on both synthetic and real-world datasets to demonstrate the effectiveness of EvenNet. Notably, EvenNet outperforms existing defense models against structural attacks without introducing additional computational costs and maintains competitiveness in traditional node classification tasks on homophilic and heterophilic graphs.
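    An even-polynomial graph filter can be sketched as follows: the output mixes only even powers of a propagation matrix applied to the node signal, so information from odd-hop neighbors never enters. The choice of propagation matrix `P` (for example, a normalized adjacency) and the names `matvec` and `even_filter` are assumptions made for illustration; the abstract does not specify EvenNet's exact parameterization.

```python
def matvec(P, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]


def even_filter(P, x, weights):
    """Compute sum_k weights[k] * P^(2k) x, using only even hop counts.

    P is an assumed propagation matrix; weights[k] scales the 2k-hop term.
    """
    out = [0.0] * len(x)
    h = list(x)  # current even power applied to x, starting at P^0 x
    for w in weights:
        out = [o + w * v for o, v in zip(out, h)]
        h = matvec(P, matvec(P, h))  # advance by two hops at once
    return out
```

    On a two-node graph with a single edge, a signal placed on one node stays on that node under this filter, since the other node is exactly one hop (an odd distance) away.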
    Scalable Interpretability via Polynomials. (arXiv:2205.14108v1 [cs.LG])
    Generalized Additive Models (GAMs) have quickly become the leading choice for fully-interpretable machine learning. However, unlike uninterpretable methods such as DNNs, they lack expressive power and easy scalability, and are hence not a feasible alternative for real-world tasks. We present a new class of GAMs that use tensor rank decompositions of polynomials to learn powerful, $\textit{fully-interpretable}$ models. Our approach, titled Scalable Polynomial Additive Models (SPAM) is effortlessly scalable and models $\textit{all}$ higher-order feature interactions without a combinatorial parameter explosion. SPAM outperforms all current interpretable approaches, and matches DNN/XGBoost performance on a series of real-world benchmarks with up to hundreds of thousands of features. We demonstrate by human subject evaluations that SPAMs are demonstrably more interpretable in practice, and are hence an effortless replacement for DNNs for creating interpretable and high-performance systems suitable for large-scale machine learning.
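    To illustrate why a rank decomposition avoids a combinatorial parameter count, the sketch below uses the degree-2 case: a quadratic form whose coefficient matrix is a sum of rank-one terms can be evaluated from the factors alone, without ever forming the full matrix of pairwise interactions. This is a generic identity, not necessarily the exact parameterization used by SPAM, and the function names are hypothetical.

```python
def dot(u, x):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, x))


def low_rank_quadratic(x, lambdas, factors):
    """Evaluate sum_r lambdas[r] * (factors[r] . x)^2 using only the factors."""
    return sum(lam * dot(u, x) ** 2 for lam, u in zip(lambdas, factors))


def full_quadratic(x, lambdas, factors):
    """Same value via the explicit d x d matrix W = sum_r lambdas[r] * u_r u_r^T."""
    d = len(x)
    W = [
        [sum(lam * u[i] * u[j] for lam, u in zip(lambdas, factors)) for j in range(d)]
        for i in range(d)
    ]
    return sum(W[i][j] * x[i] * x[j] for i in range(d) for j in range(d))
```

    With rank R and d features, the factored form stores R * (d + 1) numbers instead of d * d, and the gap widens for higher-degree interactions.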
    Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation. (arXiv:2202.01336v4 [cs.LG] UPDATED)
    Neural networks (NNs) are often leveraged to represent structural similarities of potential outcomes (POs) of different treatment groups to obtain better finite-sample estimates of treatment effects. However, despite their wide use, existing works handcraft treatment-specific (sub)network architectures for representing various POs, which limit their applicability and generalizability. To remedy these issues, we develop a framework called Transformers as Treatment Effect Estimators (TransTEE) where attention layers govern interactions among treatments and covariates to exploit structural similarities of POs for confounding control. Using this framework, through extensive experiments, we show that TransTEE can: (1) serve as a general-purpose treatment effect estimator to significantly outperform competitive baselines on a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, interpretability in covariate adjustment, and real-world utility in debugging pre-trained language models.
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v2 [cs.LG] UPDATED)
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet there only exists a limited number of works dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite contexts assumptions, leaving behind the main question: can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.
    Efficient Forecasting of Large Scale Hierarchical Time Series via Multilevel Clustering. (arXiv:2205.14104v1 [cs.LG])
    We propose a novel approach to the problem of clustering hierarchically aggregated time-series data, which has remained an understudied problem though it has several commercial applications. We first group time series at each aggregated level, while simultaneously leveraging local and global information. The proposed method can cluster hierarchical time series (HTS) with different lengths and structures. For common two-level hierarchies, we employ a combined objective for local and global clustering over spaces of discrete probability measures, using Wasserstein distance coupled with Soft-DTW divergence. For multi-level hierarchies, we present a bottom-up procedure that progressively leverages lower-level information for higher-level clustering. Our final goal is to improve both the accuracy and speed of forecasts for a larger number of HTS needed for a real-world application. To attain this goal, each time series is first assigned the forecast for its cluster representative, which can be considered as a "shrinkage prior" for the set of time series it represents. Then this base forecast can be quickly fine-tuned to adjust to the specifics of that time series. We empirically show that our method substantially improves performance in terms of both speed and accuracy for large-scale forecasting tasks involving many HTS.
    A Multilabel Classification Framework for Approximate Nearest Neighbor Search. (arXiv:1910.08322v4 [cs.LG] UPDATED)
    Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.
    X-ViT: High Performance Linear Vision Transformer without Softmax. (arXiv:2205.13805v1 [cs.CV])
    Vision transformers have become one of the most important models for computer vision tasks. Although they outperform prior works, they require heavy computational resources on a scale that is quadratic to the number of tokens, $N$. This is a major drawback of the traditional self-attention (SA) algorithm. Here, we propose the X-ViT, ViT with a novel SA mechanism that has linear complexity. The main approach of this work is to eliminate nonlinearity from the original SA. We factorize the matrix multiplication of the SA mechanism without complicated linear approximation. By modifying only a few lines of code from the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks on most capacity regimes.
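    The complexity argument can be made concrete with a small sketch. Once the softmax nonlinearity is removed, attention reduces to a product of three matrices, and associativity lets the product be regrouped so that no N x N matrix is ever formed. The sketch shows only this reordering; whatever normalization X-ViT uses in place of softmax is not described in the abstract and is omitted here, and the function names are illustrative.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]


def quadratic_attention(Q, K, V):
    """(Q K^T) V: builds an N x N intermediate, so cost grows as N^2."""
    return matmul(matmul(Q, transpose(K)), V)


def linear_attention(Q, K, V):
    """Q (K^T V): builds a d x d intermediate, so cost grows linearly in N."""
    return matmul(Q, matmul(transpose(K), V))
```

    Both groupings return the same matrix; only the size of the intermediate result, and therefore the cost as the token count N grows, differs.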
    Bayesian Robust Graph Contrastive Learning. (arXiv:2205.14109v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely used to learn node representations, with outstanding performance on various tasks such as node classification. However, noise, which inevitably exists in real-world graph data, would considerably degrade the performance of GNNs as the noise is easily propagated via the graph structure. In this work, we propose a novel and robust method, Bayesian Robust Graph Contrastive Learning (BRGCL), which trains a GNN encoder to learn robust node representations. The BRGCL encoder is a completely unsupervised encoder. Two steps are iteratively executed at each epoch of training the BRGCL encoder: (1) estimating confident nodes and computing robust cluster prototypes of node representations through a novel Bayesian nonparametric method; (2) prototypical contrastive learning between the node representations and the robust cluster prototypes. Experiments on public and large-scale benchmarks demonstrate the superior performance of BRGCL and the robustness of the learned node representations. The code of BRGCL is available at https://github.com/BRGCL-code/BRGCL-code.
    Safe Value Functions. (arXiv:2105.12204v3 [eess.SY] UPDATED)
    Safety constraints and optimality are important, but sometimes conflicting criteria for controllers. Although these criteria are often solved separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions: value functions that are both optimal for a given task, and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but upper-unbounded: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal clear structure of how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics to design reward functions for control problems where safety is important.
    Approximating the Manifold Structure of Attributed Incentive Salience from Large Scale Behavioural Data. A Representation Learning Approach Based on Artificial Neural Networks. (arXiv:2108.01724v2 [cs.LG] UPDATED)
    Incentive salience attribution can be understood as a psychobiological mechanism ascribing relevance to potentially rewarding objects and actions. Despite being an important component of the motivational process guiding our everyday behaviour, its study in naturalistic contexts is not straightforward. Here we propose a methodology based on artificial neural networks (ANNs) for approximating latent states produced by this process in situations where large volumes of behavioural data are available but no experimental control is possible. Leveraging knowledge derived from theoretical and computational accounts of incentive salience attribution, we designed an ANN for estimating duration and intensity of future interactions between individuals and a series of video games in a large-scale ($N> 3 \times 10^6$) longitudinal dataset. We found video games to be the ideal context for developing such methodology due to their reliance on reward mechanics and their ability to provide ecologically robust behavioural measures at scale. When compared to competing approaches, our methodology produces representations that are better suited for predicting the intensity of future behaviour and approximating some functional properties of attributed incentive salience. We discuss our findings with reference to the adopted theoretical and computational frameworks and suggest how our methodology could be an initial step for estimating attributed incentive salience in large scale behavioural studies.
    A Model Predictive Control Functional Continuous Time Bayesian Network for Self-Management of Multiple Chronic Conditions. (arXiv:2205.13639v1 [cs.LG])
    Multiple chronic conditions (MCC) are one of the biggest challenges of modern times. The evolution of MCC follows a complex stochastic process that is influenced by a variety of risk factors, ranging from pre-existing conditions to modifiable lifestyle behavioral factors (e.g. diet, exercise habits, tobacco use, alcohol use, etc.) to non-modifiable socio-demographic factors (e.g., age, gender, education, marital status, etc.). People with MCC are at an increased risk of new chronic conditions and mortality. This paper proposes a model predictive control functional continuous time Bayesian network, an online recursive method to examine the impact of various lifestyle behavioral changes on the emergence trajectories of MCC and generate strategies to minimize the risk of progression of chronic conditions in individual patients. The proposed method is validated based on the Cameron county Hispanic cohort (CCHC) dataset, which has a total of 385 patients. The dataset examines the emergence of 5 chronic conditions (diabetes, obesity, cognitive impairment, hyperlipidemia, and hypertension) based on four modifiable risk factors representing lifestyle behaviors (diet, exercise habits, tobacco use, alcohol use) and four non-modifiable risk factors, including socio-demographic information (age, gender, education, marital status). The proposed method is tested under different scenarios (e.g., age group, the prior existence of MCC), demonstrating the effective intervention strategies for improving the lifestyle behavioral risk factors to offset MCC evolution.
    Deep Ensembles for Graphs with Higher-order Dependencies. (arXiv:2205.13988v1 [cs.LG])
    Graph neural networks (GNNs) continue to achieve state-of-the-art performance on many graph learning tasks, but rely on the assumption that a given graph is a sufficient approximation of the true neighborhood structure. In the presence of higher-order sequential dependencies, we show that the tendency of traditional graph representations to underfit each node's neighborhood causes existing GNNs to generalize poorly. To address this, we propose a novel Deep Graph Ensemble (DGE), which captures neighborhood variance by training an ensemble of GNNs on different neighborhood subspaces of the same node within a higher-order network structure. We show that DGE consistently outperforms existing GNNs on semisupervised and supervised tasks on four real-world data sets with known higher-order dependencies, even under a similar parameter budget. We demonstrate that learning diverse and accurate base classifiers is central to DGE's success, and discuss the implications of these findings for future work on GNNs.
    Anonymization for Skeleton Action Recognition. (arXiv:2111.15129v2 [cs.CV] UPDATED)
    Skeleton-based action recognition attracts practitioners and researchers due to the lightweight, compact nature of datasets. Compared with RGB-video-based action recognition, skeleton-based action recognition is a safer way to protect the privacy of subjects while having competitive recognition performance. However, due to improvements in skeleton estimation algorithms as well as motion- and depth-sensors, more details of motion characteristics can be preserved in the skeleton dataset, leading to potential privacy leakage. To investigate the potential privacy leakage from skeleton datasets, we first train a classifier to categorize sensitive private information from trajectories of joints. Our preliminary experiments show that the gender classifier achieves 87% accuracy on average and the re-identification task achieves 80% accuracy on average for three baseline models: Shift-GCN, MS-G3D, and 2s-AGCN. We propose an adversarial anonymization algorithm to protect potential privacy leakage from the skeleton dataset. Experimental results show that an anonymized dataset can reduce the risk of privacy leakage while having marginal effects on action recognition performance.
    ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling. (arXiv:2203.04510v2 [cs.LG] UPDATED)
    This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle data collection strategy that uses knowledge of the variance of the reward distributions. We then introduce the Reduced Variance Sampling (ReVar) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
    MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models. (arXiv:2205.13869v1 [cs.LG])
    State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm is unaware of the causal discovery step. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph prior as an inductive bias. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
    How Tempering Fixes Data Augmentation in Bayesian Neural Networks. (arXiv:2205.13900v1 [cs.LG])
    While Bayesian neural networks (BNNs) provide a sound and principled alternative to standard neural networks, an artificial sharpening of the posterior usually needs to be applied to reach comparable performance. This is in stark contrast to theory, dictating that given an adequate prior and a well-specified model, the untempered Bayesian posterior should achieve optimal performance. Despite the community's extensive efforts, the observed gains in performance still remain disputed with several plausible causes pointing at its origin. While data augmentation has been empirically recognized as one of the main drivers of this effect, a theoretical account of its role, on the other hand, is largely missing. In this work we identify two interlaced factors concurrently influencing the strength of the cold posterior effect, namely the correlated nature of augmentations and the degree of invariance of the employed model to such transformations. By theoretically analyzing simplified settings, we prove that tempering implicitly reduces the misspecification arising from modeling augmentations as i.i.d. data. The temperature mimics the role of the effective sample size, reflecting the gain in information provided by the augmentations. We corroborate our theoretical findings with extensive empirical evaluations, scaling to realistic BNNs. By relying on the framework of group convolutions, we experiment with models of varying inherent degree of invariance, confirming its hypothesized relationship with the optimal temperature.
    Towards Interpretable Natural Language Understanding with Explanations as Latent Variables. (arXiv:2011.05268v3 [cs.CL] UPDATED)
    Recently, generating natural language explanations has shown very promising results in not only offering interpretable explanations but also providing additional information and supervision for prediction. However, existing approaches usually require a large set of human annotated explanations for training, while collecting a large set of explanations is not only time consuming but also expensive. In this paper, we develop a general framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training. Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model. We develop a variational EM framework for optimization where an explanation generation module and an explanation-augmented prediction module are alternately optimized and mutually enhance each other. Moreover, we further propose an explanation-based self-training method under this framework for semi-supervised learning. It alternates between assigning pseudo-labels to unlabeled data and generating new explanations to iteratively improve each other. Experiments on two natural language understanding tasks demonstrate that our framework can not only make effective predictions in both supervised and semi-supervised settings, but also generate good natural language explanations.
    Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. (arXiv:2205.14141v1 [cs.CV])
    Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.
    AsyncFedED: Asynchronous Federated Learning with Euclidean Distance based Adaptive Weight Aggregation. (arXiv:2205.13797v1 [cs.LG])
    In an asynchronous federated learning framework, the server updates the global model once it receives an update from a client instead of waiting for all the updates to arrive as in the synchronous setting. This allows heterogeneous devices with varied computing power to train the local models without pausing, thereby speeding up the training process. However, it introduces the stale model problem, where the newly arrived update was calculated based on a set of stale weights that are older than the current global model, which may hurt the convergence of the model. In this paper, we present an asynchronous federated learning framework with a proposed adaptive weight aggregation algorithm, referred to as AsyncFedED. To the best of our knowledge this aggregation method is the first to take the staleness of the arrived gradients, measured by the Euclidean distance between the stale model and the current global model, and the number of local epochs that have been performed, into account. Assuming general non-convex loss functions, we prove the convergence of the proposed method theoretically. Numerical results validate the effectiveness of the proposed AsyncFedED in terms of the convergence rate and model accuracy compared to the existing methods for three considered tasks.
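    The staleness measure named in the abstract, the Euclidean distance between the stale model a client started from and the current global model, can be sketched as below. The abstract does not give the aggregation formula, so the down-weighting rule in `aggregate` (step size `base_rate / (1 + staleness)`) is a purely hypothetical stand-in chosen for illustration, not the AsyncFedED rule, and it ignores the number of local epochs that the paper also takes into account.

```python
import math


def staleness(stale_model, global_model):
    """Euclidean distance between the client's starting weights and the current global weights."""
    return math.sqrt(sum((g - s) ** 2 for g, s in zip(global_model, stale_model)))


def aggregate(global_model, stale_model, client_update, base_rate=1.0):
    """Apply a client update with a step size that shrinks as staleness grows.

    Hypothetical rule for illustration only: rate = base_rate / (1 + staleness).
    """
    rate = base_rate / (1.0 + staleness(stale_model, global_model))
    return [g + rate * u for g, u in zip(global_model, client_update)]
```

    Under this illustrative rule, an update computed from the current global model is applied at full strength, while the same update computed from a distant, older model moves the global weights less.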
    Raising the Bar in Graph-level Anomaly Detection. (arXiv:2205.13845v1 [cs.LG])
    Graph-level anomaly detection has become a critical topic in diverse areas, such as financial fraud detection and detecting anomalous activities in social networks. While most research has focused on anomaly detection for visual data such as images, where high detection accuracies have been obtained, existing deep learning approaches for graphs currently show considerably worse performance. This paper raises the bar on graph-level anomaly detection, i.e., the task of detecting abnormal graphs in a set of graphs. By drawing on ideas from self-supervised learning and transformation learning, we present a new deep learning approach that significantly improves existing deep one-class approaches by fixing some of their known problems, including hypersphere collapse and performance flip. Experiments on nine real-world data sets involving nine techniques reveal that our method achieves an average performance improvement of 11.8% AUC compared to the best existing approach.
    RecipeRec: A Heterogeneous Graph Learning Model for Recipe Recommendation. (arXiv:2205.14005v1 [cs.IR])
    Recipe recommendation systems play an essential role in helping people decide what to eat. Existing recipe recommendation systems typically focused on content-based or collaborative filtering approaches, ignoring the higher-order collaborative signal such as relational structure information among users, recipes and food items. In this paper, we formalize the problem of recipe recommendation with graphs to incorporate the collaborative signal into recipe recommendation through graph modeling. In particular, we first present URI-Graph, a new and large-scale user-recipe-ingredient graph. We then propose RecipeRec, a novel heterogeneous graph learning model for recipe recommendation. The proposed model can capture recipe content and collaborative signal through a heterogeneous graph neural network with hierarchical attention and an ingredient set transformer. We also introduce a graph contrastive augmentation strategy to extract informative graph knowledge in a self-supervised manner. Finally, we design a joint objective function of recommendation and contrastive learning to optimize the model. Extensive experiments demonstrate that RecipeRec outperforms state-of-the-art methods for recipe recommendation. Dataset and codes are available at https://github.com/meettyj/RecipeRec.
    Maximum Likelihood Training of Implicit Nonlinear Diffusion Models. (arXiv:2205.13699v1 [cs.LG])
    Whereas diverse variations of diffusion models exist, only a few works have investigated extending the linear diffusion into a nonlinear diffusion process. The effect of nonlinearity is poorly understood, but intuitively, there should be more promising diffusion patterns for optimally training the generative distribution towards the data distribution. This paper introduces such a data-adaptive and nonlinear diffusion process for score-based diffusion models. The proposed Implicit Nonlinear Diffusion Model (INDM) learns the nonlinear diffusion process by combining a normalizing flow and a diffusion process. Specifically, INDM implicitly constructs a nonlinear diffusion on the data space by leveraging a linear diffusion on the latent space through a flow network. This flow network is the key to forming a nonlinear diffusion, as the nonlinearity depends entirely on the flow network. This flexible nonlinearity is what improves the learning curve of INDM to nearly MLE training, compared against the non-MLE training of DDPM++, which turns out to be a special case of INDM with the identity flow. Also, training the nonlinear diffusion empirically yields a sampling-friendly latent diffusion, in that the sample trajectory of INDM is closer to an optimal transport than the trajectories of previous research. In experiments, INDM achieves the state-of-the-art FID on CelebA.
    CIGMO: Categorical invariant representations in a deep generative framework. (arXiv:2205.13758v1 [cs.CV])
    Image data of general objects commonly exhibit two structures: (1) each object of a given shape can be rendered in multiple different views, and (2) shapes of objects can be categorized in such a way that the diversity of shapes is much larger across categories than within a category. Existing deep generative models can typically capture either structure, but not both. In this work, we introduce a novel deep generative model, called CIGMO, that can learn to represent category, shape, and view factors from image data. The model comprises multiple modules of shape representations that are each specialized to a particular category and disentangled from view representation, and can be learned using a group-based weakly supervised learning method. By empirical investigation, we show that our model can effectively discover categories of object shapes despite large view variation and quantitatively outperform various previous methods, including the state-of-the-art invariant clustering algorithm. Further, we show that our approach using category-specialization can enhance the learned shape representation to better perform downstream tasks such as one-shot object identification as well as shape-view disentanglement.
    Generative Flows as a General Purpose Solution for Inverse Problems. (arXiv:2110.13285v3 [cs.CV] UPDATED)
    Due to their success in modeling data distributions, generative flows have been explored in inverse problems. Given a pre-trained generative flow, previous work proposed to minimize the 2-norm of the latent variables as a regularization term. The intuition behind it was to ensure high-likelihood latent variables that produce the closest restoration. However, high-likelihood latent variables may generate unrealistic samples, as we show in our experiments. We therefore propose a solver to directly produce high-likelihood reconstructions. We hypothesize that our approach could make generative flows a general purpose solver for inverse problems. Furthermore, we propose 1 x 1 coupling functions to introduce permutations in a generative flow. This has the advantage that the inverse does not need to be computed in the generation process. Finally, we evaluate our method on denoising, deblurring, inpainting, and colorization. We observe a compelling improvement of our method over prior works.
    Deep Coding Patterns Design for Compressive Near-Infrared Spectral Classification. (arXiv:2205.14069v1 [cs.LG])
    Compressive spectral imaging (CSI) has emerged as an attractive compression and sensing technique, primarily to sense spectral regions where traditional systems are highly costly, such as the near-infrared spectrum. Recently, it has been shown that spectral classification can be performed directly in the compressive domain, considering the amount of spectral information embedded in the measurements, skipping the reconstruction step. Consequently, the classification quality directly depends on the set of coding patterns employed in the sensing step. Therefore, this work proposes an end-to-end approach to jointly design the coding patterns used in CSI and the network parameters to perform spectral classification directly from the embedded near-infrared compressive measurements. Extensive simulation on the three-dimensional coded aperture snapshot spectral imaging (3D-CASSI) system validates that the proposed design outperforms traditional and random designs by up to 10% in classification accuracy.
    Sharpness-Aware Training for Free. (arXiv:2205.14083v1 [cs.LG])
    Modern deep neural networks (DNNs) have achieved state-of-the-art performances but are typically over-parameterized. The over-parameterization may result in undesirably large generalization error in the absence of other customized training strategies. Recently, a line of research under the name of Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error. However, SAM-like methods incur a two-fold computational overhead of the given base optimizer (e.g. SGD) for approximating the sharpness measure. In this paper, we propose Sharpness-Aware Training for Free, or SAF, which mitigates the sharp landscape at almost zero additional computational cost over the base optimizer. Intuitively, SAF achieves this by avoiding sudden drops in the loss in the sharp local minima throughout the trajectory of the updates of the weights. Specifically, we suggest a novel trajectory loss, based on the KL-divergence between the outputs of DNNs with the current weights and past weights, as a replacement of the SAM's sharpness measure. This loss captures the rate of change of the training loss along the model's update trajectory. By minimizing it, SAF ensures the convergence to a flat minimum with improved generalization capabilities. Extensive empirical results show that SAF minimizes the sharpness in the same way that SAM does, yielding better results on the ImageNet dataset with essentially the same computational cost as the base optimizer.
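    The trajectory loss described above compares the network's outputs under current and past weights with a KL-divergence. A minimal sketch of that comparison follows, assuming the outputs are logits turned into probabilities with a softmax. The direction of the divergence (past outputs as the reference) and the function names are assumptions, not taken from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def trajectory_loss(current_logits, past_logits):
    """KL between outputs under past weights and outputs under current weights.

    Assumption: the past outputs serve as the reference distribution.
    """
    return kl_divergence(softmax(past_logits), softmax(current_logits))
```

    The value is zero when the outputs have not changed between the two checkpoints and grows as they diverge.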
    Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms. (arXiv:2107.03863v3 [stat.ML] UPDATED)
    Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure for such models is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel Snakemake workflow, called Benchpress, for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BDgraph, BiDAG, bnlearn, gCastle, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the workflow also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking study. We demonstrate the applicability of this workflow for learning Bayesian networks in five typical data scenarios. The source code and documentation are publicly available from this http URL
    Text-Based Automatic Personality Prediction Using KGrAt-Net; A Knowledge Graph Attention Network Classifier. (arXiv:2205.13780v1 [cs.CL])
    Nowadays, a tremendous amount of human communication takes place on Internet-based communication infrastructures, such as social networks, email, forums, and organizational communication platforms. Indeed, the automatic prediction or assessment of individuals' personalities through their written or exchanged text would be advantageous to ameliorate the relationships among them. To this end, this paper proposes KGrAt-Net, a Knowledge Graph Attention Network text classifier. For the first time, it applies the knowledge graph attention network to perform Automatic Personality Prediction (APP), according to the Big Five personality traits. After some preprocessing, it first tries to acquire a meaningful representation of the knowledge behind the concepts in the input text by building its equivalent knowledge graph. A knowledge graph is a graph-based data model that formally represents the semantics of the existing concepts in the input text and models the knowledge behind them. Then, applying the attention mechanism, it attempts to attend to the most relevant parts of the graph to predict the personality traits of the input text. The results demonstrated that KGrAt-Net considerably improved the personality prediction accuracies. Furthermore, KGrAt-Net also uses the knowledge graphs' embeddings to enrich the classification, which makes it even more accurate in APP.
    Client Selection in Nonconvex Federated Learning: Improved Convergence Analysis for Optimal Unbiased Sampling Strategy. (arXiv:2205.13925v1 [cs.LG])
    Federated learning (FL) is a distributed machine learning paradigm that selects a subset of clients to participate in training to reduce communication burdens. However, partial client participation in FL causes objective inconsistency, which can hinder convergence, yet this objective inconsistency has not been analyzed in existing studies on sampling methods. To tackle this issue, we propose an improved analysis method that focuses on the convergence behavior of the objective of the clients that actually participate. Moreover, based on our convergence analysis, we give a novel unbiased sampling strategy, i.e., FedSRC-D, whose sampling probability is proportional to the client's gradient diversity and local variance. FedSRC-D is provably the optimal unbiased sampling strategy in non-convex settings for non-IID FL with respect to the given bounds. Specifically, FedSRC-D achieves $\mathop{O}(\frac{G^2}{\epsilon^2}+\frac{1}{\epsilon^{2/3}})$ higher than SOTA convergence rate of FedAvg, and $\mathop{O}(\frac{G^2}{\epsilon^2})$ higher than other unbiased sampling methods. We corroborate our results with experiments on both synthetic and real data sets.
    Guided Exploration of Data Summaries. (arXiv:2205.13956v1 [cs.LG])
    Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed following a one-shot process with the purpose of finding the best summary. A useful summary contains k individually uniform sets that are collectively diverse to be representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such a summary is a difficult task when data is highly diverse and large. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize Eda4Sum, the problem of guided exploration of data summaries that seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. Eda4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum, which chooses the most useful summary at each step; (ii) RLSum, which trains a policy with Deep Reinforcement Learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.
    Contrastive Siamese Network for Semi-supervised Speech Recognition. (arXiv:2205.14054v1 [cs.LG])
    This paper introduces contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching outputs of two identical transformer encoders. It contains augmented and target branches which are trained by: (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, (4) introducing new temporal augment functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam and other best-performing systems. Our experiments show that c-siam provides 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves competitive results compared to the state-of-the-art networks with 600M parameters.
    Generalizing Brain Decoding Across Subjects with Deep Learning. (arXiv:2205.14102v1 [cs.LG])
    Decoding experimental variables from brain imaging data is gaining popularity, with applications in brain-computer interfaces and the study of neural representations. Decoding is typically subject-specific and does not generalise well over subjects. Here, we investigate ways to achieve cross-subject decoding. We used magnetoencephalography (MEG) data where 15 subjects viewed 118 different images, with 30 examples per image. Training on the entire 1s window following the presentation of each image, we experimented with an adaptation of the WaveNet architecture for classification. We also investigated the use of subject embedding to aid learning of subject variability in the group model. We show that deep learning and subject embedding are crucial to closing the performance gap between subject and group-level models. Importantly, group models outperform subject models when tested on an unseen subject with little available data. The potential of such group modelling is even higher with bigger datasets. Furthermore, we demonstrate the use of permutation feature importance to gain insight into the spatio-temporal and spectral information encoded in the models, enabling better physiological interpretation. All experimental code is available at https://github.com/ricsinaruto/MEG-group-decode.
    Finite mixture of skewed sub-Gaussian stable distributions. (arXiv:2205.14067v1 [stat.ME])
    We propose the finite mixture of skewed sub-Gaussian stable distributions. The maximum likelihood estimator for the parameters of the proposed finite mixture model is computed through the expectation-maximization algorithm. The proposed model contains the finite mixture of normal and skewed normal distributions. Since the tails of the proposed model are heavier than even those of the Student's t distribution, it can be used as a powerful model for robust model-based clustering. Performance of the proposed model is demonstrated by clustering simulation data and two sets of real data.
    Capturing Graphs with Hypo-Elliptic Diffusions. (arXiv:2205.14092v1 [cs.LG])
    Convolutional layers within graph neural networks operate by aggregating information about local neighbourhood structures; one common way to encode such substructures is through random walks. The distribution of these random walks evolves according to a diffusion equation defined using the graph Laplacian. We extend this approach by leveraging classic mathematical results about hypo-elliptic diffusions. This results in a novel tensor-valued graph operator, which we call the hypo-elliptic graph Laplacian. We provide theoretical guarantees and efficient low-rank approximation algorithms. In particular, this gives a structured approach to capture long-range dependencies on graphs that is robust to pooling. Besides the attractive theoretical properties, our experiments show that this method competes with graph transformers on datasets requiring long-range reasoning but scales only linearly in the number of edges as opposed to quadratically in nodes.
    Towards Handling Uncertainty-at-Source in AI -- A Review and Next Steps for Interval Regression. (arXiv:2104.07245v2 [cs.LG] UPDATED)
    Most of statistics and AI draw insights through modelling discord or variance between sources of information (i.e., inter-source uncertainty). Increasingly, however, research is focusing upon uncertainty arising at the level of individual measurements (i.e., within- or intra-source), such as for a given sensor output or human response. Here, adopting intervals rather than numbers as the fundamental data-type provides an efficient, powerful, yet challenging way forward -- offering systematic capture of uncertainty-at-source, increasing informational capacity, and ultimately potential for insight. Following recent progress in the capture of interval-valued data, including from human participants, conducting machine learning directly upon intervals is a crucial next step. This paper focuses on linear regression for interval-valued data as a recent growth area, providing an essential foundation for broader use of intervals in AI. We conduct an in-depth analysis of state-of-the-art methods, elucidating their behaviour, advantages, and pitfalls when applied to datasets with different properties. Specific emphasis is given to the challenge of preserving mathematical coherence -- i.e., ensuring that models maintain fundamental mathematical properties of intervals throughout -- and the paper puts forward extensions to an existing approach to guarantee this. Carefully designed experiments, using both synthetic and real-world data, are conducted -- with findings presented alongside novel visualizations for interval-valued regression outputs, designed to maximise model interpretability. Finally, the paper makes recommendations concerning method suitability for data sets with specific properties and highlights remaining challenges and important next steps for developing AI with the capacity to handle uncertainty-at-source.
    Dual Convexified Convolutional Neural Networks. (arXiv:2205.14056v1 [cs.LG])
    We propose the framework of dual convexified convolutional neural networks (DCCNNs). In this framework, we first introduce a primal learning problem motivated from convexified convolutional neural networks (CCNNs), and then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates. Our approach reduces the memory overhead of constructing a large kernel matrix and eliminates the ambiguity of factorizing the matrix. Due to the low-rank structure in CCNNs and the related subdifferential of nuclear norms, there is no closed-form expression to recover the primal solution from the dual solution. To overcome this, we propose a highly novel weight recovery algorithm, which takes the dual solution and the kernel information as the input, and recovers the linear and convolutional weights of a CCNN. Furthermore, our recovery algorithm exploits the low-rank structure and imposes a small number of filters indirectly, which reduces the parameter size. As a result, DCCNNs inherit all the statistical benefits of CCNNs, while enjoying a more formal and efficient workflow.
    Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers. (arXiv:2205.13838v1 [cs.LG])
    Random Forests (RFs) are widely used Machine Learning models in low-power embedded devices, due to their hardware friendly operation and high accuracy on practically relevant tasks. The accuracy of a RF often increases with the number of internal weak learners (decision trees), but at the cost of a proportional increase in inference latency and energy consumption. Such costs can be mitigated considering that, in most applications, inputs are not all equally difficult to classify. Therefore, a large RF is often necessary only for (few) hard inputs, and wasteful for easier ones. In this work, we propose an early-stopping mechanism for RFs, which terminates the inference as soon as a high-enough classification confidence is reached, reducing the number of weak learners executed for easy inputs. The early-stopping confidence threshold can be controlled at runtime, in order to favor either energy saving or accuracy. We apply our method to three different embedded classification tasks, on a single-core RISC-V microcontroller, achieving an energy reduction from 38% to more than 90% with a drop of less than 0.5% in accuracy. We also show that our approach outperforms previous adaptive ML methods for RFs.
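    The early-stopping idea described above can be sketched as follows, assuming each tree is a callable that returns a list of class probabilities. The confidence measure used here (the largest averaged class probability) and the names `early_stop_predict` and `min_trees` are assumptions for illustration; the abstract does not specify how confidence is computed.

```python
def early_stop_predict(trees, x, threshold, min_trees=1):
    """Evaluate trees one at a time and stop once confidence is high enough.

    Returns (predicted_label, number_of_trees_evaluated).
    """
    if not trees:
        raise ValueError("at least one tree is required")
    totals = None
    used = 0
    averaged = []
    for tree in trees:
        probs = tree(x)
        if totals is None:
            totals = [0.0] * len(probs)
        totals = [t + p for t, p in zip(totals, probs)]
        used += 1
        averaged = [t / used for t in totals]
        # Assumption: confidence is the largest averaged class probability.
        if used >= min_trees and max(averaged) >= threshold:
            break
    label = max(range(len(averaged)), key=averaged.__getitem__)
    return label, used
```

    Raising `threshold` at runtime favors accuracy (more trees are evaluated), while lowering it favors energy saving, which mirrors the trade-off the abstract describes.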
    DRLComplex: Reconstruction of protein quaternary structures using deep reinforcement learning. (arXiv:2205.13594v1 [cs.LG])
    Predicted inter-chain residue-residue contacts can be used to build the quaternary structure of protein complexes from scratch. However, only a small number of methods have been developed to reconstruct protein quaternary structures using predicted inter-chain contacts. Here, we present an agent-based self-learning method based on deep reinforcement learning (DRLComplex) to build protein complex structures using inter-chain contacts as distance constraints. We rigorously tested DRLComplex on two standard datasets of homodimeric and heterodimeric protein complexes (i.e., the CASP-CAPRI homodimer and Std_32 heterodimer datasets) using both true and predicted inter-chain contacts as inputs. Utilizing true contacts as input, DRLComplex achieved high average TM-scores of 0.9895 and 0.9881 and a low average interface RMSD (I_RMSD) of 0.2197 and 0.92 on the two datasets, respectively. When predicted contacts are used, the method achieves TM-scores of 0.73 and 0.76 for homodimers and heterodimers, respectively. Our experiments find that the accuracy of reconstructed quaternary structures depends on the accuracy of the contact predictions. Compared to other optimization methods for reconstructing quaternary structures from inter-chain contacts, DRLComplex performs similarly to an advanced gradient descent method and better than a Markov Chain Monte Carlo simulation method and a simulated annealing-based method, validating the effectiveness of DRLComplex for quaternary reconstruction of protein complexes.
    BagFlip: A Certified Defense against Data Poisoning. (arXiv:2205.13634v1 [cs.LG])
    Machine learning models are vulnerable to data-poisoning attacks, in which an attacker maliciously modifies the training set to change the prediction of a learned model. In a trigger-less attack, the attacker can modify the training set but not the test inputs, while in a backdoor attack the attacker can also modify test inputs. Existing model-agnostic defense approaches either cannot handle backdoor attacks or do not provide effective certificates (i.e., a proof of a defense). We present BagFlip, a model-agnostic certified approach that can effectively defend against both trigger-less and backdoor attacks. We evaluate BagFlip on image classification and malware detection datasets. BagFlip is equal to or more effective than the state-of-the-art approaches for trigger-less attacks and more effective than the state-of-the-art approaches for backdoor attacks.
    Combining observational datasets from multiple environments to detect hidden confounding. (arXiv:2205.13935v1 [stat.ME])
    A common assumption in causal inference from observational data is the assumption of no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. However, under the assumption of independent causal mechanisms underlying the data generative process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only violated during hidden confounding and examine cases where we break its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies.
    Deep Learning Fetal Ultrasound Video Model Match Human Observers in Biometric Measurements. (arXiv:2205.13835v1 [eess.IV])
    Objective. This work investigates the use of deep convolutional neural networks (CNN) to automatically perform measurements of fetal body parts, including head circumference, biparietal diameter, abdominal circumference and femur length, and to estimate gestational age and fetal weight using fetal ultrasound videos. Approach. We developed a novel multi-task CNN-based spatio-temporal fetal US feature extraction and standard plane detection algorithm (called FUVAI) and evaluated the method on 50 freehand fetal US video scans. We compared FUVAI fetal biometric measurements with measurements made by five experienced sonographers at two time points separated by at least two weeks. Intra- and inter-observer variabilities were estimated. Main results. We found that automated fetal biometric measurements obtained by FUVAI were comparable to the measurements performed by experienced sonographers. The observed differences in measurement values were within the range of inter- and intra-observer variability. Moreover, analysis has shown that these differences were not statistically significant when comparing any individual medical expert to our model. Significance. We argue that FUVAI has the potential to assist sonographers who perform fetal biometric measurements in clinical settings by providing them with suggestions regarding the best measuring frames, along with automated measurements. Moreover, FUVAI is able to perform these tasks in just a few seconds, compared to the average of six minutes taken by sonographers. This is significant, given the shortage of medical experts capable of interpreting fetal ultrasound images in numerous countries.
    FNet: Mixing Tokens with Fourier Transforms. (arXiv:2105.03824v4 [cs.CL] UPDATED)
    We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
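    To make the unparameterized Fourier mixing concrete, here is a small pure-Python sketch. It assumes the mixing applies a discrete Fourier transform along the hidden dimension and then along the sequence dimension and keeps only the real part; the abstract itself only says "a standard, unparameterized Fourier Transform", so those details, along with the names `dft` and `fourier_mix`, are assumptions. A real implementation would use an FFT rather than this quadratic-time DFT.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1-D sequence (O(n^2))."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fourier_mix(x):
    """Mix tokens with a 2-D DFT and keep the real part.

    x is a list of seq_len rows, each a list of hidden_dim floats.
    The layer has no learnable parameters.
    """
    seq_len, hidden_dim = len(x), len(x[0])
    # Transform along the hidden dimension of every token.
    rows = [dft(row) for row in x]
    # Transform along the sequence dimension for every hidden unit.
    cols = [dft([rows[s][h] for s in range(seq_len)]) for h in range(hidden_dim)]
    return [[cols[h][s].real for h in range(hidden_dim)] for s in range(seq_len)]
```

    Every output position depends on every input token, which is how the transform "mixes" tokens in place of self-attention.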
    Learning to Solve Combinatorial Graph Partitioning Problems via Efficient Exploration. (arXiv:2205.14105v1 [cs.LG])
    From logistics to the natural sciences, combinatorial optimisation on graphs underpins numerous real-world applications. Reinforcement learning (RL) has shown particular promise in this setting as it can adapt to specific problem structures and does not require pre-solved instances for these, often NP-hard, problems. However, state-of-the-art (SOTA) approaches typically suffer from severe scalability issues, primarily due to their reliance on expensive graph neural networks (GNNs) at each decision step. We introduce ECORD, a novel RL algorithm that alleviates this expense by restricting the GNN to a single pre-processing step, before entering a fast-acting exploratory phase directed by a recurrent unit. Experimentally, ECORD achieves a new SOTA for RL algorithms on the Maximum Cut problem, whilst also providing orders of magnitude improvement in speed and scalability. Compared to the nearest competitor, ECORD reduces the optimality gap by up to 73% on 500 vertex graphs with a decreased wall-clock time. Moreover, ECORD retains strong performance when generalising to larger graphs with up to 10000 vertices.
    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. (arXiv:2205.14135v1 [cs.LG])
    Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
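    The tiling idea can be illustrated in plain Python: keys and values are processed block by block, and a running maximum and running normalizer are rescaled so that the result equals ordinary softmax attention. The rescaling scheme (often called online softmax) is an assumption about how exactness is preserved, since the abstract only mentions tiling, and this sketch does not model GPU HBM or SRAM at all. The names `naive_attention` and `tiled_attention` are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(Q, K, V):
    """Reference softmax attention that materializes every score."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [_dot(q, k) * scale for k in K]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) / norm
             for c in range(len(V[0]))]
        )
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Exact attention computed one key/value block at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(V[0])
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [_dot(q, k) * scale for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new running maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * v[c] for w, v in zip(weights, v_blk))
                for c, a in enumerate(acc)
            ]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

    Only one block of scores exists at any time, so the full attention matrix is never stored, yet the output matches the reference up to floating-point rounding.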
    Spatio-Temporal Graph Few-Shot Learning with Cross-City Knowledge Transfer. (arXiv:2205.13947v1 [cs.LG])
    Spatio-temporal graph learning is a key method for urban computing tasks, such as traffic flow, taxi demand and air quality forecasting. Due to the high cost of data collection, some developing cities have little available data, which makes it infeasible to train a well-performing model. To address this challenge, cross-city knowledge transfer has shown promise: a model learned from data-sufficient cities is leveraged to benefit the learning process of data-scarce cities. However, the spatio-temporal graphs of different cities show irregular structures and varied features, which limits the feasibility of existing Few-Shot Learning (FSL) methods. Therefore, we propose a model-agnostic few-shot learning framework for spatio-temporal graphs called ST-GFSL. Specifically, to enhance feature extraction by transferring cross-city knowledge, ST-GFSL proposes to generate non-shared parameters based on node-level meta knowledge. The nodes in the target city transfer the knowledge via parameter matching, retrieving from similar spatio-temporal characteristics. Furthermore, we propose to reconstruct the graph structure during meta-learning. The graph reconstruction loss is defined to guide structure-aware learning, avoiding structure deviation among different datasets. We conduct comprehensive experiments on four traffic speed prediction benchmarks and the results demonstrate the effectiveness of ST-GFSL compared with state-of-the-art methods.
    Fast Causal Orientation Learning in Directed Acyclic Graphs. (arXiv:2205.13919v1 [cs.LG])
    Causal relationships among a set of variables are commonly represented by a directed acyclic graph. The orientations of some edges in the causal DAG can be discovered from observational/interventional data. Further edges can be oriented by iteratively applying so-called Meek rules. Inferring edges' orientations from some previously oriented edges, which we call Causal Orientation Learning (COL), is a common problem in various causal discovery tasks. In these tasks, it is often required to solve multiple COL problems and therefore applying Meek rules could be time-consuming. Motivated by Meek rules, we introduce Meek functions that can be utilized in solving COL problems. In particular, we show that these functions have some desirable properties, enabling us to speed up the process of applying Meek rules. In particular, we propose a dynamic programming (DP) based method to apply Meek functions. Moreover, based on the proposed DP method, we present a lower bound on the number of edges that can be oriented as a result of intervention. We also propose a method to check whether some oriented edges belong to a causal DAG. Experimental results show that the proposed methods can outperform previous work in several causal discovery tasks in terms of running-time.
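    As a small illustration of what "iteratively applying Meek rules" means, the sketch below applies only the first Meek rule (if a -> b, b - c, and a and c are not adjacent, orient b -> c) until nothing changes. It is a naive fixed-point loop, not the Meek functions or the dynamic-programming method proposed in the paper; the function name and the edge representation are assumptions made for the example.

```python
def apply_meek_rule_1(directed, undirected):
    """Repeatedly orient b - c as b -> c whenever a -> b exists and a, c are non-adjacent.

    directed:   iterable of (tail, head) pairs
    undirected: iterable of 2-element pairs (order irrelevant)
    Returns (set of directed pairs, set of frozenset undirected edges).
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(u, v):
        return ((u, v) in directed or (v, u) in directed
                or frozenset((u, v)) in undirected)

    changed = True
    while changed:
        changed = False
        for (a, b) in list(directed):
            for e in list(undirected):
                if b not in e:
                    continue
                (c,) = e - {b}           # the other endpoint of b - c
                if c != a and not adjacent(a, c):
                    undirected.discard(e)
                    directed.add((b, c))
                    changed = True
    return directed, undirected
```

    On the chain a -> b, b - c, c - d this orients both remaining edges, while adding an edge a - c blocks the rule for b - c.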
    Group-invariant max filtering. (arXiv:2205.14039v1 [cs.IT])
    Given a real inner product space $V$ and a group $G$ of linear isometries, we construct a family of $G$-invariant real-valued functions on $V$ that we call max filters. In the case where $V=\mathbb{R}^d$ and $G$ is finite, a suitable max filter bank separates orbits, and is even bilipschitz in the quotient metric. In the case where $V=L^2(\mathbb{R}^d)$ and $G$ is the group of translation operators, a max filter exhibits stability to diffeomorphic distortion like that of the scattering transform introduced by Mallat. We establish that max filters are well suited for various classification tasks, both in theory and in practice.
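    The abstract does not spell out the construction. A minimal sketch follows, assuming a max filter with template $z$ maps $x$ to the maximum of $\langle gx, z\rangle$ over $g \in G$, and taking $G$ to be the cyclic shifts of $\mathbb{R}^d$ as a concrete finite group of isometries. Both the choice of group and the function names are assumptions made for illustration.

```python
def cyclic_shifts(x):
    """All cyclic shifts of the list x: the orbit of x under the cyclic group."""
    return [x[k:] + x[:k] for k in range(len(x))]


def max_filter(x, template):
    """Maximum inner product between the template and any shift of x."""
    return max(sum(a * b for a, b in zip(gx, template))
               for gx in cyclic_shifts(x))


x = [1, 2, 3, 4]
z = [1, 0, 0, -1]
# Shifting the input does not change the value, so the filter is shift-invariant.
values = [max_filter(gx, z) for gx in cyclic_shifts(x)]
```

    Every entry of `values` is the same number, which is the invariance property the abstract refers to; a bank of several templates would be needed to separate orbits.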
    Benign Overparameterization in Membership Inference with Early Stopping. (arXiv:2205.14055v1 [cs.LG])
    Does a neural network's privacy have to be at odds with its accuracy? In this work, we study the effects the number of training epochs and parameters have on a neural network's vulnerability to membership inference (MI) attacks, which aim to extract potentially private information about the training data. We first demonstrate how the number of training epochs and parameters individually induce a privacy-utility trade-off: more of either improves generalization performance at the expense of lower privacy. However, remarkably, we also show that jointly tuning both can eliminate this privacy-utility trade-off. Specifically, with careful tuning of the number of training epochs, more overparameterization can increase model privacy for fixed generalization error. To better understand these phenomena theoretically, we develop a powerful new leave-one-out analysis tool to study the asymptotic behavior of linear classifiers and apply it to characterize the sample-specific loss threshold MI attack in high-dimensional logistic regression. For practitioners, we introduce a low-overhead procedure to estimate MI risk and tune the number of training epochs to guard against MI attacks.
    Global Convergence of Over-parameterized Deep Equilibrium Models. (arXiv:2205.13814v1 [cs.LG])
    A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection. Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation. The training dynamics of over-parameterized DEQs are investigated in this study. By supposing a condition on the initial equilibrium point, we show that the unique equilibrium point always exists during the training process, and the gradient descent is proved to converge to a globally optimal solution at a linear convergence rate for the quadratic loss function. In order to show that the required initial condition is satisfied via mild over-parameterization, we perform a fine-grained analysis on random DEQs. We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.
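    To make the notion of an equilibrium point concrete, here is a minimal sketch of the forward pass only. It assumes the weight-tied layer has the form $z \mapsto \tanh(Wz + Ux)$ and uses plain fixed-point iteration in place of a proper root-finder; the layer form, the weights, and the function names are illustrative assumptions, and implicit differentiation is not shown.

```python
import math


def layer(W, U, x, z):
    """One application of the weight-tied layer with input injection: tanh(W z + U x)."""
    return [math.tanh(sum(wij * zj for wij, zj in zip(W_row, z))
                      + sum(uij * xj for uij, xj in zip(U_row, x)))
            for W_row, U_row in zip(W, U)]


def deq_forward(W, U, x, tol=1e-12, max_iter=1000):
    """Iterate z <- layer(z) from zero until successive iterates differ by less than tol."""
    z = [0.0] * len(W)
    for _ in range(max_iter):
        z_next = layer(W, U, x, z)
        if max(abs(a - b) for a, b in zip(z, z_next)) < tol:
            return z_next
        z = z_next
    return z


# Small weights make the layer a contraction, so the iteration converges.
W = [[0.2, -0.1], [0.1, 0.3]]
U = [[1.0, 0.0], [0.5, -0.5]]
x = [1.0, 2.0]
z_star = deq_forward(W, U, x)
```

    Here `z_star` approximately satisfies `z_star == layer(W, U, x, z_star)`, which is what replaces an explicit stack of layers.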
    Emergent organization of receptive fields in networks of excitatory and inhibitory neurons. (arXiv:2205.13614v1 [q-bio.NC])
    Local patterns of excitation and inhibition that can generate neural waves are studied as a computational mechanism underlying the organization of neuronal tunings. Sparse coding algorithms based on networks of excitatory and inhibitory neurons are proposed that exhibit topographic maps as the receptive fields are adapted to input stimuli. Motivated by a leaky integrate-and-fire model of neural waves, we propose an activation model that is more typical of artificial neural networks. Computational experiments with the activation model using both natural images and natural language text are presented. In the case of images, familiar "pinwheel" patterns of oriented edge detectors emerge; in the case of text, the resulting topographic maps exhibit a 2-dimensional representation of granular word semantics. Experiments with a synthetic model of somatosensory input are used to investigate how the network dynamics may affect plasticity of neuronal maps under changes to the inputs.
    On the Convergence of Semi-Relaxed Sinkhorn with Marginal Constraint and OT Distance Gaps. (arXiv:2205.13846v1 [cs.LG])
    This paper presents consideration of the Semi-Relaxed Sinkhorn (SR-Sinkhorn) algorithm for the semi-relaxed optimal transport (SROT) problem, which relaxes one marginal constraint of the standard OT problem. For evaluation of how the constraint relaxation affects the algorithm behavior and solution, it is vitally necessary to present the theoretical convergence analysis in terms not only of the functional value gap, but also of the marginal constraint gap as well as the OT distance gap. However, no existing work has addressed all analyses simultaneously. To this end, this paper presents a comprehensive convergence analysis for SR-Sinkhorn. After presenting the $\epsilon$-approximation of the functional value gap based on a new proof strategy and exploiting this proof strategy, we give the upper bound of the marginal constraint gap. We also provide its convergence to the $\epsilon$-approximation when two distributions are in the probability simplex. Furthermore, the convergence analysis of the OT distance gap to the $\epsilon$-approximation is given as assisted by the obtained marginal constraint gap. The latter two theoretical results are the first results presented in the literature related to the SROT problem.
    Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces. (arXiv:2205.14027v1 [cs.LG])
    We study a class of dynamical systems modelled as Markov chains that admit an invariant distribution via the corresponding transfer, or Koopman, operator. While data-driven algorithms to reconstruct such operators are well known, their relationship with statistical learning is largely unexplored. We formalize a framework to learn the Koopman operator from finite data trajectories of the dynamical system. We consider the restriction of this operator to a reproducing kernel Hilbert space and introduce a notion of risk, from which different estimators naturally arise. We link the risk with the estimation of the spectral decomposition of the Koopman operator. These observations motivate a reduced-rank operator regression (RRR) estimator. We derive learning bounds for the proposed estimator, holding both in i.i.d. and non i.i.d. settings, the latter in terms of mixing coefficients. Our results suggest RRR might be beneficial over other widely used estimators as confirmed in numerical experiments both for forecasting and mode decomposition.
    Improving Bidding and Playing Strategies in the Trick-Taking game Wizard using Deep Q-Networks. (arXiv:2205.13834v1 [cs.LG])
    In this work, the trick-taking game Wizard with a separate bidding and playing phase is modeled by two interleaved partially observable Markov decision processes (POMDP). Deep Q-Networks (DQN) are used to empower self-improving agents, which are capable of tackling the challenges of a highly non-stationary environment. To compare algorithms between each other, the accuracy between bid and trick count is monitored, which strongly correlates with the actual rewards and provides a well-defined upper and lower performance bound. The trained DQN agents achieve accuracies between 66% and 87% in self-play, leaving behind both a random baseline and a rule-based heuristic. The conducted analysis also reveals a strong information asymmetry concerning player positions during bidding. To overcome the missing Markov property of imperfect-information games, a long short-term memory (LSTM) network is implemented to integrate historic information into the decision-making process. Additionally, a forward-directed tree search is conducted by sampling a state of the environment and thereby turning the game into a perfect information setting. To our surprise, both approaches do not surpass the performance of the basic DQN agent.
    Probabilistic Systems with Hidden State and Unobservable Transitions. (arXiv:2205.13871v1 [cs.LG])
    We consider probabilistic systems with hidden state and unobservable transitions, an extension of Hidden Markov Models (HMMs) that in particular admits unobservable $\epsilon$-transitions (also called null transitions), allowing state changes of which the observer is unaware. Due to the presence of $\epsilon$-loops, this additional feature complicates the theory and requires a careful setup of the corresponding probability space and random variables. In particular, we present an algorithm for determining the most probable explanation given an observation (a generalization of the Viterbi algorithm for HMMs) and a method for parameter learning that adapts the probabilities of a given model based on an observation (a generalization of the Baum-Welch algorithm). The latter algorithm guarantees that the given observation has a higher (or equal) probability after adjustment of the parameters, and its correctness can be derived directly from the so-called EM algorithm.
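    For orientation, the sketch below shows the classical Viterbi algorithm for ordinary HMMs, i.e. the baseline that the paper generalizes. It does not handle $\epsilon$-transitions, and the two-state model used in the example is a made-up toy, not taken from the paper.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the most probable explanation of obs."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor of s)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            p, arg = max((prev[r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (p * emit_p[s][o], arg)
        best.append(layer)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return prob, path


states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
prob, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

    For this toy model the most probable explanation is Healthy, Healthy, Fever with probability 0.01512.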
    Standalone Neural ODEs with Sensitivity Analysis. (arXiv:2205.13933v1 [cs.LG])
    This paper presents the Standalone Neural ODE (sNODE), a continuous-depth neural ODE model capable of describing a full deep neural network. This uses a novel nonlinear conjugate gradient (NCG) descent optimization scheme for training, where the Sobolev gradient can be incorporated to improve smoothness of model weights. We also present a general formulation of the neural sensitivity problem and show how it is used in the NCG training. The sensitivity analysis provides a reliable measure of uncertainty propagation throughout a network, and can be used to study model robustness and to generate adversarial attacks. Our evaluations demonstrate that our novel formulations lead to increased robustness and performance as compared to ResNet models, and that it opens up for new opportunities for designing and developing machine learning with improved explainability.
    Sample-Efficient Optimisation with Probabilistic Transformer Surrogates. (arXiv:2205.13902v1 [cs.LG])
    Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. Upon further investigation, we observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data - a setting arising from any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance especially in exploratory regions of the input space. To tackle these shortcomings we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that one transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is only pre-trained once and used in all tasks without any retraining and/or fine-tuning, we report an order of magnitude time-reduction, while matching and sometimes outperforming GPs.
    Counterfactual Analysis in Dynamic Models: Copulas and Bounds. (arXiv:2205.13832v1 [cs.LG])
    We provide an explicit model of the causal mechanism in a structural causal model (SCM) with the goal of estimating counterfactual quantities of interest (CQIs). We propose some standard dependence structures, i.e. copulas, as base cases for the causal mechanism. While these base cases can be used to construct more interesting copulas, there are uncountably many copulas in general and so we formulate optimization problems for bounding the CQIs. As our ultimate goal is counterfactual reasoning in dynamic models which may have latent-states, we show by way of example that filtering / smoothing / sampling methods for these models can be integrated with our modeling of the causal mechanism. Specifically, we consider the "cheating-at-the-casino" application of a hidden Markov model and use linear programming (LP) to construct lower and upper bounds on the casino's winnings due to cheating. These bounds are considerably tighter when we constrain the copulas in the LPs to be time-independent. We can characterize the entire space of SCMs obeying counterfactual stability (CS), and we use it to negatively answer the open question of Oberst and Sontag [18] regarding the uniqueness of the Gumbel-max mechanism for modeling CS. Our work has applications in epidemiology and legal reasoning, and more generally in counterfactual off-policy evaluation, a topic of increasing interest in the reinforcement learning community.
    On Consistency in Graph Neural Network Interpretation. (arXiv:2205.13733v1 [cs.LG])
    Uncovering rationales behind predictions of graph neural networks (GNNs) has received increasing attention over recent years. Instance-level GNN explanation aims to discover critical input elements, like nodes or edges, that the target GNN relies upon for making predictions. These identified sub-structures can provide interpretations of the GNN's behavior. Though various algorithms have been proposed, most of them formalize this task as searching for the minimal subgraph that preserves the original predictions. An inductive bias is deep-rooted in this framework: the same output cannot guarantee that two inputs are processed under the same rationale. Consequently, these algorithms risk providing spurious explanations and fail to provide consistent explanations. Applying them to explain poorly performing GNNs would further amplify these issues. To address the issues, we propose to obtain more faithful and consistent explanations of GNNs. After a close examination of the predictions of GNNs from the causality perspective, we attribute spurious explanations to two typical reasons: the confounding effect of latent variables like distribution shift, and causal factors distinct from the original input. Motivated by the observation that both confounding effects and diverse causal rationales are encoded in internal representations, we propose a simple yet effective countermeasure by aligning embeddings. This new objective can be incorporated into existing GNN explanation algorithms with no effort. We implement both a simplified version based on absolute distance and a distribution-aware version based on anchors. Experiments on 5 datasets validate its effectiveness, and theoretical analysis shows that it is in effect optimizing a more faithful explanation objective by design, which further justifies the proposed approach.
    Feudal Multi-Agent Reinforcement Learning with Adaptive Network Partition for Traffic Signal Control. (arXiv:2205.13836v1 [cs.MA])
    Multi-agent reinforcement learning (MARL) has shown great potential in multi-intersection traffic signal control, where multiple agents, one for each intersection, must cooperate to optimize traffic flow. To encourage global cooperation, previous work partitions the traffic network into several regions and learns policies for agents in a feudal structure. However, a static network partition fails to adapt to dynamic traffic flow, which changes frequently over time. To address this, we propose a novel feudal MARL approach with adaptive network partition. Specifically, we first partition the network into several regions according to the traffic flow. To do this, we propose two approaches: one is to use a graph neural network (GNN) directly to generate the network partition, and the other is to use Monte-Carlo tree search (MCTS) to find the best partition with criteria computed by a GNN. Then, we design a variant of Qmix using a GNN to handle various dimensions of input, given by the dynamic network partition. Finally, we use a feudal hierarchy to manage agents in each partition and promote global cooperation. By doing so, agents are able to adapt to the traffic flow as required in practice. We empirically evaluate our method both in a synthetic traffic grid and real-world traffic networks of three cities, widely used in the literature. Our experimental results confirm that our method can achieve better performance, in terms of average travel time and queue length, than several leading methods for traffic signal control.
    Global Normalization for Streaming Speech Recognition in a Modular Framework. (arXiv:2205.13674v1 [cs.LG])
    We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
    Isolating and Leveraging Controllable and Noncontrollable Visual Dynamics in World Models. (arXiv:2205.13817v1 [cs.LG])
    World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios such as autonomous driving, there commonly exists noncontrollable dynamics independent of the action signals, making it difficult to learn effective world models. To tackle this problem, we present a novel reinforcement learning approach named Iso-Dream, which improves the Dream-to-Control framework in two aspects. First, by optimizing the inverse dynamics, we encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches. Second, we optimize the behavior of the agent on the decoupled latent imaginations of the world model. Specifically, to estimate state values, we roll-out the noncontrollable states into the future and associate them with the current controllable state. In this way, the isolation of dynamics sources can greatly benefit long-horizon decision-making of the agent, such as a self-driving car that can avoid potential risks by anticipating the movement of other vehicles. Experiments show that Iso-Dream is effective in decoupling the mixed dynamics and remarkably outperforms existing approaches in a wide range of visual control and prediction domains.
    Can Foundation Models Help Us Achieve Perfect Secrecy?. (arXiv:2205.13722v1 [cs.LG])
    A key promise of machine learning is the ability to assist users with personal tasks. Because the personal context required to make accurate predictions is often sensitive, we require systems that protect privacy. A gold standard privacy-preserving system will satisfy perfect secrecy, meaning that interactions with the system provably reveal no additional private information to adversaries. This guarantee should hold even as we perform multiple personal tasks over the same underlying data. However, privacy and quality appear to be in tension in existing systems for personal tasks. Neural models typically require lots of training to perform well, while individual users typically hold a limited scale of data, so the systems propose to learn from the aggregate data of multiple users. This violates perfect secrecy and instead, in the last few years, academics have defended these solutions using statistical notions of privacy -- i.e., the probability of learning private information about a user should be reasonably low. Given the vulnerabilities of these solutions, we explore whether the strong perfect secrecy guarantee can be achieved using recent zero-to-few sample adaptation techniques enabled by foundation models. In response, we propose FOCUS, a framework for personal tasks. Evaluating on popular privacy benchmarks, we find the approach, satisfying perfect secrecy, competes with strong collaborative learning baselines on 6 of 7 tasks. We empirically analyze the proposal, highlighting the opportunities and limitations across task types, and model inductive biases and sizes.
    Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks. (arXiv:2205.13816v1 [q-bio.NC])
    Visual object recognition has been extensively studied in both neuroscience and computer vision. Recently, the most popular class of artificial systems for this task, deep convolutional neural networks (CNNs), has been shown to provide excellent models for its functional analogue in the brain, the ventral stream in visual cortex. This has prompted questions on what, if any, are the common principles underlying the reformatting of visual information as it flows through a CNN or the ventral stream. Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex and look for them in the other system. We show that intrinsic dimensionality (ID) of object representations along the rat homologue of the ventral stream presents two distinct expansion-contraction phases, as previously shown for CNNs. Conversely, in CNNs, we show that training results in both distillation and active pruning (mirroring the increase in ID) of low- to middle-level image information in single units, as representations gain the ability to support invariant discrimination, in agreement with previous observations in rat visual cortex. Taken together, our findings suggest that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
    FedFormer: Contextual Federation with Attention in Reinforcement Learning. (arXiv:2205.13697v1 [cs.LG])
    A core issue in federated reinforcement learning is defining how to aggregate insights from multiple agents into one. This is commonly done by averaging each participating agent's model weights into one common model (FedAvg). We instead propose FedFormer, a novel federation strategy that utilizes Transformer Attention to contextually aggregate embeddings from models originating from different learner agents. In so doing, we attentively weigh contributions of other agents with respect to the current agent's environment and learned relationships, thus providing more effective and efficient federation. We evaluate our methods on the Meta-World environment and find that our approach yields significant improvements over FedAvg and non-federated Soft Actor-Critic single-agent methods. Our results compared to Soft Actor-Critic show that FedFormer performs better while still abiding by the privacy constraints of federated learning. In addition, we demonstrate nearly linear improvements in effectiveness with increased agent pools in certain tasks. This is contrasted by FedAvg, which fails to make noticeable improvements when scaled.
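    The FedAvg baseline mentioned in the abstract can be sketched in a few lines. This is a minimal, unweighted version that assumes every agent holds parameters with the same names and shapes, stored as flat lists of floats; the function name and the data layout are assumptions, and FedFormer's attention-based aggregation is not shown.

```python
def fed_avg(agent_weights):
    """Element-wise mean of each named parameter across all agents."""
    n = len(agent_weights)
    first = agent_weights[0]
    return {
        name: [sum(w[name][i] for w in agent_weights) / n
               for i in range(len(first[name]))]
        for name in first
    }


agents = [
    {"w": [1.0, 3.0], "b": [0.0]},
    {"w": [3.0, 5.0], "b": [2.0]},
]
common = fed_avg(agents)   # {"w": [2.0, 4.0], "b": [1.0]}
```

    Every agent contributes equally regardless of its environment, which is the behaviour the contextual aggregation in the abstract is meant to replace.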
    Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4. (arXiv:2203.06649v2 [q-bio.NC] UPDATED)
    Modern high-scoring models of vision in the Brain-Score competition do not stem from Vision Transformers. However, in this paper, we provide evidence against the unexpected trend of Vision Transformers (ViT) being not perceptually aligned with human visual representations by showing how a dual-stream Transformer, a CrossViT a la Chen et al. (2021), under a joint rotationally-invariant and adversarial optimization procedure yields 2nd place in the aggregate Brain-Score 2022 competition (Schrimpf et al., 2020b) averaged across all visual categories, and at the time of the competition held 1st place for the highest explainable variance of area V4. In addition, our current Transformer-based model also achieves greater explainable variance for areas V4, IT and Behaviour than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like computation module (Dapello et al., 2020). To assess the contribution of the optimization scheme with respect to the CrossViT architecture, we perform several additional experiments on differently optimized CrossViTs regarding adversarial robustness, common corruption benchmarks, mid-ventral stimuli interpretation and feature inversion. Against our initial expectations, our family of results provides tentative support for an "All roads lead to Rome" argument enforced via a joint optimization rule even for non-biologically-motivated models of vision such as Vision Transformers. Code is available at https://github.com/williamberrios/BrainScore-Transformers
    Effective Abstract Reasoning with Dual-Contrast Network. (arXiv:2205.13720v1 [cs.CV])
    As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven's Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM problem formulation, the correct answer filled into the missing entry of the third row/column has to best satisfy the same rules shared between the first two rows/columns. Thus we design a simple yet effective Dual-Contrast Network (DCNet) to exploit the inherent structure of RPM puzzles. Specifically, a rule contrast module is designed to compare the latent rules between the filled row/column and the first two rows/columns; a choice contrast module is designed to increase the relative differences between candidate choices. Experimental results on the RAVEN and PGM datasets show that DCNet outperforms the state-of-the-art methods by a large margin of 5.77%. Further experiments on few training samples and model generalization also show the effectiveness of DCNet. Code is available at https://github.com/visiontao/dcnet.
    A Simple and Universal Rotation Equivariant Point-cloud Network. (arXiv:2203.01216v3 [cs.LG] UPDATED)
    Equivariance to permutations and rigid motions is an important inductive bias for various 3D learning problems. Recently it has been shown that the equivariant Tensor Field Network architecture is universal -- it can approximate any equivariant function. In this paper we suggest a much simpler architecture, prove that it enjoys the same universality guarantees and evaluate its performance on Modelnet40. The code to reproduce our experiments is available at https://github.com/simpleinvariance/UniversalNetwork
    RIGID: Robust Linear Regression with Missing Data. (arXiv:2205.13635v1 [cs.LG])
    We present a robust framework to perform linear regression with missing entries in the features. By considering an elliptical data distribution, and specifically a multivariate normal model, we are able to conditionally formulate a distribution for the missing entries and present a robust framework, which minimizes the worst-case error caused by the uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program, for which a customized and scalable solver can be delivered. In addition to a detailed analysis to deliver such a solver, we also asymptotically analyze the behavior of the proposed framework, and present technical discussions to estimate the required input parameters. We complement our analysis with experiments performed on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves the prediction accuracy and robustness, and outperforms the competing techniques.
    Subverting machines, fluctuating identities: Re-learning human categorization. (arXiv:2205.13740v1 [cs.LG])
    Most machine learning systems that interact with humans construct some notion of a person's "identity," yet the default paradigm in AI research envisions identity with essential attributes that are discrete and static. In stark contrast, strands of thought within critical theory present a conception of identity as malleable and constructed entirely through interaction; a doing rather than a being. In this work, we distill some of these ideas for machine learning practitioners and introduce a theory of identity as autopoiesis, circular processes of formation and function. We argue that the default paradigm of identity used by the field immobilizes existing identity categories and the power differentials that co-occur, due to the absence of iterative feedback to our models. This includes a critique of emergent AI fairness practices that continue to impose the default paradigm. Finally, we apply our theory to sketch approaches to autopoietic identity through multilevel optimization and relational learning. While these ideas raise many open questions, we imagine the possibilities of machines that are capable of expressing human identity as a relationship perpetually in flux.
    Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games. (arXiv:2205.13746v1 [math.OC])
    We study the problem of finding the Nash equilibrium in a two-player zero-sum Markov game. Due to its formulation as a minimax optimization program, a natural approach to solve the problem is to perform gradient descent/ascent with respect to each player in an alternating fashion. However, due to the non-convexity/non-concavity of the underlying objective function, theoretical understanding of this method is limited. In our paper, we consider solving an entropy-regularized variant of the Markov game. The regularization introduces structure into the optimization landscape that makes the solutions more identifiable and allows the problem to be solved more efficiently. Our main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem. We explicitly characterize the finite-time performance of the last iterate of our algorithm, which vastly improves over the existing convergence bound of the gradient descent ascent algorithm without regularization. Finally, we complement the analysis with numerical simulations that illustrate the accelerated convergence of the algorithm.
    A Sea of Words: An In-Depth Analysis of Anchors for Text Data. (arXiv:2205.13789v1 [stat.ML])
    Anchors [Ribeiro et al. (2018)] is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to explain has similar outputs when they are present in a document. In this paper, we present the first theoretical analysis of Anchors, considering that the search for the best anchor is exhaustive. We leverage this analysis to gain insights on the behavior of Anchors on simple models, including elementary if-then rules and linear classifiers.
    Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. (arXiv:2205.13760v1 [cs.LG])
    The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact that many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers a significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
    Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization. (arXiv:2112.08217v2 [stat.ML] UPDATED)
    Generative networks are often trained to minimize a statistical divergence between the reference distribution and the generative one in an adversarial setting. Some works instead trained generative networks to minimize Scoring Rules, functions assessing how well the generative distribution matches each training sample individually. We show how the Scoring Rule formulation easily extends to the so-called prequential (predictive-sequential) score, whose minimization allows performing probabilistic forecasting with generative networks. This objective leads to adversarial-free training, therefore easily avoiding uncertainty underestimation due to mode collapse, which is a common issue in the adversarial setting and undesirable for probabilistic forecasting. We provide consistency guarantees for the minimizer of the prequential score and employ that to perform probabilistic forecasting for two chaotic dynamical models and a benchmark dataset of global weather observations. For this last example, we define scoring rules for spatial data by drawing from the relevant literature, with which we obtain better uncertainty quantification with little hyperparameter tuning compared to adversarial training.
    Bootstrapping Informative Graph Augmentation via A Meta Learning Approach. (arXiv:2201.03812v3 [cs.LG] UPDATED)
    Recent works explore learning graph representations in a self-supervised manner. In graph contrastive learning, benchmark methods apply various graph augmentation approaches. However, most of the augmentation methods are non-learnable, which can produce unhelpful augmented graphs. Such augmentation may degrade the representation ability of graph contrastive learning methods. Therefore, we propose to generate augmented graphs with a learnable graph augmenter, called MEta Graph Augmentation (MEGA). We then clarify that a "good" graph augmentation must have uniformity at the instance-level and informativeness at the feature-level. To this end, we propose a novel approach to learning a graph augmenter that can generate an augmentation with uniformity and informativeness. The objective of the graph augmenter is to promote our feature extraction network to learn a more discriminative feature representation, which motivates us to propose a meta-learning paradigm. Empirically, the experiments across multiple benchmark datasets demonstrate that MEGA outperforms the state-of-the-art methods in graph self-supervised learning tasks. Further experimental studies demonstrate the effectiveness of the different terms of MEGA.
    Generating personalized counterfactual interventions for algorithmic recourse by eliciting user preferences. (arXiv:2205.13743v1 [cs.LG])
    Counterfactual interventions are a powerful tool to explain the decisions of a black-box decision process, and to enable algorithmic recourse. They are a sequence of actions that, if performed by a user, can overturn an unfavourable decision made by an automated decision system. However, most of the current methods provide interventions without considering the user's preferences. For example, a user might prefer certain actions over others. In this work, we present the first human-in-the-loop approach to perform algorithmic recourse by eliciting user preferences. We introduce a polynomial procedure to ask choice-set questions which maximize the Expected Utility of Selection (EUS), and use it to iteratively refine our cost estimates in a Bayesian setting. We integrate this preference elicitation strategy into a reinforcement learning agent coupled with Monte Carlo Tree Search for efficient exploration, so as to provide personalized interventions achieving algorithmic recourse. An experimental evaluation on synthetic and real-world datasets shows that a handful of queries is enough to achieve a substantial reduction in the cost of interventions with respect to user-independent alternatives.
    (De-)Randomized Smoothing for Decision Stump Ensembles. (arXiv:2205.13909v1 [cs.LG])
    Tree-based models are used in many high-stakes application domains such as finance and medicine, where robustness and interpretability are of utmost importance. Yet, methods for improving and certifying their robustness are severely under-explored, in contrast to those focusing on neural networks. Targeting this important challenge, we propose deterministic smoothing for decision stump ensembles. Whereas most prior work on randomized smoothing focuses on evaluating arbitrary base models approximately under input randomization, the key insight of our work is that decision stump ensembles enable exact yet efficient evaluation via dynamic programming. Importantly, we obtain deterministic robustness certificates, even jointly over numerical and categorical features, a setting ubiquitous in the real world. Further, we derive an MLE-optimal training method for smoothed decision stumps under randomization and propose two boosting approaches to improve their provable robustness. An extensive experimental evaluation shows that our approach yields significantly higher certified accuracies than the state-of-the-art for tree-based models. We release all code and trained models at ANONYMIZED.
    Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions. (arXiv:2205.13803v1 [cs.CV])
    A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts. We introduce Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images. It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. We carefully curate the few-shot instances with hard negatives, where positive and negative images only disagree on action labels, making mere recognition of object categories insufficient to complete our benchmarks. We also design multiple test sets to systematically study the generalization of visual learning models, where we vary the overlap of the HOI concepts between the training and test sets of few-shot instances, from partial to no overlaps. Bongard-HOI presents a substantial challenge to today's visual recognition models. The state-of-the-art HOI detection model achieves only 62% accuracy on few-shot binary prediction while even amateur human testers on MTurk have 91% accuracy. With the Bongard-HOI benchmark, we hope to further advance research efforts in visual reasoning, especially in holistic perception-reasoning systems and better representation learning.
    VectorAdam for Rotation Equivariant Geometry Optimization. (arXiv:2205.13599v1 [cs.LG])
    The rise of geometric problems in machine learning has necessitated the development of equivariant methods, which preserve their output under the action of rotation or some other transformation. At the same time, the Adam optimization algorithm has proven remarkably effective across machine learning and even traditional tasks in geometric optimization. In this work, we observe that naively applying Adam to optimize vector-valued data is not rotation equivariant, due to per-coordinate moment updates, and in fact this leads to significant artifacts and biases in practice. We propose to resolve this deficiency with VectorAdam, a simple modification which makes Adam rotation-equivariant by accounting for the vector structure of optimization variables. We demonstrate this approach on problems in machine learning and traditional geometric optimization, showing that equivariant VectorAdam resolves the artifacts and biases of traditional Adam when applied to vector-valued data, with equivalent or even improved rates of convergence.
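    The equivariance issue described above can be illustrated with a minimal sketch. It compares the first Adam step (zero-initialised moments, t = 1) on a single 2-D gradient under two second-moment rules: the usual per-coordinate square, and a per-vector squared norm. The squared-norm rule is an assumption about what "accounting for the vector structure" means, and the names `adam_first_step` and `rotate` are hypothetical; this is not the authors' implementation.

```python
import math

def adam_first_step(g, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, vector_wise=False):
    """First Adam update (t = 1, zero-initialised moments) for one gradient vector g."""
    m = [(1 - b1) * gi for gi in g]                # first moment, per coordinate
    if vector_wise:
        # Assumed vector-aware rule: one second moment shared by the whole vector.
        sq_norm = sum(gi * gi for gi in g)
        v = [(1 - b2) * sq_norm] * len(g)
    else:
        # Standard Adam: one second moment per coordinate.
        v = [(1 - b2) * gi * gi for gi in g]
    m_hat = [mi / (1 - b1) for mi in m]            # bias correction at t = 1
    v_hat = [vi / (1 - b2) for vi in v]
    return [-lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

def rotate(p, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1]]

def equivariance_gap(vector_wise, g=(1.0, 0.0), theta=math.pi / 4):
    """Largest coordinate difference between step(rotate(g)) and rotate(step(g))."""
    a = adam_first_step(rotate(list(g), theta), vector_wise=vector_wise)
    b = rotate(adam_first_step(list(g), vector_wise=vector_wise), theta)
    return max(abs(x - y) for x, y in zip(a, b))

gap_per_coordinate = equivariance_gap(vector_wise=False)
gap_vector_wise = equivariance_gap(vector_wise=True)
```

    In this sketch the per-coordinate rule gives a clearly nonzero gap, because the step reduces to a sign-like vector whose direction depends on the coordinate frame, while the per-vector rule gives a gap at floating-point level.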
    Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters. (arXiv:2205.13703v1 [cs.LG])
    Motivated by the success of ensembles for uncertainty estimation in supervised learning, we take a renewed look at how ensembles of $Q$-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL). We begin by identifying a critical flaw in a popular algorithmic choice used by many ensemble-based RL algorithms, namely the use of shared pessimistic target values when computing each ensemble member's Bellman error. Through theoretical analyses and construction of examples in toy MDPs, we demonstrate that shared pessimistic targets can paradoxically lead to value estimates that are effectively optimistic. Given this result, we propose MSG, a practical offline RL algorithm that trains an ensemble of $Q$-functions with independently computed targets based on completely separate networks, and optimizes a policy with respect to the lower confidence bound of predicted action values. Our experiments on the popular D4RL and RL Unplugged offline RL benchmarks demonstrate that on challenging domains such as antmazes, MSG with deep ensembles surpasses highly well-tuned state-of-the-art methods by a wide margin. Additionally, through ablations on benchmarks domains, we verify the critical significance of using independently trained $Q$-functions, and study the role of ensemble size. Finally, as using separate networks per ensemble member can become computationally costly with larger neural network architectures, we investigate whether efficient ensemble approximations developed for supervised learning can be similarly effective, and demonstrate that they do not match the performance and robustness of MSG with separate networks, highlighting the need for new efforts into efficient uncertainty estimation directed at RL.
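    The distinction between shared and independent targets can be sketched for a single transition. The code below is illustrative only: the function names are hypothetical, the lower confidence bound is assumed to be mean minus `beta` times the population standard deviation, and the shared target is assumed to bootstrap from that bound. The abstract does not specify these details.

```python
import statistics

def lower_confidence_bound(q_values, beta=1.0):
    """Assumed LCB: ensemble mean minus beta times the population standard deviation."""
    return statistics.mean(q_values) - beta * statistics.pstdev(q_values)

def independent_targets(reward, next_qs, gamma=0.99):
    """Each ensemble member bootstraps from its own next-state estimate."""
    return [reward + gamma * q for q in next_qs]

def shared_pessimistic_targets(reward, next_qs, gamma=0.99, beta=1.0):
    """All members regress to one pessimistic target built from the whole ensemble."""
    target = reward + gamma * lower_confidence_bound(next_qs, beta)
    return [target] * len(next_qs)

# Three ensemble members' estimates of Q(s', a') for one transition.
next_qs = [1.0, 2.0, 3.0]
independent = independent_targets(0.0, next_qs, gamma=0.5)
shared = shared_pessimistic_targets(0.0, next_qs, gamma=0.5, beta=1.0)
```

    With independent targets the members keep distinct regression targets, so their disagreement is preserved and can be turned into pessimism only at policy-optimization time via `lower_confidence_bound`. With a shared target every member is pulled toward the same value.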
    Inference and Sampling for Archimax Copulas. (arXiv:2205.14025v1 [stat.ME])
    Understanding multivariate dependencies in both the bulk and the tails of a distribution is an important problem for many applications, such as ensuring algorithms are robust to observations that are infrequent but have devastating effects. Archimax copulas are a family of distributions endowed with a precise representation that allows simultaneous modeling of the bulk and the tails of a distribution. Rather than separating the two as is typically done in practice, incorporating additional information from the bulk may improve inference of the tails, where observations are limited. Building on the stochastic representation of Archimax copulas, we develop a non-parametric inference method and sampling algorithm. Our proposed methods, to the best of our knowledge, are the first that allow for highly flexible and scalable inference and sampling algorithms, enabling the increased use of Archimax copulas in practical settings. We experimentally compare to state-of-the-art density modeling techniques, and the results suggest that the proposed method effectively extrapolates to the tails while scaling to higher dimensional data. Our findings suggest that the proposed algorithms can be used in a variety of applications where understanding the interplay between the bulk and the tails of a distribution is necessary, such as healthcare and safety.
    Solving infinite-horizon POMDPs with memoryless stochastic policies in state-action space. (arXiv:2205.14098v1 [cs.LG])
    Reward optimization in fully observable Markov decision processes is equivalent to a linear program over the polytope of state-action frequencies. Taking a similar perspective in the case of partially observable Markov decision processes with memoryless stochastic policies, the problem was recently formulated as the optimization of a linear objective subject to polynomial constraints. Based on this we present an approach for Reward Optimization in State-Action space (ROSA). We test this approach experimentally in maze navigation tasks. We find that ROSA is computationally efficient and can yield stability improvements over other existing methods.
    Spartan: Differentiable Sparsity via Regularized Transportation. (arXiv:2205.14107v1 [cs.LG])
    We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.
    Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation. (arXiv:2107.06011v3 [cs.CV] UPDATED)
    In the context of visual navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals. This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities, and discover object characteristics. Recent work introduces learnable policies parametrized by deep neural networks and trained with Reinforcement Learning (RL). In classical RL setups, the capacity to map and reason spatially is learned end-to-end, from reward alone. In this setting, we introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents, that either build an explicit or implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input. A learning-based agent from the literature trained with the proposed auxiliary losses was the winning entry to the Multi-Object Navigation Challenge, part of the CVPR 2021 Embodied AI Workshop.
    Learning to Control Linear Systems can be Hard. (arXiv:2205.14035v1 [cs.LG])
    In this paper, we study the statistical difficulty of learning to control linear systems. We focus on two standard benchmarks, the sample complexity of stabilization, and the regret of the online learning of the Linear Quadratic Regulator (LQR). Prior results state that the statistical difficulty for both benchmarks scales polynomially with the system state dimension up to system-theoretic quantities. However, this does not reveal the whole picture. By utilizing minimax lower bounds for both benchmarks, we prove that there exist non-trivial classes of systems for which learning complexity scales dramatically, i.e. exponentially, with the system dimension. This situation arises in the case of underactuated systems, i.e. systems with fewer inputs than states. Such systems are structurally difficult to control and their system theoretic quantities can scale exponentially with the system dimension dominating learning complexity. Under some additional structural assumptions (bounding systems away from uncontrollability), we provide qualitatively matching upper bounds. We prove that learning complexity can be at most exponential with the controllability index of the system, that is the degree of underactuation.
    Probabilistic Transformer: Modelling Ambiguities and Distributions for RNA Folding and Molecule Design. (arXiv:2205.13927v1 [cs.LG])
    Our world is ambiguous and this is reflected in the data we use to train our algorithms. This is especially true when we try to model natural processes where collected data is affected by noisy measurements and differences in measurement techniques. Sometimes, the process itself can be ambiguous, such as in the case of RNA folding, where a single nucleotide sequence can fold into multiple structures. This ambiguity suggests that a predictive model should have similar probabilistic characteristics to match the data it models. Therefore, we propose a hierarchical latent distribution to enhance one of the most successful deep learning models, the Transformer, to accommodate ambiguities and data distributions. We show the benefits of our approach on a synthetic task, with state-of-the-art results in RNA folding, and demonstrate its generative capabilities on property-based molecule design, outperforming existing work.
    Composing Partial Differential Equations with Physics-Aware Neural Networks. (arXiv:2111.11798v2 [cs.LG] UPDATED)
    We introduce a compositional physics-aware FInite volume Neural Network (FINN) for learning spatiotemporal advection-diffusion processes. FINN implements a new way of combining the learning abilities of artificial neural networks with physical and structural knowledge from numerical simulation by modeling the constituents of partial differential equations (PDEs) in a compositional manner. Results on both one- and two-dimensional PDEs (Burgers', diffusion-sorption, diffusion-reaction, Allen--Cahn) demonstrate FINN's superior modeling accuracy and excellent out-of-distribution generalization ability beyond initial and boundary conditions. With only one tenth of the number of parameters on average, FINN outperforms pure machine learning and other state-of-the-art physics-aware models in all cases -- often even by multiple orders of magnitude. Moreover, FINN outperforms a calibrated physical model when approximating sparse real-world data in a diffusion-sorption scenario, confirming its generalization abilities and showing explanatory potential by revealing the unknown retardation factor of the observed process.
    Bias Reduction via Cooperative Bargaining in Synthetic Graph Dataset Generation. (arXiv:2205.13901v1 [cs.LG])
    In general, to draw robust conclusions from a dataset, the entire analyzed population must be represented in that dataset. A dataset that does not fulfill this condition normally leads to selection bias. Additionally, graphs have been used to model a wide variety of problems. Although synthetic graphs can be used to augment available real graph datasets to overcome selection bias, the generation of unbiased synthetic datasets is complex with current tools. In this work, we propose a method to find a synthetic graph dataset that has an even representation of graphs with different metrics. The resulting dataset can then be used, among other things, for benchmarking graph processing techniques, such as the accuracy of different Graph Neural Network (GNN) models or the speedups obtained by different graph processing acceleration frameworks.
    MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection for Domain Generalization Task. (arXiv:2205.13879v1 [cs.SD])
    We present a machine sound dataset to benchmark domain generalization techniques for anomalous sound detection (ASD). To handle performance degradation caused by domain shifts that are difficult to detect or too frequent to adapt to, domain generalization techniques are preferred. However, currently available datasets have difficulties in evaluating these techniques, such as a limited number of values for parameters that cause domain shifts (domain shift parameters). In this paper, we present the first ASD dataset for the domain generalization techniques, called MIMII DG. The dataset consists of five machine types and three domain shift scenarios for each machine type. We prepared at least two values for the domain shift parameters in the source domain. Also, we introduced domain shifts that can be difficult to notice. Experimental results using two baseline systems indicate that the dataset reproduces the domain shift scenarios and is useful for benchmarking domain generalization techniques.
    Non-Markovian policies occupancy measures. (arXiv:2205.13950v1 [cs.LG])
    A central object of study in Reinforcement Learning (RL) is the Markovian policy, in which an agent's actions are chosen from a memoryless probability distribution, conditioned only on its current state. The family of Markovian policies is broad enough to be interesting, yet simple enough to be amenable to analysis. However, RL often involves more complex policies: ensembles of policies, policies over options, policies updated online, etc. Our main contribution is to prove that the occupancy measure of any non-Markovian policy, i.e., the distribution of transition samples collected with it, can be equivalently generated by a Markovian policy. This result allows theorems about the Markovian policy class to be directly extended to its non-Markovian counterpart, greatly simplifying proofs, in particular those involving replay buffers and datasets. We provide various examples of such applications to the field of Reinforcement Learning.
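    A small numerical check makes the occupancy-measure claim concrete. The sketch below assumes the discounted occupancy measure and uses the standard construction of normalising the occupancy over actions, which may differ from the paper's exact setting. The two-state MDP, the discount factor and all names are hypothetical. The non-Markovian policy is an ensemble that picks one of two Markovian policies with probability 0.5 at the start of each episode and follows it throughout.

```python
GAMMA = 0.9
# P[s][a][s2]: probability of moving from state s to s2 under action a (made-up numbers).
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
MU0 = [1.0, 0.0]  # initial state distribution

def occupancy(pi, horizon=500):
    """Discounted state-action occupancy of a Markovian policy pi[s][a] (truncated sum)."""
    d = [[0.0, 0.0], [0.0, 0.0]]
    p = MU0[:]                 # state distribution at the current step
    weight = 1.0 - GAMMA       # (1 - gamma) * gamma**t
    for _ in range(horizon):
        nxt = [0.0, 0.0]
        for s in range(2):
            for a in range(2):
                sa = p[s] * pi[s][a]
                d[s][a] += weight * sa
                for s2 in range(2):
                    nxt[s2] += sa * P[s][a][s2]
        p = nxt
        weight *= GAMMA
    return d

pi1 = [[1.0, 0.0], [0.2, 0.8]]
pi2 = [[0.0, 1.0], [0.9, 0.1]]

# Occupancy of the per-episode ensemble: the average of the two occupancies.
d1, d2 = occupancy(pi1), occupancy(pi2)
d_mix = [[0.5 * d1[s][a] + 0.5 * d2[s][a] for a in range(2)] for s in range(2)]

# Candidate Markovian policy: normalise the occupancy over actions in each state.
pi_markov = [[d_mix[s][a] / sum(d_mix[s]) for a in range(2)] for s in range(2)]
d_markov = occupancy(pi_markov)

gap = max(abs(d_mix[s][a] - d_markov[s][a]) for s in range(2) for a in range(2))
```

    Here `gap` comes out at floating-point level, so the Markovian policy `pi_markov` reproduces the ensemble's occupancy in this example. The construction needs every state to have positive occupancy so that the normalisation is well defined.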
    Counterfactual Fairness with Partially Known Causal Graph. (arXiv:2205.13972v1 [cs.LG])
    Fair machine learning aims to avoid treating individuals or sub-populations unfavourably based on sensitive attributes, such as gender and race. Those methods in fair machine learning that are built on causal inference ascertain discrimination and bias through causal effects. Though causality-based fair learning is attracting increasing attention, current methods assume the true causal graph is fully known. This paper proposes a general method to achieve the notion of counterfactual fairness when the true causal graph is unknown. To be able to select features that lead to counterfactual fairness, we derive the conditions and algorithms to identify ancestral relations between variables on a Partially Directed Acyclic Graph (PDAG), specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. Interestingly, we find that counterfactual fairness can be achieved as if the true causal graph were fully known, when specific background knowledge is provided: the sensitive attributes do not have ancestors in the causal graph. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.
    Learning with Stochastic Orders. (arXiv:2205.13684v1 [stat.ML])
    Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, we introduce the Choquet-Toland distance between probability measures, that can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, that encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities and show that they suffer from the curse of dimensionality and propose surrogates via input convex maxout networks (ICMNs), that enjoy parametric rates. Finally, we provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. The code is available at https://github.com/yair-schiff/stochastic-orders-ICMN
    Transformers from an Optimization Perspective. (arXiv:2205.13891v1 [cs.LG])
    Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms like the Transformer has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
    Off-Beat Multi-Agent Reinforcement Learning. (arXiv:2205.13718v1 [cs.MA])
    We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failure for multi-agent coordination with off-beat actions. In order to fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories by utilizing agents' individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by the off-beat actions via our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency.
    HOUDINI: Escaping from Moderately Constrained Saddles. (arXiv:2205.13753v1 [cs.LG])
    We give the first polynomial time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f \colon \mathbb R^d \to \mathbb R$ we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. This constitutes the first tangible progress (without reliance on NP-oracles or altering the definitions to only account for certain constraints) on the main open question of the breakthrough work of Ge et al. who showed an analogous result for unconstrained and equality-constrained problems. Our results hold for both regular and stochastic gradient descent.
    End-to-End Learning of Hybrid Inverse Dynamics Models for Precise and Compliant Impedance Control. (arXiv:2205.13804v1 [cs.RO])
    It is well known that inverse dynamics models can improve tracking performance in robot control. These models need to precisely capture the robot dynamics, which consist of well-understood components, e.g., rigid body dynamics, and effects that remain challenging to capture, e.g., stick-slip friction and mechanical flexibilities. Such effects exhibit hysteresis and partial observability, rendering them particularly challenging to model. Hence, hybrid models, which combine a physical prior with data-driven approaches, are especially well-suited in this setting. We present a novel hybrid model formulation that enables us to identify fully physically consistent inertial parameters of a rigid body dynamics model which is paired with a recurrent neural network architecture, allowing us to capture unmodeled partially observable effects using the network memory. We compare our approach against state-of-the-art inverse dynamics models on a 7 degree of freedom manipulator. Using data sets obtained through an optimal experiment design approach, we study the accuracy of offline torque prediction and generalization capabilities of joint learning methods. In control experiments on the real system, we evaluate the model as a feed-forward term for impedance control and show the feedback gains can be drastically reduced to achieve a given tracking accuracy.
    Multivariate Probabilistic Forecasting of Intraday Electricity Prices using Normalizing Flows. (arXiv:2205.13826v1 [cs.LG])
    Electricity is traded on various markets with different time horizons and regulations. Short-term trading becomes increasingly important due to higher penetration of renewables. In Germany, the intraday electricity price typically fluctuates around the day-ahead price of the EPEX spot markets in a distinct hourly pattern. This work proposes a probabilistic modeling approach that models the intraday price difference to the day-ahead contracts. The model captures the emerging hourly pattern by considering the four 15 min intervals in each day-ahead price interval as a four-dimensional joint distribution. The resulting nontrivial, multivariate price difference distribution is learned using a normalizing flow, i.e., a deep generative model that combines conditional multivariate density estimation and probabilistic regression. The normalizing flow is compared to a selection of historical data, a Gaussian copula, and a Gaussian regression model. Among the different models, the normalizing flow identifies the trends most accurately and has the narrowest prediction intervals. Notably, the normalizing flow is the only approach that identifies rare price peaks. Finally, this work discusses the influence of different external impact factors and finds that, individually, most of these factors have negligible impact. Only the immediate history of the price difference realization and the combination of all input factors lead to notable improvements in the forecasts.
    Comparison of Deep Learning Segmentation and Multigrader-annotated Mandibular Canals of Multicenter CBCT scans. (arXiv:2205.13874v1 [cs.LG])
    Deep learning approaches have been shown to automatically segment the bilateral mandibular canals from CBCT scans, yet systematic studies of their clinical and technical validation are scarce. To validate the mandibular canal localization accuracy of a deep learning system (DLS), we trained it with 982 CBCT scans and evaluated it using 150 scans of five scanners from clinical workflow patients of European and Southeast Asian Institutes, annotated by four radiologists. The interobserver variability was compared to the variability between the DLS and the radiologists. In addition, the generalization of the DLS to CBCT scans from scanners not used in the training data was examined to evaluate the out-of-distribution generalization capability. The DLS had lower variability to the radiologists than the interobserver variability between them, and it was able to generalize to three new devices. For the radiologists' consensus segmentation, used as gold standard, the DLS had a symmetric mean curve distance of 0.39 mm compared to those of the individual radiologists with 0.62 mm, 0.55 mm, 0.47 mm, and 0.42 mm. Compared with the radiologists, the DLS showed comparable or slightly better performance in the segmentation of the mandibular canal, as well as generalization capability to new scanners.
    Deep Reinforcement Learning for Distributed and Uncoordinated Cognitive Radios Resource Allocation. (arXiv:2205.13944v1 [cs.LG])
    This paper presents a novel deep reinforcement learning-based resource allocation technique for the multi-agent environment presented by a cognitive radio network where the interactions of the agents during learning may lead to a non-stationary environment. The resource allocation technique presented in this work is distributed, not requiring coordination with other agents. It is shown by considering aspects specific to deep reinforcement learning that the presented algorithm converges in an arbitrarily long time to equilibrium policies in a non-stationary multi-agent environment that results from the uncoordinated dynamic interaction between radios through the shared wireless environment. Simulation results show that the presented technique achieves a faster learning performance compared to an equivalent table-based Q-learning algorithm and is able to find the optimal policy in 99% of cases for a sufficiently long learning time. In addition, simulations show that our DQL approach requires less than half the number of learning steps to achieve the same performance as an equivalent table-based implementation. Moreover, it is shown that the use of a standard single-agent deep reinforcement learning approach may not achieve convergence when used in an uncoordinated interacting multi-radio scenario.
    Group GAN. (arXiv:2205.13741v1 [cs.LG])
    Generating multivariate time series is a promising approach for sharing sensitive data in many medical, financial, and IoT applications. A common type of multivariate time series originates from a single source such as the biometric measurements from a medical patient. This leads to complex dynamical patterns between individual time series that are hard to learn by typical generation models such as GANs. There is valuable information in those patterns that machine learning models can use to better classify, predict or perform other downstream tasks. We propose a novel framework that takes time series' common origin into account and favors inter-channel relationship preservation. The two key points of our method are: 1) the individual time series are generated from a common point in latent space and 2) a central discriminator favors the preservation of inter-channel dynamics. We demonstrate empirically that our method helps preserve channel correlations and that our synthetic data performs very well on downstream tasks with medical and financial data.
    Membership Inference Attack Using Self Influence Functions. (arXiv:2205.13680v1 [cs.LG])
    Membership inference (MI) attacks aim to determine if a specific data sample was used to train a machine learning model. Thus, MI is a major privacy threat to models trained on private sensitive data, such as medical records. In MI attacks one may consider the black-box setting, where the model's parameters and activations are hidden from the adversary, or the white-box case where they are available to the attacker. In this work, we focus on the latter and present a novel MI attack for it that employs influence functions, or more specifically the samples' self-influence scores, to perform the MI prediction. We evaluate our attack on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, using versatile architectures such as AlexNet, ResNet, and DenseNet. Our attack method achieves new state-of-the-art results for both training with and without data augmentations. Code is available at https://github.com/giladcohen/sif_mi_attack.
    Privacy of Noisy Stochastic Gradient Descent: More Iterations without More Privacy Loss. (arXiv:2205.13710v1 [cs.LG])
    A central issue in machine learning is how to train models on sensitive user data. Industry has widely adopted a simple algorithm: Stochastic Gradient Descent with noise (a.k.a. Stochastic Gradient Langevin Dynamics). However, foundational theoretical questions about this algorithm's privacy loss remain open -- even in the seemingly simple setting of smooth convex losses over a bounded domain. Our main result resolves these questions: for a large range of parameters, we characterize the differential privacy up to a constant factor. This result reveals that all previous analyses for this setting have the wrong qualitative behavior. Specifically, while previous privacy analyses increase ad infinitum in the number of iterations, we show that after a small burn-in period, running SGD longer leaks no further privacy. Our analysis departs completely from previous approaches based on fast mixing, instead using techniques based on optimal transport (namely, Privacy Amplification by Iteration) and the Sampled Gaussian Mechanism (namely, Privacy Amplification by Sampling). Our techniques readily extend to other settings, e.g., strongly convex losses, non-uniform stepsizes, arbitrary batch sizes, and random or cyclic choice of batches.
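To make the setting concrete, the following is a minimal sketch of noisy SGD over a bounded domain: each step takes a minibatch gradient, adds Gaussian noise, and projects back onto a Euclidean ball. It is an illustration under stated assumptions, not the paper's algorithm or analysis: the names `project_to_ball` and `noisy_sgd`, the noise scaling `step * sigma`, and the toy quadratic losses are all hypothetical choices, and no privacy accounting is performed.

```python
import random


def project_to_ball(x, radius):
    """Project x onto the Euclidean ball of the given radius (the bounded domain)."""
    norm = sum(v * v for v in x) ** 0.5
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def noisy_sgd(grads, x0, step, sigma, radius, iters, batch_size, rng):
    """Projected SGD with Gaussian noise added to every update.

    grads: list of per-example gradient functions (hypothetical interface).
    The noise scale step * sigma is an assumption made for illustration.
    """
    x = project_to_ball(x0, radius)
    n = len(grads)
    for _ in range(iters):
        batch = rng.sample(range(n), batch_size)  # minibatch without replacement
        g = [0.0] * len(x)
        for i in batch:
            gi = grads[i](x)
            g = [a + b / batch_size for a, b in zip(g, gi)]
        x = [xi - step * gi + step * sigma * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
        x = project_to_ball(x, radius)  # keep the iterate inside the domain
    return x


# Toy smooth convex losses: f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
centers = [[0.2, 0.1], [0.4, -0.1], [0.3, 0.0], [0.1, 0.2]]
grads = [lambda x, c=c: [xi - ci for xi, ci in zip(x, c)] for c in centers]
rng = random.Random(0)
x_final = noisy_sgd(grads, [0.0, 0.0], step=0.1, sigma=0.5,
                    radius=1.0, iters=100, batch_size=2, rng=rng)
```

The projection step is what keeps the iterates in a bounded domain, which is the setting the abstract describes; the privacy guarantee itself depends on the analysis in the paper and is not reproduced here.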
    DP-PCA: Statistically Optimal and Differentially Private PCA. (arXiv:2205.13709v1 [cs.LG])
    We study the canonical statistical task of computing the principal component from $n$ i.i.d. data in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in the literature, existing solutions fall short on two key aspects: ($i$) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n=\Omega(d^{3/2})$, to obtain non-trivial results while non-private PCA requires only $n=O(d)$, and ($ii$) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, which is a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on private mean estimation, which adds minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n=\tilde O(d)$. Furthermore, we provide a lower bound showing that sub-Gaussian style assumption is necessary in obtaining the optimal error rate.
    Hazard Gradient Penalty for Survival Analysis. (arXiv:2205.13717v1 [cs.LG])
    Survival analysis appears in various fields such as medicine, economics, engineering, and business. Recent studies showed that the Ordinary Differential Equation (ODE) modeling framework unifies many existing survival models while the framework is flexible and widely applicable. However, naively applying the ODE framework to survival analysis problems may model a fiercely changing density function, which may worsen the model's performance. Though we can apply L1 or L2 regularizers to the ODE model, their effect on the ODE modeling framework is barely known. In this paper, we propose hazard gradient penalty (HGP) to enhance the performance of a survival analysis model. Our method imposes constraints on local data points by regularizing the gradient of the hazard function with respect to the data point. Our method applies to any survival analysis model including the ODE modeling framework and is easy to implement. We theoretically show that our method is related to minimizing the KL divergence between the density function at a data point and that of the neighborhood points. Experimental results on three public benchmarks show that our approach outperforms other regularization methods.
    Auto-PINN: Understanding and Optimizing Physics-Informed Neural Architecture. (arXiv:2205.13748v1 [cs.LG])
    Physics-informed neural networks (PINNs) are revolutionizing science and engineering practice by bringing together the power of deep learning to bear on scientific computation. In forward modeling problems, PINNs are meshless partial differential equation (PDE) solvers that can handle irregular, high-dimensional physical domains. Naturally, the neural architecture hyperparameters have a large impact on the efficiency and accuracy of the PINN solver. However, this remains an open and challenging problem because of the large search space and the difficulty of identifying a proper search objective for PDEs. Here, we propose Auto-PINN, the first systematic, automated hyperparameter optimization approach for PINNs, which employs Neural Architecture Search (NAS) techniques to PINN design. Auto-PINN avoids manually or exhaustively searching the hyperparameter space associated with PINNs. A comprehensive set of pre-experiments using standard PDE benchmarks allows us to probe the structure-performance relationship in PINNs. We find that the different hyperparameters can be decoupled, and that the training loss function of PINNs is a good search objective. Comparison experiments with baseline methods demonstrate that Auto-PINN produces neural architectures with superior stability and accuracy over alternative baselines.
    Block-coordinate Frank-Wolfe algorithm and convergence analysis for semi-relaxed optimal transport problem. (arXiv:2205.13766v1 [cs.LG])
    The optimal transport (OT) problem has been used widely for machine learning. It is necessary for computation of an OT problem to solve linear programming with tight mass-conservation constraints. These constraints prevent its application to large-scale problems. To address this issue, loosening such constraints enables us to propose the relaxed-OT method using a faster algorithm. This approach has demonstrated its effectiveness for applications. However, it remains slow. As a superior alternative, we propose a fast block-coordinate Frank-Wolfe (BCFW) algorithm for a convex semi-relaxed OT. Specifically, we prove their upper bounds of the worst convergence iterations, and equivalence between the linearization duality gap and the Lagrangian duality gap. Additionally, we develop two fast variants of the proposed BCFW. Numerical experiments have demonstrated that our proposed algorithms are effective for color transfer and surpass state-of-the-art algorithms. This report presents a short version of arXiv:2103.05857.
    Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap. (arXiv:2203.13457v2 [cs.LG] UPDATED)
    Recently, contrastive learning has risen to be a promising approach for large-scale self-supervised learning. However, theoretical understanding of how it works is still unclear. In this paper, we propose a new guarantee on the downstream performance without resorting to the conditional independence assumption that is widely adopted in previous work but hardly holds in practice. Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations, thus simply aligning the positive samples (augmented views of the same sample) could make contrastive learning cluster intra-class samples together. Based on this augmentation overlap perspective, theoretically, we obtain asymptotically closed bounds for downstream performance under weaker assumptions, and empirically, we propose an unsupervised model selection metric ARC that aligns well with downstream accuracy. Our theory suggests an alternative understanding of contrastive learning: the role of aligning positive samples is more like a surrogate task than an ultimate goal, and the overlapped augmented views (i.e., the chaos) create a ladder for contrastive learning to gradually learn class-separated representations. The code for computing ARC is available at https://github.com/zhangq327/ARC.
    Transformer for Partial Differential Equations' Operator Learning. (arXiv:2205.13671v1 [cs.LG])
    Data-driven learning of partial differential equations' solution operators has recently emerged as a promising paradigm for approximating the underlying solutions. The solution operators are usually parameterized by deep learning models that are built upon problem-specific inductive biases. An example is a convolutional or a graph neural network that exploits the local grid structure where functions' values are sampled. The attention mechanism, on the other hand, provides a flexible way to implicitly exploit the patterns within inputs, and furthermore, relationship between arbitrary query locations and inputs. In this work, we present an attention-based framework for data-driven operator learning, which we term Operator Transformer (OFormer). Our framework is built upon self-attention, cross-attention, and a set of point-wise multilayer perceptrons (MLPs), and thus it makes few assumptions on the sampling pattern of the input function or query locations. We show that the proposed framework is competitive on standard benchmark problems and can flexibly be adapted to randomly sampled input.
    Mixed Federated Learning: Joint Decentralized and Centralized Learning. (arXiv:2205.13655v1 [cs.LG])
    Federated learning (FL) enables learning from decentralized privacy-sensitive data, with computations on raw data confined to take place at edge clients. This paper introduces mixed FL, which incorporates an additional loss term calculated at the coordinating server (while maintaining FL's private data restrictions). There are numerous benefits. For example, additional datacenter data can be leveraged to jointly learn from centralized (datacenter) and decentralized (federated) training data and better match an expected inference data distribution. Mixed FL also enables offloading some intensive computations (e.g., embedding regularization) to the server, greatly reducing communication and client computation load. For these and other mixed FL use cases, we present three algorithms: PARALLEL TRAINING, 1-WAY GRADIENT TRANSFER, and 2-WAY GRADIENT TRANSFER. We state convergence bounds for each, and give intuition on which are suited to particular mixed FL problems. Finally we perform extensive experiments on three tasks, demonstrating that mixed FL can blend training data to achieve an oracle's accuracy on an inference distribution, and can reduce communication and computation overhead by over 90%. Our experiments confirm theoretical predictions of how algorithms perform under different mixed FL problem settings.
    Fight Poison with Poison: Detecting Backdoor Poison Samples via Decoupling Benign Correlations. (arXiv:2205.13616v1 [cs.LG])
    In this work, we study poison samples detection for defending against backdoor poisoning attacks on deep neural networks (DNNs). A principled idea underlying prior arts on this problem is to utilize the backdoored models' distinguishable behaviors on poison and clean populations to distinguish between these two different populations themselves and remove the identified poison. Many prior arts build their detectors upon a latent separability assumption, which states that backdoored models trained on the poisoned dataset will learn separable latent representations for backdoor and clean samples. Although such separation behaviors empirically exist for many existing attacks, there is no control on the separability and the extent of separation can vary a lot across different poison strategies, datasets, as well as the training configurations of backdoored models. Worse still, recent adaptive poison strategies can greatly reduce the "distinguishable behaviors" and consequently render most prior arts less effective (or completely fail). We point out that these limitations directly come from the passive reliance on some distinguishable behaviors that are not controlled by defenders. To mitigate such limitations, in this work, we propose the idea of active defense -- rather than passively assuming backdoored models will have certain distinguishable behaviors on poison and clean samples, we propose to actively enforce the trained models to behave differently on these two different populations. Specifically, we introduce confusion training as a concrete instance of active defense.
    Asymptotic Convergence Rate and Statistical Inference for Stochastic Sequential Quadratic Programming. (arXiv:2205.13687v1 [math.OC])
    We apply a stochastic sequential quadratic programming (StoSQP) algorithm to solve constrained nonlinear optimization problems, where the objective is stochastic and the constraints are deterministic. We study a fully stochastic setup, where only a single sample is available in each iteration for estimating the gradient and Hessian of the objective. We allow StoSQP to select a random stepsize $\bar{\alpha}_t$ adaptively, such that $\beta_t\leq \bar{\alpha}_t \leq \beta_t+\chi_t$, where $\beta_t$, $\chi_t=o(\beta_t)$ are prespecified deterministic sequences. We also allow StoSQP to solve Newton system inexactly via randomized iterative solvers, e.g., with the sketch-and-project method; and we do not require the approximation error of inexact Newton direction to vanish. For this general StoSQP framework, we establish the asymptotic convergence rate for its last iterate, with the worst-case iteration complexity as a byproduct; and we perform statistical inference. In particular, with proper decaying $\beta_t,\chi_t$, we show that: (i) the StoSQP scheme can take at most $O(1/\epsilon^4)$ iterations to achieve $\epsilon$-stationarity; (ii) asymptotically and almost surely, $\|(x_t -x^\star, \lambda_t - \lambda^\star)\| = O(\sqrt{\beta_t\log(1/\beta_t)})+O(\chi_t/\beta_t)$, where $(x_t,\lambda_t)$ is the primal-dual StoSQP iterate; (iii) the sequence $1/\sqrt{\beta_t}\cdot (x_t -x^\star, \lambda_t - \lambda^\star)$ converges to a mean zero Gaussian distribution with a nontrivial covariance matrix. Moreover, we establish the Berry-Esseen bound for $(x_t, \lambda_t)$ to measure quantitatively the convergence of its distribution function. We also provide a practical estimator for the covariance matrix, from which the confidence intervals of $(x^\star, \lambda^\star)$ can be constructed using iterates $\{(x_t,\lambda_t)\}_t$. Our theorems are validated using nonlinear problems in CUTEst test set.
    Safety Aware Changepoint Detection for Piecewise i.i.d. Bandits. (arXiv:2205.13689v1 [cs.LG])
    In this paper, we consider the setting of piecewise i.i.d. bandits under a safety constraint. In this piecewise i.i.d. setting, there exists a finite number of changepoints where the mean of some or all arms change simultaneously. We introduce the safety constraint studied in \citet{wu2016conservative} to this setting such that at any round the cumulative reward is above a constant factor of the default action reward. We propose two actively adaptive algorithms for this setting that satisfy the safety constraint, detect changepoints, and restart without the knowledge of the number of changepoints or their locations. We provide regret bounds for our algorithms and show that the bounds are comparable to their counterparts from the safe bandit and piecewise i.i.d. bandit literature. We also provide the first matching lower bounds for this setting. Empirically, we show that our safety-aware algorithms perform similarly to the state-of-the-art actively adaptive algorithms that do not satisfy the safety constraint.
    Contextual Adapters for Personalized Speech Recognition in Neural Transducers. (arXiv:2205.13660v1 [cs.CL])
    Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and the deterministic approach to weight boosting, their performance is limited. In this paper, we propose training neural contextual adapters for personalization in neural transducer based ASR models. Our approach can not only bias towards user-defined words, but also has the flexibility to work with pretrained ASR models. Using an in-house dataset, we demonstrate that contextual adapters can be applied to any general purpose pretrained ASR model to improve personalization. Our method outperforms shallow fusion, while retaining functionality of the pretrained models by not altering any of the model weights. We further show that the adapter style training is superior to full-fine-tuning of the ASR models on datasets with user-defined content.
    SeedGNN: Graph Neural Networks for Supervised Seeded Graph Matching. (arXiv:2205.13679v1 [cs.LG])
    Recently, there has been significant interest in designing Graph Neural Networks (GNNs) for seeded graph matching, which aims to match two (unlabeled) graphs using only topological information and a small set of seeds. However, most previous GNN architectures for seeded graph matching employ a semi-supervised approach, which learns from only the seed set in a single pair of graphs, and therefore does not attempt to learn from many training examples/graphs to best match future unseen graphs. In contrast, this paper is the first to propose a supervised approach for seeded graph matching, which had so far only been used for seedless graph matching. Our proposed SeedGNN architecture employs a number of novel design choices that are inspired by theoretical studies of seeded graph matching. First, SeedGNN can easily learn the capability of counting and using witnesses of different hops, in a way that can be generalized to graphs with different sizes. Second, SeedGNN can use easily-matched pairs as new seeds to percolate and match other nodes. We evaluate SeedGNN on both synthetic and real graphs, and demonstrate significant performance improvement over both non-learning and learning algorithms in the existing literature. Further, our experiments confirm that the knowledge learned by SeedGNN from training graphs can be generalized to test graphs with different sizes and categories.
    Consistent and fast inference in compartmental models of epidemics using Poisson Approximate Likelihoods. (arXiv:2205.13602v1 [stat.ME])
    Addressing the challenge of scaling-up epidemiological inference to complex and heterogeneous models, we introduce Poisson Approximate Likelihood (PAL) methods. In contrast to the popular ODE approach to compartmental modelling, in which a large population limit is used to motivate a deterministic model, PALs are derived from approximate filtering equations for finite-population, stochastic compartmental models, and the large population limit drives the consistency of maximum PAL estimators. Our theoretical results appear to be the first likelihood-based parameter estimation consistency results applicable across a broad class of partially observed stochastic compartmental models. Compared to simulation-based methods such as Approximate Bayesian Computation and Sequential Monte Carlo, PALs are simple to implement, involving only elementary arithmetic operations and no tuning parameters; and fast to evaluate, requiring no simulation from the model and having computational cost independent of population size. Through examples, we demonstrate how PALs can be: embedded within Delayed Acceptance Particle Markov Chain Monte Carlo to facilitate Bayesian inference; used to fit an age-structured model of influenza, taking advantage of automatic differentiation in Stan; and applied to calibrate a spatial meta-population model of measles.
    Approximate Q-learning and SARSA(0) under the $\epsilon$-greedy Policy: a Differential Inclusion Analysis. (arXiv:2205.13617v1 [cs.LG])
    Q-learning and SARSA(0) with linear function approximation, under $\epsilon$-greedy exploration, are leading methods to estimate the optimal policy in Reinforcement Learning (RL). It has been empirically known that the discontinuous nature of the greedy policies causes these algorithms to exhibit complex phenomena such as i.) instability, ii.) policy oscillation and chattering, iii.) multiple attractors, and iv.) worst policy convergence. However, the literature lacks a formal recipe to explain these behaviors and this has been a long-standing open problem (Sutton, 1999). Our work addresses this by building the necessary mathematical framework using stochastic recursive inclusions and Differential Inclusions (DIs). From this novel viewpoint, our main result states that these approximate algorithms asymptotically converge to suitable invariant sets of DIs instead of differential equations, as is common elsewhere in RL. Furthermore, the nature of these deterministic DIs completely governs the limiting behaviors of these algorithms.
    Incorporating the Barzilai-Borwein Adaptive Step Size into Sugradient Methods for Deep Network Training. (arXiv:2205.13711v1 [cs.LG])
    In this paper, we incorporate the Barzilai-Borwein step size into gradient descent methods used to train deep networks. This allows us to adapt the learning rate using a two-point approximation to the secant equation which quasi-Newton methods are based upon. Moreover, the adaptive learning rate method presented here is quite general in nature and can be applied to widely used gradient descent approaches such as Adagrad and RMSprop. We evaluate our method using standard example network architectures on widely available datasets and compare against alternatives elsewhere in the literature. In our experiments, our adaptive learning rate shows a smoother and faster convergence than that exhibited by the alternatives, with better or comparable performance.
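The two-point secant idea behind the Barzilai-Borwein (BB) step size can be sketched on a small deterministic problem. With $s = x_k - x_{k-1}$ and $y = g_k - g_{k-1}$, the classical "long" BB step is $\alpha_k = s^\top s / s^\top y$. The code below is a minimal illustration of that textbook rule on a convex quadratic, not the paper's method for deep networks: the function name `bb_gradient_descent`, the initial step `alpha0`, and the stopping tolerance are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def bb_gradient_descent(grad, x0, alpha0=0.1, iters=200, tol=1e-12):
    """Gradient descent with the classical Barzilai-Borwein step size.

    The first step uses a fixed step alpha0 (an assumption); afterwards
    alpha = (s.s) / (s.y), with s the change in x and y the change in gradient.
    """
    x_prev = list(x0)
    g_prev = grad(x_prev)
    x = [xi - alpha0 * gi for xi, gi in zip(x_prev, g_prev)]
    for _ in range(iters):
        g = grad(x)
        s = [a - b for a, b in zip(x, x_prev)]
        y = [a - b for a, b in zip(g, g_prev)]
        sy = dot(s, y)
        if sy <= tol:  # converged, or no usable curvature information
            break
        alpha = dot(s, s) / sy  # two-point secant approximation
        x_prev, g_prev = x, g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x


# Toy problem: f(x) = 0.5 * sum(a_i * x_i^2) - sum(b_i * x_i), minimized at b_i / a_i.
a = [1.0, 10.0]
b = [1.0, 10.0]


def quadratic_grad(x):
    return [ai * xi - bi for ai, xi, bi in zip(a, x, b)]


x_star = bb_gradient_descent(quadratic_grad, [0.0, 0.0])
```

In a stochastic training loop the differences `s` and `y` are computed from noisy gradients, so practical variants typically need safeguards such as clipping or averaging of the step size; those details are beyond this sketch.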
    A Unified Analysis of Federated Learning with Arbitrary Client Participation. (arXiv:2205.13648v1 [cs.LG])
    Federated learning (FL) faces challenges of intermittent client availability and computation/communication efficiency. As a result, only a small subset of clients can participate in FL at a given time. It is important to understand how partial client participation affects convergence, but most existing works have either considered idealized participation patterns or obtained results with non-zero optimality error for generic patterns. In this paper, we provide a unified convergence analysis for FL with arbitrary client participation. We first introduce a generalized version of federated averaging (FedAvg) that amplifies parameter updates at an interval of multiple FL rounds. Then, we present a novel analysis that captures the effect of client participation in a single term. By analyzing this term, we obtain convergence upper bounds for a wide range of participation patterns, including both non-stochastic and stochastic cases, which match either the lower bound of stochastic gradient descent (SGD) or the state-of-the-art results in specific settings. We also discuss various insights, recommendations, and experimental results.
    Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures. (arXiv:2205.13647v1 [cs.LG])
    This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
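For readers unfamiliar with the quantity, the Boolean influence of coordinate $i$ on $f\colon\{-1,1\}^n\to\{-1,1\}$ is, in its standard definition, the probability over a uniform input that flipping bit $i$ changes the output. The sketch below computes it by exhaustive enumeration. It assumes this standard definition and uses hypothetical helper names (`boolean_influence`, `parity`, `majority3`); it does not reproduce the paper's generalization-error analysis.

```python
from itertools import product


def boolean_influence(f, n, i):
    """Fraction of inputs x in {-1, 1}^n where flipping coordinate i changes f(x)."""
    changed = 0
    for x in product((-1, 1), repeat=n):
        flipped = x[:i] + (-x[i],) + x[i + 1:]
        if f(x) != f(flipped):
            changed += 1
    return changed / 2 ** n


def parity(x):
    """Product of all bits: every coordinate is always pivotal."""
    result = 1
    for v in x:
        result *= v
    return result


def majority3(x):
    """Majority vote of three bits: a bit is pivotal only when the other two disagree."""
    return 1 if sum(x) > 0 else -1


inf_parity = [boolean_influence(parity, 3, i) for i in range(3)]
inf_majority = [boolean_influence(majority3, 3, i) for i in range(3)]
```

Exhaustive enumeration costs $2^n$ evaluations, so it is only practical for small $n$; for longer inputs the same quantity is usually estimated by sampling.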
    fakeWeather: Adversarial Attacks for Deep Neural Networks Emulating Weather Conditions on the Camera Lens of Autonomous Systems. (arXiv:2205.13807v1 [cs.LG])
    Recently, Deep Neural Networks (DNNs) have achieved remarkable performances in many applications, while several studies have enhanced their vulnerabilities to malicious attacks. In this paper, we emulate the effects of natural weather conditions to introduce plausible perturbations that mislead the DNNs. By observing the effects of such atmospheric perturbations on the camera lenses, we model the patterns to create different masks that fake the effects of rain, snow, and hail. Even though the perturbations introduced by our attacks are visible, their presence remains unnoticed due to their association with natural events, which can be especially catastrophic for fully-autonomous and unmanned vehicles. We test our proposed fakeWeather attacks on multiple Convolutional Neural Network and Capsule Network models, and report noticeable accuracy drops in the presence of such adversarial perturbations. Our work introduces a new security threat for DNNs, which is especially severe for safety-critical applications and autonomous systems.
    A Hybrid Neural Autoencoder for Sensory Neuroprostheses and Its Applications in Bionic Vision. (arXiv:2205.13623v1 [cs.LG])
    Sensory neuroprostheses are emerging as a promising technology to restore lost sensory function or augment human capacities. However, sensations elicited by current devices often appear artificial and distorted. Although current models can often predict the neural or perceptual response to an electrical stimulus, an optimal stimulation strategy solves the inverse problem: what is the required stimulus to produce a desired response? Here we frame this as an end-to-end optimization problem, where a deep neural network encoder is trained to invert a known, fixed forward model that approximates the underlying biological system. As a proof of concept, we demonstrate the effectiveness of our hybrid neural autoencoder (HNA) on the use case of visual neuroprostheses. We found that HNA is able to produce high-fidelity stimuli from the MNIST and COCO datasets that outperform conventional encoding strategies and surrogate techniques across all tested conditions. Overall this is an important step towards the long-standing challenge of restoring high-quality vision to people living with incurable blindness and may prove a promising solution for a variety of neuroprosthetic technologies.  ( 2 min )
    Error Bound of Empirical $\ell_2$ Risk Minimization for Noisy Standard and Generalized Phase Retrieval Problems. (arXiv:2205.13827v1 [stat.ML])
    A noisy generalized phase retrieval (NGPR) problem refers to the problem of estimating $x_0 \in \mathbb{C}^d$ from noisy quadratic samples $\big\{x_0^*A_kx_0+\eta_k\big\}_{k=1}^n$, where $A_k$ is a Hermitian matrix and $\eta_k$ is a noise scalar. When $A_k=\alpha_k\alpha_k^*$ for some $\alpha_k\in\mathbb{C}^d$, it reduces to a standard noisy phase retrieval (NPR) problem. The main aim of this paper is to study the estimation performance of empirical $\ell_2$ risk minimization in both problems when $A_k$ in NGPR, or $\alpha_k$ in NPR, is drawn from a sub-Gaussian distribution. Under different kinds of noise patterns, we establish error bounds that imply approximate reconstruction; these results are new in the literature. In NGPR, we show the bounds are $O\big(\frac{\|\eta\|}{\sqrt{n}}\big)$ and $O\big(\|\eta\|_\infty \sqrt{\frac{d}{n}}\big)$ for general noise, and $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for random noise with sub-Gaussian and sub-exponential tails respectively, where $\|\eta\|$ and $\|\eta\|_{\infty}$ are the 2-norm and sup-norm of the noise vector $\eta = (\eta_k)_{k=1}^n$. Under heavy-tailed noise, by truncating response outliers we propose a robust estimator that possesses an error bound with a slower convergence rate. On the other hand, in NPR we obtain bounds of $O\big(\sqrt{\frac{d\log n}{n}}\big)$ and $O\big(\sqrt{\frac{d(\log n)^2}{n}}\big)$ for sub-Gaussian and sub-exponential noise respectively, which are essentially tighter than the existing bound $O\big(\frac{\|\eta\|_2}{\sqrt{n}}\big)$. Although NGPR involving measurement matrix $A_k$ is more computationally demanding than NPR involving measurement vector $\alpha_k$, our results reveal that NGPR exhibits stronger robustness than NPR under biased and deterministic noise. Experimental results are presented to confirm and demonstrate our theoretical findings.
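    For orientation, the empirical $\ell_2$ risk minimization estimator referred to above can be written roughly as follows. This is a sketch inferred from the abstract's notation only: the symbol $y_k$ for the observed samples and the $1/n$ normalization are assumptions, and the paper's exact formulation may differ.

```latex
\hat{x} \in \arg\min_{x \in \mathbb{C}^d} \; \frac{1}{n} \sum_{k=1}^{n} \big| x^* A_k x - y_k \big|^2,
\qquad y_k = x_0^* A_k x_0 + \eta_k .
```

    Setting $A_k = \alpha_k \alpha_k^*$ turns $x^* A_k x$ into $|\alpha_k^* x|^2$, which gives the corresponding NPR objective.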
    TraClets: Harnessing the power of computer vision for trajectory classification. (arXiv:2205.13880v1 [cs.CV])
    Due to the advent of new mobile devices and tracking sensors in recent years, huge amounts of data are being produced every day. Therefore, novel methodologies need to emerge that dive through this vast sea of information and generate insights and meaningful information. To this end, researchers have developed several trajectory classification algorithms over the years that are able to annotate tracking data. Similarly, in this research, a novel methodology is presented that exploits image representations of trajectories, called TraClets, in order to classify trajectories in an intuitive, human way through computer vision techniques. Several real-world datasets are used to evaluate the proposed approach and compare its classification performance to other state-of-the-art trajectory classification algorithms. Experimental results demonstrate that TraClets achieves a classification performance that is comparable to or, in most cases, better than the state of the art, acting as a universal, high-accuracy approach for trajectory classification.
    Explaining Preferences with Shapley Values. (arXiv:2205.13662v1 [stat.ML])
    While preference modelling is becoming one of the pillars of machine learning, the problem of preference explanation remains challenging and underexplored. In this paper, we propose Pref-SHAP, a Shapley value-based model explanation framework for pairwise comparison data. We derive the appropriate value functions for preference models and further extend the framework to model and explain context-specific information, such as the surface type in a tennis game. To demonstrate the utility of Pref-SHAP, we apply our method to a variety of synthetic and real-world datasets and show that richer and more insightful explanations can be obtained over the baseline.  ( 2 min )
    FedAvg with Fine Tuning: Local Updates Lead to Representation Learning. (arXiv:2205.13692v1 [cs.LG])
    The Federated Averaging (FedAvg) algorithm, which alternates between a few local stochastic gradient updates at client nodes and a model averaging update at the server, is perhaps the most commonly used method in Federated Learning. Notwithstanding its simplicity, several empirical studies have illustrated that the output model of FedAvg, after a few fine-tuning steps, generalizes well to new unseen tasks. This surprising performance of such a simple method, however, is not fully understood from a theoretical point of view. In this paper, we formally investigate this phenomenon in the multi-task linear representation setting. We show that the reason behind the generalizability of FedAvg's output is its power in learning the common data representation among the clients' tasks, by leveraging the diversity among client data distributions via local updates. We formally establish the iteration complexity required by the clients for proving such a result in the setting where the underlying shared representation is a linear map. To the best of our knowledge, this is the first such result for any setting. We also provide empirical evidence demonstrating FedAvg's representation learning ability in federated image classification with heterogeneous data.
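    The alternation described above (a few local stochastic gradient steps per client, then averaging at the server) can be sketched in a few lines. This is a toy illustration only, not the paper's setting: the model is a single scalar weight fitted by least squares, and the function names, learning rate, and client data are all hypothetical choices made for the example.

```python
import random


def local_sgd(w, data, lr, steps):
    """Run a few SGD steps on one client's (x, y) pairs, starting from the server model w."""
    for _ in range(steps):
        x, y = random.choice(data)
        grad = 2.0 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
        w -= lr * grad
    return w


def fedavg(clients, rounds=50, lr=0.05, local_steps=5, seed=0):
    """Alternate local updates on every client with a plain average at the server."""
    random.seed(seed)
    w = 0.0
    for _ in range(rounds):
        local_models = [local_sgd(w, data, lr, local_steps) for data in clients]
        w = sum(local_models) / len(local_models)  # server averaging step
    return w


# Hypothetical toy data: both clients follow y = 3x but observe different inputs.
clients = [
    [(x, 3.0 * x) for x in (0.5, 1.0, 1.5)],
    [(x, 3.0 * x) for x in (-1.0, 2.0)],
]
w_global = fedavg(clients)
print(w_global)
```

    The sketch shows only the communication pattern; the representation-learning behaviour analysed in the paper concerns multi-dimensional models with a shared linear map, which this scalar example does not capture.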
    Efficient Approximation of Gromov-Wasserstein Distance using Importance Sparsification. (arXiv:2205.13573v1 [cs.LG])
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown potential for matching problems on structured data like point clouds and graphs. However, its application in practice is limited due to its high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called Spar-GW, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. We demonstrate that the proposed Spar-GW method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(n^{2+\delta})$ for an arbitrarily small $\delta>0$. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our Spar-GW to state-of-the-art methods in both synthetic and real-world tasks.  ( 2 min )
    Learning in Feedback-driven Recurrent Spiking Neural Networks using full-FORCE Training. (arXiv:2205.13585v1 [cs.AI])
    Feedback-driven recurrent spiking neural networks (RSNNs) are powerful computational models that can mimic dynamical systems. However, the presence of a feedback loop from the readout to the recurrent layer destabilizes the learning mechanism and prevents it from converging. Here, we propose a supervised training procedure for RSNNs, where a second network is introduced only during training to provide hints for the target dynamics. The proposed training procedure consists of generating targets for both recurrent and readout layers (i.e., for a full RSNN system). It uses the recursive least squares-based First-Order and Reduced Control Error (FORCE) algorithm to fit the activity of each layer to its target. The proposed full-FORCE training procedure reduces the amount of modifications needed to keep the error between the output and target close to zero. These modifications control the feedback loop, which causes the training to converge. We demonstrate the improved performance and noise robustness of the proposed full-FORCE training procedure to model 8 dynamical systems using RSNNs with leaky integrate and fire (LIF) neurons and rate coding. For energy-efficient hardware implementation, an alternative time-to-first-spike (TTFS) coding is implemented for the full-FORCE training procedure. Compared to rate coding, full-FORCE with TTFS coding generates fewer spikes and facilitates faster convergence to the target dynamics.  ( 2 min )
    Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment. (arXiv:2205.13566v1 [cs.LG])
    Multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where "A" stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not like the previous item. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results confirm our theoretical analysis and show that the proposed algorithms have significantly lower regrets than the traditional UCB and KL-UCB, and Q-learning-based algorithms.  ( 2 min )
    Evolution of beliefs in social networks. (arXiv:2205.13587v1 [cs.LG])
    The evolution of a society's beliefs is a product of interactions between people in the society (horizontal transmission) over generations (vertical transmission). Researchers have studied horizontal and vertical transmission separately. Extending prior work, we propose a new theoretical framework which allows application of tools from Markov chain theory to the analysis of belief evolution via horizontal and vertical transmission. We analyze three cases: static network, randomly changing network, and homophily-based dynamic network. Whereas the former two assume network structure is independent of beliefs, the latter assumes that people tend to communicate with those who have similar beliefs. We prove under general conditions that both static and randomly changing networks converge to a single set of beliefs among all individuals, and we characterize the rate of convergence. We prove that homophily-based network structures do not in general converge to a single set of beliefs shared by all and prove lower bounds on the number of different limiting beliefs as a function of initial beliefs. We conclude by discussing implications for prior theories and directions for future work.  ( 2 min )
    Learning Dialogue Representations from Consecutive Utterances. (arXiv:2205.13568v1 [cs.CL])
    Learning high-quality dialogue representations is essential for solving a variety of dialogue-oriented tasks, especially considering that dialogue systems often suffer from data scarcity. In this paper, we introduce Dialogue Sentence Embedding (DSE), a self-supervised contrastive learning method that learns effective dialogue representations suitable for a wide range of dialogue tasks. DSE learns from dialogues by taking consecutive utterances of the same dialogue as positive pairs for contrastive learning. Despite its simplicity, DSE achieves significantly better representation capability than other dialogue representation and universal sentence representation models. We evaluate DSE on five downstream dialogue tasks that examine dialogue representation at different semantic granularities. Experiments in few-shot and zero-shot settings show that DSE outperforms baselines by a large margin. For example, it achieves a 13% average performance improvement over the strongest unsupervised baseline in 1-shot intent classification on 6 datasets. We also provide analyses on the benefits and limitations of our model.  ( 2 min )
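    The pair-construction step described above, where consecutive utterances of the same dialogue serve as positive pairs, can be illustrated as follows. This is a minimal sketch with a hypothetical helper name and a made-up dialogue; it covers only how the positives are formed, not the encoder or the contrastive loss used by DSE.

```python
def consecutive_pairs(dialogue):
    """Return (utterance_i, utterance_{i+1}) positive pairs from a single dialogue."""
    return list(zip(dialogue, dialogue[1:]))


# Hypothetical three-turn dialogue used purely for illustration.
dialogue = [
    "Hi, I need to book a flight.",
    "Sure, where would you like to go?",
    "Boston, please.",
]
pairs = consecutive_pairs(dialogue)
for first, second in pairs:
    print(first, "->", second)
```

    A dialogue with n utterances yields n - 1 positive pairs, and pairs are never formed across different dialogues. In a typical contrastive setup the other utterances in a batch would act as negatives, though that detail is an assumption here rather than something the abstract states.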
    Training and Inference on Any-Order Autoregressive Models the Right Way. (arXiv:2205.13554v1 [cs.LG])
    Conditional inference on arbitrary subsets of variables is a core problem in probabilistic inference with important applications such as masked language modeling and image inpainting. In recent years, the family of Any-Order Autoregressive Models (AO-ARMs) -- which includes popular models such as XLNet -- has shown breakthrough performance in arbitrary conditional tasks across a sweeping range of domains. But, in spite of their success, in this paper we identify significant improvements to be made to previous formulations of AO-ARMs. First, we show that AO-ARMs suffer from redundancy in their probabilistic model, i.e., they define the same distribution in multiple different ways. We alleviate this redundancy by training on a smaller set of univariate conditionals that still maintains support for efficient arbitrary conditional inference. Second, we upweight the training loss for univariate conditionals that are evaluated more frequently during inference. Our method leads to improved performance with no compromises on tractability, giving state-of-the-art likelihoods in arbitrary conditional modeling on text (Text8), image (CIFAR10, ImageNet32), and continuous tabular data domains.  ( 2 min )
    Learning black- and gray-box chemotactic PDEs/closures from agent based Monte Carlo simulation data. (arXiv:2205.13545v1 [q-bio.QM])
    We propose a machine learning framework for the data-driven discovery of macroscopic chemotactic Partial Differential Equations (PDEs) -- and the closures that lead to them -- from high-fidelity, individual-based stochastic simulations of E. coli bacterial motility. The fine scale, detailed, hybrid (continuum - Monte Carlo) simulation model embodies the underlying biophysics, and its parameters are informed from experimental observations of individual cells. We exploit Automatic Relevance Determination (ARD) within a Gaussian Process framework for the identification of a parsimonious set of collective observables that parametrize the law of the effective PDEs. Using these observables, in a second step we learn effective, coarse-grained "Keller-Segel class" chemotactic PDEs using machine learning regressors: (a) (shallow) feedforward neural networks and (b) Gaussian Processes. The learned laws can be black-box (when no prior knowledge about the PDE law structure is assumed) or gray-box (when parts of the equation, e.g. the pure diffusion part, are known and "hardwired" in the regression process). We also discuss data-driven corrections (both additive and functional) of analytically known, approximate closures.  ( 2 min )
    Dynamic Network Reconfiguration for Entropy Maximization using Deep Reinforcement Learning. (arXiv:2205.13578v1 [cs.LG])
    A key problem in network theory is how to reconfigure a graph in order to optimize a quantifiable objective. Given the ubiquity of networked systems, such work has broad practical applications in a variety of situations, ranging from drug and material design to telecommunications. The large decision space of possible reconfigurations, however, makes this problem computationally intensive. In this paper, we cast the problem of network rewiring for optimizing a specified structural property as a Markov Decision Process (MDP), in which a decision-maker is given a budget of modifications that are performed sequentially. We then propose a general approach based on the Deep Q-Network (DQN) algorithm and graph neural networks (GNNs) that can efficiently learn strategies for rewiring networks. We then discuss a cybersecurity case study, i.e., an application to the computer network reconfiguration problem for intrusion protection. In a typical scenario, an attacker might have a (partial) map of the system they plan to penetrate; if the network is effectively "scrambled", they would not be able to navigate it since their prior knowledge would become obsolete. This can be viewed as an entropy maximization problem, in which the goal is to increase the surprise of the network. Indeed, entropy acts as a proxy measurement of the difficulty of navigating the network topology. We demonstrate the general ability of the proposed method to obtain better entropy gains than random rewiring on synthetic and real-world graphs while being computationally inexpensive, as well as being able to generalize to larger graphs than those seen during training. Simulations of attack scenarios confirm the effectiveness of the learned rewiring strategies.  ( 2 min )
    Differentially Private Decoding in Large Language Models. (arXiv:2205.13621v1 [cs.CL])
    Recent large-scale natural language processing (NLP) systems use a pre-trained Large Language Model (LLM) on massive and diverse corpora as a headstart. In practice, the pre-trained model is adapted to a wide array of tasks via fine-tuning on task-specific datasets. LLMs, while effective, have been shown to memorize instances of training data thereby potentially revealing private information processed during pre-training. The potential leakage might further propagate to the downstream tasks for which LLMs are fine-tuned. On the other hand, privacy-preserving algorithms usually involve retraining from scratch, which is prohibitively expensive for LLMs. In this work, we propose a simple, easy to interpret, and computationally lightweight perturbation mechanism to be applied to an already trained model at the decoding stage. Our perturbation mechanism is model-agnostic and can be used in conjunction with any LLM. We provide theoretical analysis showing that the proposed mechanism is differentially private, and experimental results showing a privacy-utility trade-off.  ( 2 min )
    Tensor Program Optimization with Probabilistic Programs. (arXiv:2205.13603v1 [cs.LG])
    Automatic optimization for tensor programs becomes increasingly important as we deploy deep learning in various environments, and efficient optimization relies on a rich search space and effective search. Most existing efforts adopt a search space which lacks the ability to efficiently enable domain experts to grow the search space. This paper introduces MetaSchedule, a domain-specific probabilistic programming language abstraction to construct a rich search space of tensor programs. Our abstraction allows domain experts to analyze the program, and easily propose stochastic choices in a modular way to compose program transformation accordingly. We also build an end-to-end learning-driven framework to find an optimized program for a given search space. Experimental results show that MetaSchedule can cover the search space used in the state-of-the-art tensor program optimization frameworks in a modular way. Additionally, it empowers domain experts to conveniently grow the search space and modularly enhance the system, which brings 48% speedup on end-to-end deep learning workloads.  ( 2 min )
    Denial-of-Service Attack on Object Detection Model Using Universal Adversarial Perturbation. (arXiv:2205.13618v1 [cs.CV])
    Adversarial attacks against deep learning-based object detectors have been studied extensively in the past few years. The proposed attacks aimed solely at compromising the models' integrity (i.e., trustworthiness of the model's prediction), while adversarial attacks targeting the models' availability, a critical aspect in safety-critical domains such as autonomous driving, have not been explored by the machine learning research community. In this paper, we propose NMS-Sponge, a novel approach that negatively affects the decision latency of YOLO, a state-of-the-art object detector, and compromises the model's availability by applying a universal adversarial perturbation (UAP). In our experiments, we demonstrate that the proposed UAP is able to increase the processing time of individual frames by adding "phantom" objects while preserving the detection of the original objects.  ( 2 min )
    Fairness in Recommendation: A Survey. (arXiv:2205.13619v1 [cs.IR])
    As one of the most pervasive applications of machine learning, recommender systems play an important role in assisting human decision making. The satisfaction of users and the interests of platforms are closely related to the quality of the generated recommendation results. However, as a highly data-driven system, a recommender system can be affected by data or algorithmic bias and thus generate unfair results, which could weaken users' reliance on the system. As a result, it is crucial to address the potential unfairness problems in recommendation settings. Recently, there has been growing attention on fairness considerations in recommender systems, with more and more literature on approaches to promote fairness in recommendation. However, the studies are rather fragmented and lack a systematic organization, making the domain difficult for new researchers to enter. This motivates us to provide a systematic survey of existing works on fairness in recommendation. This survey focuses on the foundations of the fairness in recommendation literature. It first presents a brief introduction to fairness in basic machine learning tasks such as classification and ranking in order to provide a general overview of fairness research, and introduces the more complex situations and challenges that need to be considered when studying fairness in recommender systems. After that, the survey introduces fairness in recommendation with a focus on the taxonomies of current fairness definitions, the typical techniques for improving fairness, and the datasets for fairness studies in recommendation. The survey also discusses the challenges and opportunities in fairness research with the hope of promoting the fair recommendation research area and beyond.  ( 2 min )
    Circumventing Backdoor Defenses That Are Based on Latent Separability. (arXiv:2205.13613v1 [cs.LG])
    Deep learning models are vulnerable to backdoor poisoning attacks. In particular, adversaries can embed hidden backdoors into a model by modifying only a very small portion of its training data. On the other hand, it has also been commonly observed that backdoor poisoning attacks tend to leave a tangible signature in the latent space of the backdoored model, i.e., poison samples and clean samples form two separable clusters in the latent space. These observations have given rise to the popularity of the latent separability assumption, which states that backdoored DNN models will learn separable latent representations for poison and clean populations. A number of popular defenses (e.g. Spectral Signature, Activation Clustering, SCAn, etc.) are built exactly upon this assumption. However, in this paper, we show that the latent separation can be significantly suppressed by designing adaptive backdoor poisoning attacks with more sophisticated poison strategies, which consequently render state-of-the-art defenses based on this assumption less effective (and often cause them to fail completely). More interestingly, we find that our adaptive attacks can even evade some other typical backdoor defenses that do not explicitly build on this separability assumption. Our results show that adaptive backdoor poisoning attacks that can breach the latent separability assumption should be seriously considered for evaluating existing and future defenses.  ( 2 min )
    Self-supervised Pretraining and Transfer Learning Enable Flu and COVID-19 Predictions in Small Mobile Sensing Datasets. (arXiv:2205.13607v1 [cs.LG])
    Detailed mobile sensing data from phones, watches, and fitness trackers offer an unparalleled opportunity to quantify and act upon previously unmeasurable behavioral changes in order to improve individual health and accelerate responses to emerging diseases. Unlike in natural language processing and computer vision, deep representation learning has yet to broadly impact this domain, in which the vast majority of research and clinical applications still rely on manually defined features and boosted tree models or even forgo predictive modeling altogether due to insufficient accuracy. This is due to unique challenges in the behavioral health domain, including very small datasets (~10^1 participants), which frequently contain missing data, consist of long time series with critical long-range dependencies (length>10^4), and extreme class imbalances (>10^3:1).  ( 2 min )
    Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes. (arXiv:2205.13589v1 [cs.LG])
    We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.  ( 2 min )
    Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. (arXiv:2205.13571v1 [cs.LG])
    Neural networks have achieved tremendous success in a large variety of applications. However, their memory footprint and computational demand can render them impractical in application settings with limited hardware or energy resources. In this work, we propose a novel algorithm to find efficient low-rank subnetworks. Remarkably, these subnetworks are determined and adapted already during the training phase, and the overall time and memory resources required for both training and evaluating them are significantly reduced. The main idea is to restrict the weight matrices to a low-rank manifold and to update the low-rank factors rather than the full matrix during training. To derive training updates that are restricted to the prescribed manifold, we employ techniques from dynamic model order reduction for matrix differential equations. Moreover, our method automatically and dynamically adapts the ranks during training to achieve a desired approximation accuracy. The efficiency of the proposed method is demonstrated through a variety of numerical experiments on fully-connected and convolutional networks.  ( 2 min )
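    To see why storing and updating low-rank factors instead of the full weight matrix saves memory and compute, consider the following back-of-the-envelope sketch. It assumes a generic two-factor parametrization W ≈ U Vᵀ with U of shape m×r and V of shape n×r; this form, the function names, and the layer sizes are illustrative assumptions, and the factorization and training scheme used in the paper may differ.

```python
def dense_params(m, n):
    """Number of entries in a full m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Number of entries in factors U (m x r) and V (n x r) of W ~ U V^T."""
    return r * (m + n)


def low_rank_matvec(U, V, x):
    """Compute (U V^T) x without forming the m x n product: first t = V^T x, then U t."""
    r = len(U[0])
    t = [sum(V[j][k] * x[j] for j in range(len(x))) for k in range(r)]
    return [sum(row[k] * t[k] for k in range(r)) for row in U]


# Hypothetical layer size and rank, chosen only for the example.
m, n, r = 512, 256, 16
print(dense_params(m, n), low_rank_params(m, n, r))

# Tiny rank-1 check: U V^T = [[1, 1, 1], [2, 2, 2]].
y = low_rank_matvec([[1.0], [2.0]], [[1.0], [1.0], [1.0]], [1.0, 2.0, 3.0])
print(y)
```

    The factored matrix-vector product costs on the order of r(m + n) multiplications instead of mn, so the saving grows as the rank r shrinks relative to the layer dimensions.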
    Pruning has a disparate impact on model accuracy. (arXiv:2205.13574v1 [cs.LG])
    Network pruning is a widely used compression technique that is able to significantly scale down overparameterized models with minimal loss of accuracy. This paper shows that pruning may create or exacerbate disparate impacts. The paper sheds light on the factors that cause such disparities, suggesting that differences in gradient norms and distance to the decision boundary across groups are responsible for this critical issue. It analyzes these factors in detail, providing both theoretical and empirical support, and proposes a simple yet effective solution that mitigates the disparate impacts caused by pruning.  ( 2 min )
    Understanding new tasks through the lens of training data via exponential tilting. (arXiv:2205.13577v1 [cs.LG])
    Deploying machine learning models to new tasks is a major challenge despite the large size of modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighting the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn training data importance weights by minimizing the KL divergence between labeled training and unlabeled target datasets. The learned training data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on the Waterbirds and Breeds benchmarks.  ( 2 min )
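    As a rough illustration of what exponentially tilted importance weights look like, the sketch below assigns each training sample a weight proportional to exp(θ · T(x)) and normalizes the weights to sum to one. The feature map T, the parameter θ, and the function name are hypothetical; in the paper θ is learned by minimizing a KL divergence against unlabeled target data, which is not shown here.

```python
import math


def tilt_weights(features, theta):
    """Self-normalized weights w_i proportional to exp(theta . T(x_i)) over the training samples."""
    scores = [sum(t * f for t, f in zip(theta, feat)) for feat in features]
    top = max(scores)  # subtract the max score for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical one-dimensional sufficient statistics T(x) for two training samples.
features = [[0.0], [1.0]]
theta = [math.log(3.0)]  # assumed tilt parameter, chosen so the second sample gets 3x the weight
weights = tilt_weights(features, theta)
print(weights)
```

    With θ = 0 the weights are uniform, which corresponds to no distribution shift between training and target data under this model.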
    Unequal Covariance Awareness for Fisher Discriminant Analysis and Its Variants in Classification. (arXiv:2205.13565v1 [cs.LG])
    Fisher Discriminant Analysis (FDA) is one of the essential tools for feature extraction and classification. In addition, it motivates the development of many improved techniques based on the FDA to adapt to different problems or data types. However, none of these approaches make use of the fact that the assumption of equal covariance matrices in FDA is usually not satisfied in practical situations. Therefore, we propose a novel classification rule for the FDA that accounts for this fact, mitigating the effect of unequal covariance matrices in the FDA. Furthermore, since we only modify the classification rule, the same can be applied to many FDA variants, improving these algorithms further. Theoretical analysis reveals that the new classification rule allows the implicit use of the class covariance matrices while increasing the number of parameters to be estimated by a small amount compared to going from FDA to Quadratic Discriminant Analysis. We illustrate our idea via experiments, which show the superior performance of the modified algorithms based on our new classification rule compared to the original ones.  ( 2 min )
    Predictor-corrector algorithms for stochastic optimization under gradual distribution shift. (arXiv:2205.13575v1 [cs.LG])
    Time-varying stochastic optimization problems frequently arise in machine learning practice (e.g. gradual domain shift, object tracking, strategic classification). Although most problems are solved in discrete time, the underlying process is often continuous in nature. We exploit this underlying continuity by developing predictor-corrector algorithms for time-varying stochastic optimization. We provide error bounds for the iterates, both with exact and with noisy access to queries of the relevant derivatives of the loss function. Furthermore, we show (theoretically and empirically in several examples) that our method outperforms non-predictor-corrector methods that do not exploit the underlying continuous process.  ( 2 min )

  • Open

    Update on previous post, this actually scared me
    submitted by /u/Varitiuss29 [link] [comments]
    Translate a pdf using gpt3
    submitted by /u/Varitiuss29 [link] [comments]
    Baidu AI Researchers Introduce SE-MoE That Proposes Elastic MoE Training With 2D Prefetch And Fusion Communication Over Hierarchical Storage
    Machine learning and deep learning have gained popularity in domains, like computer vision (CV) and natural language processing (NLP), which require analyzing large amounts of data such as images and text. As a result, many computational resources are needed for data processing. Thus, to address the above concern, sparsely activated neural networks based on Mixture-of-Experts (MoE) have been utilized for training the larger models with low or no supplementary computational resources while achieving improved training results. Accordingly, to overcome the challenges faced by the MoE, this paper proposes an innovative amalgamated framework for MoE training and inference. The paper’s significant contribution includes a novel SE-MoE, a distributed system capable of scaling MoE models to trillions of parameters and completely exploiting the clusters, including High Bandwidth Memory, CPU memory, and SSDs in achieving effective training scheduling. Dynamic graph scheduling utilizes an innovative inference approach based on ring memory to overlap computation and communication as much as feasible, resulting in more efficient inference performance without extra machines for larger-scale MoE models. Additionally, various methods like load balancing are utilized by the SE-MoE to advance the performance without any additional resources. Continue reading | Check out the paper, and Github submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Software’s Job
    Hi group. Is it possible to use an image of a form to fill in the same form in a web application? That is, the software would have to recognize the form’s controls and then, through an API or something similar, fill in the matching controls in the web app. I'm just looking for feedback on whether this is complicated. Thanks, and sorry for my English. submitted by /u/ScallionFunny3560 [link] [comments]  ( 1 min )
    Recursively summarize text of any length with GPT-3
    submitted by /u/DavidKShapiro [link] [comments]
    Doctor Who - 4K Neural Art Exploration
    submitted by /u/MLInsights [link] [comments]
    Self Taught AI engineers
    Is it possible to learn AI purely through self-study? What are the best online courses and resources for mastering AI? submitted by /u/Titan_D [link] [comments]  ( 1 min )
    Hey there. I, eleventacion/very odd will be posting this AI generated (text to speech) track on my YouTube channel very soon. It's not the best you will hear but it will be pretty pleasing. Love you all.
    submitted by /u/eleventacion [link] [comments]  ( 1 min )
    Google bans deepfake training in Colab
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    Giant Squid - 4K Neural Art Visualization
    submitted by /u/MLInsights [link] [comments]
    Knightian uncertainty
    Knightian uncertainty is basically uncertainty that we can't quantify. For me that seems to be one of the topics where artificial intelligence doesn't make much sense conceptually. But clearly a strong majority in the AI field disagrees that AI has limits in that way. How - in the context of AI - do you conceptually make sense of unquantifiable uncertainty, or related topics that involve unquantifiability and non-representability such as spaciousness (which is more allusive or loose as a concept rather than representing anything per se). If you have meditated before you might also have discovered that experience of spaciousness can have profound impact on the way we act in the world, so I don't think it can be ignored as far as behaviour goes either. It seems if there is such a thing as general and conscious AI those are some foundational issues to make sense of in order to get there. After all experience of space and uncertainty are basically core aspects of our being in the world. submitted by /u/bejaq [link] [comments]  ( 3 min )
  • Open

    Benefits of Power BI for Small Businesses
    There is no way one can deny the importance of data in the business scene. Companies of all types and sizes are using mammoth amounts of data on a daily basis. They need to collect data from various sources and then compile and analyze it using specialized applications. This is why BI and data visualization… Read More »Benefits of Power BI for Small Businesses The post Benefits of Power BI for Small Businesses appeared first on Data Science Central.  ( 4 min )
    Best Practices for Implementing Breadcrumb SEO Strategy for Mobiles
    The design and position of breadcrumb navigation on a webpage is typical and has become an established practice for a long time. However, as the world shifts to a mobile-first web environment, many website designers are getting it wrong or forgetting to include it in their navigation. Doing this can be a blunder because it… Read More »Best Practices for Implementing Breadcrumb SEO Strategy for Mobiles The post Best Practices for Implementing Breadcrumb SEO Strategy for Mobiles appeared first on Data Science Central.  ( 5 min )
  • Open

    [P] Extract formula from ONNX file possible?
    At the end of building a NN, there is a formula. It's a mathematical formula that can be evaluated for given inputs, depending on the trained weights. Is it possible to "extract" or "read" that formula from an ONNX file? When I look at it in a text editor, it's not human-readable. But it should be straight-forward to evaluate for given inputs, that's how inference works: just evaluate the formula. How can I actually *get* that formula? For a not-so-deep NN that mathematical expression should be pretty easy. Thanks! submitted by /u/CantFixMoronic [link] [comments]  ( 1 min )
    [D] Gwern’s Retrospective on the 2 Year Anniversary of GPT3 Release
    submitted by /u/programmerChilli [link] [comments]
    [D] Any state of the art paper close to Google's paper "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers"?
    I am working on this paper and was wondering if there's any other paper that has used the same approaches vis-a-vis image-text retrieval. This paper uses a combination of Transformers with encoders for image retrieval. I can't find the code implementation of this paper anywhere and was wondering maybe a similar paper will help in that regard. submitted by /u/icelebratefestivus [link] [comments]  ( 1 min )
    [N] American Express Default Prediction ML Competition
    https://www.kaggle.com/competitions/amex-default-prediction/overview submitted by /u/EducationalCicada [link] [comments]
  • Open

    How do you limit the high frequency agent actions when dealing with continuous control?
    I am tuning an SAC agent for a robotics control task. The action space of the agent is a single dimensional decision in [-1, 1]. I see that very often the agent takes advantage of the fact that the action can be varied with a very high frequency, basically filling up the plot. I've already implemented an incremental version of the agent, where it actually controls a derivative of the control action and the actual action is part of the observation space, which helps a lot with the realism of the robotics problem. Now the problem has been sort of moved one time-derivative lower and the high frequency content of the action is the rate of change of the control input. Is there a way to do some reward-shaping or some other method to prevent this? I've also tried just straight up adding a penalty term to the absolute value of the action but it comes with degraded performance. submitted by /u/Speterius [link] [comments]  ( 1 min )
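    A minimal sketch of the incremental-action setup described in the post above, together with one possible shaping term. The names `IncrementalAction`, `shaped_reward`, `max_rate`, and `weight` are illustrative assumptions and do not come from the post. Penalising the change between consecutive commands (rather than their absolute value) is one common option, not the poster's method, and the weight would need tuning.

```python
class IncrementalAction:
    """Integrates a rate command into a bounded control action."""

    def __init__(self, max_rate=0.1, low=-1.0, high=1.0):
        self.max_rate = max_rate  # largest change of the action per step (assumed)
        self.low = low
        self.high = high
        self.action = 0.0

    def step(self, rate_cmd):
        # The agent outputs rate_cmd in [-1, 1]; scale it to a per-step change.
        rate_cmd = max(-1.0, min(1.0, rate_cmd))
        delta = self.max_rate * rate_cmd
        # Integrate and keep the actual action inside its bounds.
        self.action = max(self.low, min(self.high, self.action + delta))
        return self.action


def shaped_reward(task_reward, rate_cmd, prev_rate_cmd, weight=0.05):
    """Subtract a penalty on the change between consecutive rate commands."""
    return task_reward - weight * (rate_cmd - prev_rate_cmd) ** 2
```

    Because the penalty acts on the difference between steps, a constant command costs nothing, which may hurt task performance less than penalising the magnitude itself.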
    Environments that require long-term memory and reasoning
    Could you recommend me some Reinforcement Learning environments that require long-term memory and reasoning? By long-term memory, I mean environments in which: - Evolution of the environment depends on states or actions experienced in the past, possibly if this dependency is long (i.e. several timesteps) - When re-visiting areas of the state space, these might be different depending on what the agent previously did in those areas Example: a robot does cleaning jobs in a house. When switching between tasks (e.g. cleaning the kitchen, cleaning the bathroom, doing the laundry) it needs to remember where it left the tools used in previous tasks to re-use them in new tasks submitted by /u/fedetask [link] [comments]  ( 1 min )
    Is anyone interested in this project?
    submitted by /u/7NoteDancing [link] [comments]
    Multi agent rl for different action space
    I was wondering which multi-agent RL algorithm would fit the following setting: a robot arm with a gripper, where the arm is controlled by one policy and the gripper by another. The action spaces of the two policies are different, but the observation space and the reward function are design choices. For instance, both policies could receive all observations or only local observations. Similarly, both policies could receive the same global reward or individual rewards. Is there a paper comparing these approaches? Thanks for any feedback submitted by /u/Informal_Temporary91 [link] [comments]  ( 1 min )
    [2205.10316] Seeking entropy: complex behavior from intrinsic motivation to occupy action-state path space
    submitted by /u/chimp73 [link] [comments]  ( 1 min )
    Probabilities in payoff matrix
    Hi guys, I'm trying to understand how I am supposed to define the probabilities to calculate (M&A, 1) and the other entries; I really don't get how. They say to "fix the frequencies pk for the outcome xk, such that the DM is indifferent between xk and the BEST outcome", but I don't get it. Hope you can help me. Thanks! https://preview.redd.it/so23m2038e291.png?width=865&format=png&auto=webp&s=5a09cc65a7400e37ffee275f5f333de06085fab1 submitted by /u/Giorgio_v1 [link] [comments]  ( 1 min )
  • Open

    AI-generated donuts
    If you're going to open a late-night donut shop, you're going to need a unique set of over-the-top donuts to set the proper festive atmosphere. But how to keep the ideas coming? I decided to see what donut ideas I could get using OpenAI's  ( 3 min )
    Bonus: Donuts that will possibly end the planet
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Detecting Wildfires with Image Analysis
    Just read an article with a firefighter in AK, USA who said "you don't know about them until there's smoke". It gave me an idea that is either pointless or somewhat meaningful, and I didn't know if it was feasible, so I came here! The basic idea is to use a neural network to detect wildfires at a statewide or nationwide scale. The firefighter also said that these natural, lightning-caused fires could be active for 2-5 days before they know about them, and that the delayed response is because these usually happen in remote areas. With many wildfires being caused by humans (maybe not in AK but in other parts of the country), along with naturally occurring wildfires, would the response be just as slow in less remote states such as California, Oregon, or Washington? If the response is much quicker in more populated states, then I don't think a detection system beyond humans confirming or reporting fires would be necessary there. But if Alaska ever wanted a better way of detecting fires, why not use satellites to take pictures every hour or so and send the data back to a neural network that could determine whether there's smoke in the image, and then have a human check whether it's a fire or not? I understand that clouds would look a lot like smoke, but I'm thinking there's gotta be a way to train an image recognition deep neural network to differentiate the two. If it could do that, then the response time to wildfires might change from days to hours. It would probably cost a shit ton of money to send satellites into space, along with the means of transferring that much data from the satellites to a neural network on the ground, but if it saves thousands of acres of wildlife, it could be worth it in future years when it's cheaper to do space stuff. This is purely an idea from someone who knows nothing about all of this. Just an idea. Any discussion is welcome! 
Thanks for reading :) submitted by /u/foxypablo [link] [comments]  ( 2 min )

  • Open

    [D] Weird trend in machine learning: papers tackling easy problems are well cited
    I don't know if anyone else has encountered this, but I have seen a lot of ML papers with extremely idealized assumptions (sometimes hardly relevant to machine learning) in which a group of researchers, sometimes even from very well-known universities, come together to "solve" the resulting problem. Despite this, these papers are quite well cited compared to the mean, which really confuses me. I am not sure if the researchers are just not aware of the weakness of their assumptions, as sometimes their work intersects with other fields, which may have much deeper insights. Sometimes it almost reads like a couple of researchers were trying to discover a new field and wrote a paper together in the process. I'm not mentioning the ML field in question, but I suspect this is a "global" problem. Has anyone else seen a similar trend? FYI, I read the post https://www.reddit.com/r/MachineLearning/comments/uyratt/d_i_dont_really_trust_papers_out_of_top_labs/ just after I wrote this; what's happening there (solving CIFAR10) is similar to what I'm describing, although the ML field I was thinking of is a bit more mathy. In any case, I'm sure this paper will be extraordinarily well cited. submitted by /u/fromnighttilldawn [link] [comments]  ( 1 min )
    [P] Confidence Intervals for binary classification
    Hello r/machinelearning. The situation: I am currently trying to estimate the accuracy of a machine vision system. The goal is to automate surface inspection in an industrial environment, e.g. detecting scratches and dents on a flat surface. I ran quite a few trials in order to estimate the system's accuracy. But a point estimate isn't worth much without a confidence interval. The literature does not recommend approximating the binomial distribution of the Bernoulli trials I conducted with a normal distribution when the probability of success is near 1 (or 0), because the normal approximation from the central limit theorem is unreliable there. Instead, the Agresti-Coull CI is recommended. The problem: at confidence level 1-alpha = 95% the upper bound of my confidence interval exceeds 100%, which strikes me as illogical. The question: Can you give me some advice on how to estimate the probability of (in)correct classification? Is the Agresti-Coull interval a good method for constructing a confidence interval with n > 100 trials, and if so, is it possible to get an upper bound > 100%, or does this result hint at a miscalculation? Appendix: CI_AgrestiCoull = p +- 2*sqrt((p-p²)/n) with p = k/n, k = correct classifications + 2, n = number of trials + 4 (values for confidence level 95%) https://www.jstor.org/stable/pdf/2685469.pdf?casa_token=0togmzmzFqMAAAAA:KSp8k04k859iUcDfVvJEzMh4y-2a7aheeBpOm5vMB-SNj2z7m8LeOs8C5gmJcZY1tmEhpK3OlA9Sqfav6mvHLhbBDaAhpYK2phXBD3uWGo0ZpdMFjqCo submitted by /u/chkthat [link] [comments]  ( 1 min )
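    The formula in the appendix above can be sketched in a few lines. This is an illustration under stated assumptions: `z = 2.0` reproduces the "+2 successes, +4 trials" rule quoted in the post, the function names are made up for this example, and the Wilson score interval is included as an alternative that the post does not mention. With an accuracy near 100%, the unclipped Agresti-Coull upper bound can land above 1, so that alone does not indicate a miscalculation; clipping the bounds to [0, 1] is the usual remedy.

```python
import math


def agresti_coull(successes, trials, z=2.0, clip=True):
    """Agresti-Coull interval; z=2.0 gives the '+2 / +4' rule from the post."""
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    lower, upper = p_adj - half, p_adj + half
    if clip:
        # A probability cannot leave [0, 1], so truncate the bounds.
        lower, upper = max(0.0, lower), min(1.0, upper)
    return lower, upper


def wilson(successes, trials, z=1.96):
    """Wilson score interval (an alternative, not named in the post)."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


# Hypothetical data: 100 correct classifications out of 100 trials.
print(agresti_coull(100, 100, clip=False))  # upper bound is above 1 here
print(agresti_coull(100, 100))              # clipped to [0, 1]
print(wilson(100, 100))
```

    The Wilson bounds stay within [0, 1] up to floating-point rounding, which is one reason it is often suggested when the success rate is close to 0 or 1.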
    [R][P] Gradio Web Demo for HairCLIP: Design Your Hair by Text and Reference Image
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Fairness of comparison of superiority and efficiency between different neural network architectures
    During the last 10 years Deep Learning has made impressive progress in various domains, but here I would like to be concrete and focus on computer vision, and in particular ImageNet as a popular benchmark. Computer vision models have evolved much from vanilla CNNs consisting of only convolutional layers + activations + pooling to more advanced with skip connections, depthwise separable convolutions, squeeze-excitation blocks, and, recently, vision transformers and derivative models. There are a lot of papers proposing some architecture changes and claiming that at a given amount of parameters and FLOPs they achieve the best result, i.e lie highest of all on the Pareto front. ​ From MobileNet V3 However, the final performance depends not only on how efficient and optimal is design of th…  ( 2 min )
    [R] OnePose can estimate 6D poses of arbitrary household objects without instance/category-specific training or CAD models
    submitted by /u/SpatialComputing [link] [comments]  ( 2 min )
    [P] I reviewed 50+ open-source MLOps tools. Here’s the result
    I spent the past weeks researching the most popular open-source MLOps tools and I would like to share the results with you. I created a website (https://mymlops.com/) listing the tools, explaining when to use each of them and pitfalls to watch out for. You can fill in your stack based on our template. Why did I do it? I feel the current MLOps landscape has amazing tools. But so many of them! Right now picking a solution feels like a puzzle. It’s confusing and incredibly hard to put the pieces together to fit your needs. This is just a small contribution. If you want me to be more helpful, I have a small ask. What are the things you currently struggle with but haven’t found guidance on anywhere? Here are some of my thoughts: (1) Examples of MLOps stacks: tools that work well together and are popular combinations. (2) Tool comparisons with code snippets: see tools in action. (3) Stack “cookiecutter”: code templates of tools working together. I would be grateful if you could point me in the right direction. No opinion is too small! 🙂 submitted by /u/Academic_Arrak [link] [comments]  ( 2 min )
    [R] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
    submitted by /u/koolaidman123 [link] [comments]
    [P] Homemade algorithm for building faces in < 1 minute with 80 images.
    Was inspired by the use of Wasserstein distances to come up with a new method for generating Eg. faces without using neural networks. Raw output + noise reduction https://preview.redd.it/r8babhtex7291.png?width=1090&format=png&auto=webp&s=f6539a7fab8b6d4b169fe0357151fc07d41e106d https://preview.redd.it/tunt9vn148291.png?width=1125&format=png&auto=webp&s=b109eeaa2beb7884946637290efbbc3d4bd50672 Specifications Training data around ~80 pieces of 128x128 images (B/W) Build time: Fast RAM usage: Extremely high Parallelizable: Yes (Step 1 in algorithm) and No (step 2 and step 3 in algorithm) Scalability: Poor. Algorithm used: 1) Build dictionary with pool of vectors/fragments from training set- Start by sliding over each image with a 9x9 grid (may be different sizes). For each sub fra…  ( 2 min )
    [R] Happy to share our latest Research paper: Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges
    Hello everyone, We are pleased to share our new research paper with the ML community: Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges. Our paper was accepted at ACM Transactions on Computing for Healthcare 2022 and published in the ACM Digital Library. Federated learning (FL) is a novel paradigm that allows deploying large-scale machine learning models trained in different data centres without transferring data. The sensitive and distributed nature of EHR (Electronic Health Records) in real-world scenarios stimulates a need for an effective mechanism to learn from data residing in health-related institutions and hospitals while accounting for data privacy. This motivates us to examine the potential and value of federated learning in the healthcare domain. …  ( 2 min )
    [R] Guidance: a cheat code for diffusion models (Blog post)
    submitted by /u/hardmaru [link] [comments]
    [D] which small data problems pique your interest
    Motivated by a recent post (can't seem to find it now; maybe from another subreddit) about how DL architecture research is gatekeep-ed by intensive computational requirements, I'd like to ask: what are your favorite small data problems? Why do you find it interesting? submitted by /u/SpookyTardigrade [link] [comments]  ( 1 min )
    [D][R] How to create/tag the dataset for the sentence similarity task?
    Hello everyone, I have a large corpus of domain-specific documents. I have gone through the quora question pair dataset and BIOSSES datasets for reference. I am trying to tag the dataset for different Language understanding tasks, including sentence similarity. But I am having difficulty creating the SST dataset for such a task. If I have 2-3 experts in that field, what is the best way to create the dataset for such a task? My thoughts: Approach 1 I am converting all documents into paragraphs and encoding them using USE, Elmo or domain-specific Bert embeddings. Extract top-10 similar sentences for each sentence (using Cosine or Levenshtein distance) A UI will show the expert a sentence and its similar sentences. An expert will choose the best similar sentence for the given sentence, and in the backend, both sentences will be tagged with label 1 ( which indicates that both are similar) Later, expert two will review this tagged dataset ( tagged by expert 1 ) and give the scale between 0-5; how similar are they? ​ Approach 2 Replace random keywords in sentences with synonyms using the domain pre-trained bert model. Show both sentences ( replaced sentence and original sentence ) to the expert, and let the expert tag the sentence with a scale of 0-5? ​ I'd love to hear suggestions from members of this subreddit. Any kind of input would be appreciated. submitted by /u/aadityaura [link] [comments]  ( 1 min )
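    The candidate-retrieval step in Approach 1 above (top-10 most similar sentences by cosine similarity) could look roughly like this. It is a sketch under assumptions: the embeddings are taken to be plain lists of floats already produced by whichever encoder is chosen, and `cosine` and `top_k_similar` are hypothetical helper names, not part of any library mentioned in the post.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(embeddings, query_idx, k=10):
    """Return (index, score) pairs of the k sentences closest to the query."""
    query = embeddings[query_idx]
    scored = [
        (cosine(query, emb), i)
        for i, emb in enumerate(embeddings)
        if i != query_idx  # never pair a sentence with itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy 2-D "embeddings" standing in for real sentence vectors.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_similar(toy, query_idx=0, k=2))
```

    Showing experts only the nearest neighbours tends to produce mostly high-similarity pairs, so it may be worth also sampling some mid- and low-ranked candidates to cover the whole 0-5 scale.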
    [D] Best Tech Stack for Machine Learning Web Applications
    According to your experience deploying ML Solutions, what would be the best tech stack for deploying a web application which integrates various ML Algorithms in the background? Currently, I'm looking at FA.R.M (FastAPI, React JS, MongoDB) - but not sure what your take would be on this. Also, since most of our algorithms run on Notebooks - what would be the best practice for moving their outputs to production? Any kind of input would be appreciated. submitted by /u/XhoniShollaj [link] [comments]  ( 1 min )
  • Open

    Cosmic Creation | MASTERPIECE BATTLE - RAW VS SMOOTH
    submitted by /u/LordPewPew777 [link] [comments]
    Borealis AI Research Introduces fAux: A New Approach To Test Individual Fairness via Gradient Alignment
    Machine learning models are trained on massive datasets with hundreds of thousands, if not billions, of parameters. However, how these models translate the input parameters into results is unknown. Having said that, the decision-making behavior of the model is difficult to comprehend. Furthermore, models are frequently skewed towards specific parameters due to faulty assumptions made during the machine learning process, which are difficult to detect. Researchers from Borealis AI introduced fAux, a new approach to testing fairness. They state that one approach to assessing fairness at the global level is to look at it from afar. By aggregating results across a complete population, the goal is to statistically quantify disparate treatment. The distribution of good and negative outcomes is then tested using fairness criteria. These are simple to build and can be computed without having access to the original model — because one only needs the model’s predictions. Moreover, historical data can even be tested. Continue reading | Research: 'fAux: Testing Individual Fairness via Gradient Alignment' https://i.redd.it/9crqd3ymea291.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    3D modelization directly from a picture...Insane speed and quality
    submitted by /u/the_anonymizer [link] [comments]
    AI helps to track health of coral reefs by learning "song of the reef"
    submitted by /u/qptbook [link] [comments]
    Moving Local AI/ML Experiments to the Cloud with Terraform Plugin - Tutorial
    Training a machine learning model on a local machine is a quick and easy way to set up a new project. This is sufficient for simple experiments (with reduced data subsets or small models) and avoids paying to rent heavy cloud compute resources, which can also be intimidating even with a decent background in DevOps. Once you have set up and iterated over your data and code locally, you may reach a point where more powerful compute resources are needed to train a larger model and/or use bigger datasets. A methodology for that is explained in the following guide: Moving Local Experiments to the Cloud with Terraform Provider Iterative (TPI) submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Meta's new muscle AI could power next-gen avatars
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    Experiential Learning from Sequential Data - Anton Kolonin - OpenCog AGI Discussion
    submitted by /u/akolonin [link] [comments]
  • Open

    Developing a Python Program Using Inspection Tools
    Python is an interpreted language. This means an interpreter runs our program, rather than the code being compiled and run natively. In Python, a REPL (read-eval-print loop) can run commands line by line. Together with some inspection tools provided by Python, it helps in developing code. In the following, you will see how […] The post Developing a Python Program Using Inspection Tools appeared first on Machine Learning Mastery.  ( 16 min )
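    As a small illustration of the kind of inspection the blurb above refers to, the following uses only built-ins and the standard `inspect` module. The function `scale` is a made-up example, and this is not taken from the linked article, whose text is truncated here.

```python
import inspect


def scale(values, factor=2.0):
    """Multiply every value by factor."""
    return [v * factor for v in values]


# What kind of object is this, and how is it called?
print(type(scale))
print(inspect.signature(scale))

# Read the docstring without opening the source file.
print(inspect.getdoc(scale))

# List the public attributes of a type, as one would at the REPL.
print([name for name in dir(list) if not name.startswith("_")])
```

    At a REPL the same calls can be typed one at a time, which is often quicker than searching documentation for a forgotten argument name.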
  • Open

    "Flexible Diffusion Modeling of Long Videos", Harvey et al 2022 (Minecraft, CARLA self-driving car, DMLab video modeling: stable 1h-long video samples)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    How to memorize the ASCII table
    Before discussing how you could memorize a table of ASCII characters and numeric values, I should say a little about why you might do so. One reason is simply for the challenge. It’s more doable than it may sound. It’s also useful information, though it’s debatable whether it’s worth memorizing. YMMV. There was a time […] How to memorize the ASCII table first appeared on John D. Cook.  ( 4 min )

  • Open

    [R] Reconnaissance Blind Chess - Join the NeurIPS Competition!
    Create a bot for the NeurIPS 2022 competition in Reconnaissance Blind Chess! Reconnaissance Blind Chess is a chess variant designed for new research in artificial intelligence. RBC includes imperfect information, long-term strategy, explicit observations, and almost no common knowledge. These features appear in real-world scenarios, and challenge even state of the art algorithms including those used to create super-human bots in chess, Go, and poker, for example. Each player of RBC controls traditional chess pieces, but cannot directly see the locations of her opponent's pieces. Rather, she learns partial information each turn by privately sensing a 3x3 area of the board. RBC's foundation in traditional chess makes it familiar and entertaining to human players, too! There is no cost to enter this tournament. Winners will receive a small monetary prize and authors of the best AIs will be invited to talk about their bots at NeurIPS, the world's largest AI conference. Learn more, play a game of RBC yourself, and join our research community at https://rbc.jhuapl.edu ! https://preview.redd.it/xzd9z110c3291.png?width=150&format=png&auto=webp&s=a0b86fe6c0e3c3060f30d0e1eb8acfd81f6bb9dd Organized by: Johns Hopkins University Applied Physics Laboratory with Ashley J. Llorens (Microsoft Research), Todd W. Neller (Gettysburg College), Raman Arora (Johns Hopkins University), Bo Li (University of Illinois), Mykel J. Kochenderfer (Stanford University) submitted by /u/rwgardner [link] [comments]  ( 1 min )
    [D] How to train a model to identify ranked classes?
    Hi, I am trying to train a model to estimate the severity shown in an image using classes like "normal", "mild", "moderate", "severe". One approach would be to do multiclass classification, but that seems suboptimal since it doesn't encode the knowledge that the classes are not arbitrary but ranked (i.e. normal < mild < moderate < severe). Another approach is to encode these classes as numbers (i.e. normal=0, mild=1, moderate=2, severe=3) and perform regression. This seems sensible but I have never seen it done. Is there any literature on this topic? Is there another approach I am missing? submitted by /u/rsandler [link] [comments]  ( 1 min )
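    The problem described above is generally known as ordinal regression (or ordinal classification). One middle ground between the two options in the post, sketched here as an assumption rather than as the poster's approach, is to turn each ranked label into cumulative binary targets ("is the severity above normal?", "above mild?", "above moderate?") and train a model with one sigmoid output per threshold. The helper names below are hypothetical.

```python
CLASSES = ["normal", "mild", "moderate", "severe"]


def encode_ordinal(label):
    """Map a ranked label to cumulative binary targets, one per threshold."""
    rank = CLASSES.index(label)
    # Target t is 1 when the label's rank is strictly above threshold t.
    return [1 if rank > t else 0 for t in range(len(CLASSES) - 1)]


def decode_ordinal(probs, threshold=0.5):
    """Count the leading thresholds the model believes are exceeded."""
    rank = 0
    for p in probs:
        if p <= threshold:
            break
        rank += 1
    return CLASSES[rank]


print(encode_ordinal("moderate"))
print(decode_ordinal([0.9, 0.8, 0.2]))
```

    With this encoding, predicting "severe" for a "normal" image gets three targets wrong while predicting "mild" gets only one wrong, so the loss reflects the ranking without assuming the classes are equally spaced, as plain regression on 0-3 would.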
    [P] TensorFlow Similarity 0.16 is out
    Happy Friday, Just a quick note that TensorFlow Similarity 0.16 is out. Besides adding the XMB loss, this release mostly focuses on refactoring and optimizing the core components to ensure everything works smoothly and accurately. Details are in the changelog as usual and a simple pip install -U tensorflow_similarity should just work. We spent a lot of time behind the scenes making sure SOTA paper results can be reproduced and fixed a lot of bugs (including in augmentations) that should give you some accuracy boost compared to 0.16. Next we're going to keep working toward providing a strong foundation and extensive benchmarking capabilities so you can rely on it for your research. The last missing piece before 1.0 is how we do storage so that it scales past 10M points and works with many ANN backends. If you are interested in helping, let us know. Have a great weekend! submitted by /u/ebursztein [link] [comments]  ( 1 min )
    [R] Flexible Diffusion Modeling of Long Videos
    Paper. Abstract: We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can at test-time sample any arbitrary subset of video frames conditioned on any other subset and present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which frames in a long video are sampled and use selective sparse and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA self-driving car simulator. Blog post (includes generated videos). Twitter thread from some of the authors. submitted by /u/Wiskkey [link] [comments]  ( 1 min )
    On the Paradox of Learning to Reason from Data - Language models only learn a facsimile of reasoning based off of inherent statistical features
    submitted by /u/stressed-nb [link] [comments]  ( 1 min )
    [P] BrainAgent: Open Source for SOTA Performance on DMLab-30 of Multi-Task RL !
    Hello. I'd like to introduce an awesome project, "Brain Agent." github: https://github.com/kakaobrain/brain_agent Brain Agent is a codebase for large-scale RL. The key contribution is the SOTA result plus open-sourced code and checkpoints for the DMLab-30 environment. DMLab-30 for Multi-task RL: DMLab-30 is an environment for multi-task RL, consisting of 30 different tasks, developed by DeepMind. The tasks are hard to solve and important for multi-task RL research, but there was no reproducible codebase for SOTA performance on it. As a result, SOTA performance has always been reported in papers by DeepMind, while RL researchers outside DeepMind could not conduct cutting-edge research on the environment. In this project, Kakao Brain succeeded in achieving state-of-the-art performance on DMLab-30. In addition, we released the code for evaluation and pretrained checkpoints, and hope our project helps many RL researchers focus on how to solve the difficult multi-task RL tasks on DMLab-30. Reported Performance of Released Checkpoints Enjoy and ⭐️ submitted by /u/leedoyup [link] [comments]  ( 1 min )
    [D] Can AI Replace Our Graphic Designer?
    A nice video by MKBHD that gives some really nice insights into the effect that DALL-E 2 or ML will have in the future. Considering that the model can make multiple examples in a very short space of time, things are getting very interesting and scary, I will say. https://youtu.be/MwAAH9tBoMg submitted by /u/takuonline [link] [comments]  ( 1 min )
    Feasibility of aggregating text messages to train a system for natural language/chat bot [D]
    I wonder if this is possible, I know getting messages from random people could create a lot of noise and varied inputs, but does it seem feasible to clean/prepare texts in such a way to use them for this purpose? submitted by /u/WordJord [link] [comments]  ( 1 min )
    [D] good datasets for evaluating ranking models
    What are some good datasets for evaluating ranking models these days? I have seen Criteo and MovieLens-1M in papers from Ed Chi's team. Criteo doesn't seem to be available anymore. Are there any others people have been using? Please advise. Thanks. submitted by /u/scan33scan33 [link] [comments]
    [R] Training ReLU networks to high uniform accuracy is intractable
    PDF on ResearchGate / arXiv Abstract: Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results. submitted by /u/julbern [link] [comments]  ( 1 min )
    [P] Solving Sudoku in real-time using a Convolutional Neural Network and OpenCV
    Article and a source code: https://dmitryelj.medium.com/solving-sudoku-in-real-time-using-a-convolutional-neural-network-and-opencv-e47a92478dce submitted by /u/DmitriiElj [link] [comments]
    [P] Making the annotation part of Hasty free
    We all know that you need large quantities of high-quality data to succeed in AI. But producing that data is expensive. Until now. Whatever you are paying for labeling is too much with our new release. Data labeling takes anywhere from 35 to 80% of project budgets. We drastically reduce the cost by giving you free access to all our labeling automation features without imposing usage or user limits - and it comes with no strings attached*. Read more about what we offer and how it compares with the competition here: https://hasty.ai/content-hub/articles/making-labeling-tooling-free?utm_source=2884ka1 https://preview.redd.it/putp5jrjvz191.png?width=1200&format=png&auto=webp&s=abe9fcc29ca90fd4033b36f2ed58cf504f51240e *Disclaimer: Our free plan has an upper limit of 30GB of storage. submitted by /u/treebeard_hasty_ai [link] [comments]  ( 1 min )
    [D] I don't really trust papers out of "Top Labs" anymore
    I mean, I trust that the numbers they got are accurate and that they really did the work and got the results. I believe those. It's just that, take the recent "An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems" paper. It's 18 pages of talking through this pretty convoluted evolutionary and multitask learning algorithm, it's pretty interesting, solves a bunch of problems. But two notes. One, the big number they cite as the success metric is 99.43 on CIFAR-10, against a SotA of 99.40, so woop-de-fucking-doo in the grand scheme of things. Two, there's a chart towards the end of the paper that details how many TPU core-hours were used for just the training regimens that results in the final results. The sum total is 17,810 core-hours. Let's …  ( 6 min )
    [D] Are there any existing algorithms that apply kNN in a bootstrapping manner?
    Just out of curiosity, I was wondering what would happen if we combined kNN and bootstrapping in a certain manner. Specifically, say we have N points that we want to classify and we have a labeled set of points that are used for kNN. After applying kNN to these N points individually, we iterate over the following process. Repeat until the class change is arbitrarily minimal: for each of the N observed points, (1) look at its L nearest neighbors within the N observed points, and (2) alter the class of the current point by majority vote of those L points. In other words, we apply kNN iteratively on unobserved points until "convergence" (i.e. the assignment change of points is minimal). Going ahead with the idea, I tried implementing a quick snippet of code to test this out on some toy datasets, and for datasets where vanilla kNN (without altering the representation of the data) obtains an accuracy of 65%, this iterative method improves it a bit, up to around 70%. I did a quick search for an existing algorithm similar to this and couldn't find anything, but this seems like a simple enough idea that others must have already considered it in the past. EDIT: Not too relevant, but it might help to add that this was inspired a bit by PU Learning. submitted by /u/TrepidEd0601 [link] [comments]  ( 2 min )
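    The procedure described in the post can be sketched in a few lines of pure Python. This is only one possible reading of it, not the poster's code: the function names, the Euclidean distance, the synchronous update (all points re-vote on the previous round's labels) and the `max_rounds` cap are assumptions made for illustration.

```python
from collections import Counter
import math


def knn_label(point, ref_points, ref_labels, k):
    """Majority label among the k reference points closest to `point`."""
    order = sorted(range(len(ref_points)), key=lambda i: math.dist(point, ref_points[i]))
    votes = Counter(ref_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def iterative_knn(train_x, train_y, query_x, k=3, l=3, max_rounds=20):
    """Plain kNN first, then repeated majority re-voting among the query points."""
    # Step 1: ordinary kNN against the labeled set.
    labels = [knn_label(q, train_x, train_y, k) for q in query_x]
    if len(query_x) < 2:
        return labels  # nothing to re-vote against

    # Step 2: each query point takes the majority label of its l nearest
    # *query* neighbors (assumption: synchronous update per round).
    for _ in range(max_rounds):
        new_labels = []
        for i, q in enumerate(query_x):
            others = [j for j in range(len(query_x)) if j != i]
            new_labels.append(
                knn_label(q, [query_x[j] for j in others], [labels[j] for j in others], l)
            )
        changed = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        if changed == 0:  # "convergence": no assignment changed this round
            break
    return labels
```

    The second stage is close in spirit to label propagation and self-training from the semi-supervised learning literature, which may be a useful place to look for prior work on the idea.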
    Accurate detection of sepsis at ED triage using machine learning with clinical natural language processing (Preprint open for comments)
    submitted by /u/creilly94010 [link] [comments]
    [Discussion] How does the SimCLR loss function not penalize image belonging to the same class?
    I have a question about SimCLR that I have not been able to resolve. In the numerator of the SimCLR loss function, $z_i$ is the original image, and $z_j$ is the augmented version of $z_i$. We want the distance between those to be small. Similarly, in the denominator, $z_k$ for k = 1:2N, k =/= i, ranges over all other images in the batch. Those are going to be pushed away from $z_i$. This is fine, but what guarantee do we have that k won't belong to an image of the same class as image i? The way this is structured, we will also end up pushing away images of the same class. Thanks submitted by /u/Ayakalam [link] [comments]  ( 2 min )
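    The concern can be made concrete with a small sketch of the per-pair NT-Xent loss used by SimCLR, $\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$. The function names, the toy 2-D embeddings and the temperature value below are illustrative assumptions, not taken from the post.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, i, j, temperature=0.5):
    """Loss for the positive pair (i, j) in a batch of 2N embeddings `z`."""
    numerator = math.exp(cosine(z[i], z[j]) / temperature)
    # The denominator runs over every k != i. It has no class information,
    # so a same-class image in the batch is treated as a negative too.
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / temperature) for k in range(len(z)) if k != i
    )
    return -math.log(numerator / denominator)


# Positive pair (0, 1) with two unrelated negatives.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
# Same batch, but one negative is replaced by a look-alike of image 0.
batch_same_class = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [-1.0, 0.0]]

loss_unrelated = nt_xent(batch, 0, 1)
loss_same_class = nt_xent(batch_same_class, 0, 1)
```

    In this toy setup `loss_same_class` is larger than `loss_unrelated`, so the loss does penalize closeness to the look-alike: nothing in the objective itself guards against same-class negatives.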
  • Open

    Researchers From Imperial College London introduce TsT-GAN: A Novel Framework For Training Time-Series Generative Models
    Nowadays, data is considered the fuel of the data analytics field. Real-time applications require time-series data for analysis and forecasting, but they usually lack sufficient data. Hence, data augmentation techniques need to be adopted. Researchers from Imperial College London introduce a framework called TsT-GAN, based on generative adversarial networks (GANs), that is used to augment time-series data. It aims to capture the stepwise conditional distributions of real sequences while also modeling the joint distribution of entire real sequences. The paper's significant contribution is a model whose generator can produce sequences from the entire joint distribution while respecting those conditional distributions. The training framework can be applied to any time-series dataset; it is evaluated quantitatively with the standard train-on-synthetic, test-on-real approach and qualitatively using t-SNE. Continue Reading | Check out the paper and related codes. https://preview.redd.it/ajeuctsyy2291.png?width=778&format=png&auto=webp&s=2d3613ee07aff5b74800a0d8c20826b82fd36097 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    In a regular ANN, do the weights and biases have to be randomized at the start for backpropagation to work?
    Hi! I'm currently working on a library for creating and training neural networks; however, my backpropagation function seems to be broken. I have gone back and forth through the algorithms, and it seems to run the correct mathematical operations, yet it doesn't work. My test involves two patterns as input, and the expected output is 1, 0 for the first pattern and 0, 1 for the other. The test network has 1 hidden layer, with 4 input neurons, 2 hidden neurons and 2 output neurons. When I checked my debug logs I saw that the derivative of the hidden layer was 0. I realized that, at least in my implementation, this was due to the two outputs cancelling each other out. Is this supposed to happen? Am I just supposed to randomize all values at the start, or is my implementation wrong? submitted by /u/WellWhatDoIPutHere [link] [comments]  ( 1 min )
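    A minimal sketch of why identical starting weights stall a 4-2-2 network like the one in the post. The sigmoid activation, squared-error loss, omitted biases and all function names are assumptions, since the post does not show its implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_deltas(x, target, w1, w2):
    """Backprop error terms of the hidden units for one sample (biases omitted).

    w1[h][i] connects input i to hidden unit h; w2[o][h] connects hidden
    unit h to output o.
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Squared-error loss: delta_o = (y_o - t_o) * sigmoid'(net_o)
    out_delta = [(y - t) * y * (1 - y) for y, t in zip(output, target)]
    # delta_h = (sum_o w2[o][h] * delta_o) * sigmoid'(net_h)
    return [
        sum(w2[o][h] * out_delta[o] for o in range(len(w2))) * hidden[h] * (1 - hidden[h])
        for h in range(len(hidden))
    ]


def constant_weights(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


def random_weights(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


x, target = [1.0, 0.0, 1.0, 0.0], [1.0, 0.0]
rng = random.Random(0)

# All-zero start: the hidden error terms are exactly zero.
zero_init = hidden_deltas(x, target, constant_weights(2, 4, 0.0), constant_weights(2, 2, 0.0))
# Equal non-zero start: both hidden units receive the same error term.
const_init = hidden_deltas(x, target, constant_weights(2, 4, 0.5), constant_weights(2, 2, 0.5))
# Random start: the hidden units receive different error terms.
rand_init = hidden_deltas(x, target, random_weights(2, 4, rng), random_weights(2, 2, rng))
```

    With all-zero weights the hidden error terms vanish because they are multiplied by the zero output weights, and with any equal start the hidden units get identical updates and stay copies of each other. Random initialization breaks that symmetry, which is why it is the usual default.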
    A Long Short-Term Memory for AI Applications in Spike-based Neuromorphic Hardware
    submitted by /u/nnnaikl [link] [comments]  ( 1 min )
  • Open

    Traditional Reinforcement Learning versus POMDP.
    What exactly is the relationship between partial observability of states and the Reinforcement Learning Problem? Sutton and Barto address partial observability only briefly, for about 2 pages in the later chapters, and their description is that there is some latent space of unobserved states. But their description makes it sound like this is some kind of "extension" to RL, rather than something that affects the core mechanics of an RL agent. It seems to me that POMDPs act on the RL problem in a different way than traditional RL agents, even down to how they construct their Q network, and how they go about producing their policy network. In one sentence: a traditional RL agent explores "dumb" and a POMDP agent explores "smart". I will give two examples below #POMDPs reason about un-visited s…  ( 3 min )
    Offered a position as an RL Engineer - Seeking Advice
    Hi all, I was recently offered a full-time role as an RL engineer (I am using the term engineer because I do not have a PhD so I wouldn't in good faith qualify myself as a "researcher", however, I will be doing R&D for the company on SOTA RL). I come from a standard MLE background. I have just over 3 years of experience and my master's in machine learning is almost complete. I do have experience in RL from academia. DQN from scratch, QMIX from scratch, etc. RL is a strong interest of mine and I would love to pursue it. I do have a competing MLE offer from another company and I am wondering if it would be too risky to accept a full-time RLE role. I believe RL has massive potential in the future, however, I don't want to shoot myself in the foot by specializing for a few years on the wrong topic. Any advice would be appreciated. Also fwiw, I am being considered for a full-time NLP engineering role as well, should that come through I will also have that to consider. I am a big fan of transformer tech. Any advice is appreciated. v/r, submitted by /u/OpenSource-AI [link] [comments]  ( 2 min )
    BrainAgent: Open Source for SOTA Performance on DMLab-30 of Multi-Task RL !
    Hello. I'd like to introduce an awesome project, "Brain Agent." GitHub: https://github.com/kakaobrain/brain_agent Brain Agent is a codebase for large-scale RL. The key contribution is the SOTA result, with open-sourced code and checkpoints, for the DMLab-30 environment. Examples of DMLab-30: DMLab-30 is an environment for multi-task RL, developed by DeepMind and consisting of 30 different tasks. The tasks are hard to solve and important for multi-task RL research, but there was no reproducible codebase for SOTA performance on it. As a result, SOTA performance has always been reported in papers by DeepMind, while RL researchers outside DeepMind could not conduct cutting-edge research on the environment. In this project, Kakao Brain succeeded in achieving state-of-the-art performance on DMLab-30. In addition, we released the code for evaluation and pretrained checkpoints, and hope our project helps many RL researchers focus on how to solve the difficult multi-task RL tasks on DMLab-30. Reported Performance of Released Checkpoint Enjoy and ⭐️ submitted by /u/leedoyup [link] [comments]  ( 1 min )
    In the spirit of #throwbackthursday, this is some early concept art for the training cloud where all the magic happens in our game powered by Reinforcement Learning, Animo Island!
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    RL Simulators and Frameworks
    I just started working on a personal project in which I want to solve a RL task and I have to create a whole new environment for that. I thought of using Mujoco as the simulator, but it seems to be pretty difficult to install since it has been recently acquired by deepmind and difficult to interface with RL frameworks/libraries. At the moment what simulator and rl library (algorithms) would you recommend to use? If it is easy to create a container on docker with gpu that would be even better. submitted by /u/jeferal [link] [comments]  ( 1 min )
    Multi-agent path planning
    As part of my work I am trying to tackle the multi-agent path planning problem. I have already tried a few optimization techniques like PSO (did not give good results) and genetic algorithms like NEAT (gave decent results but still room for improvement), so I wanted to know if anyone has worked on this problem before, what they used and what kind of results they got. PS: I am currently testing machine learning techniques for this, like imitation learning, and after that I might test RL, so if anyone has tried those for this problem I would love to know what they ended up getting. submitted by /u/temp_phd [link] [comments]  ( 1 min )
    Every day, I was tossing and turning, thinking about how I made such a rubbish graduation project. The DRL is really hard😭
    submitted by /u/ecstayalive [link] [comments]  ( 1 min )
  • Open

    It’s All About Data: The Training Methods of Deep Learning
    In deep learning, there are different training methods. Which one we use in an AI project depends on the data provided by our customer: how…  ( 2 min )
  • Open

    University of Negev Researchers Develop DeepDPM: A Deep Clustering Algorithm With An Unknown Number Of Clusters
    Clustering is an essential unsupervised learning task in which class labels are not accessible, unlike in the supervised setting of classification. Furthermore, the number of classes, K, and their relative sizes are unknown in the totally unsupervised setting on which this research focuses. Clustering tasks have not been overlooked by Deep Learning (DL). Large and high-dimensional datasets are typically clustered better and more efficiently by DL approaches than by traditional clustering methods. However, while nonparametric approaches have advantages over parametric methods (methods that require a known K) in classical clustering, there are just a few nonparametric deep clustering methods. Unfortunately, the latter are neither scalable nor effective enough. It is advantageous to be able to deduce the latent K. Parametric approaches may perform poorly if K is not accurately estimated. In both balanced and unbalanced datasets, using the wrong K can have a major negative impact on parametric approaches. Continue Reading | Check out the paper and codes https://i.redd.it/rahghkvc41291.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    The past two years went down in a blink because of some pandemic? Check out this 2021 recap of the most exciting advancements in the AI field to see what you may have missed out on!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    AI and data science resource
    Hello world! I want to share with you my Telegram channel dedicated to everything concerning AI and data science (from math and statistics to programming and machine learning). I don't promise you brand new information; I just collect knowledge from different sources and share it in a nice and structured way in order to make your learning process easier. I would love to see you there! Also, I am open to criticism, so feel free to tell me your opinion! submitted by /u/NordicDude49 [link] [comments]  ( 1 min )
    Iterative to Launch Open Source Tool, First to Train ML Models on Any Cloud Using Terraform Solution
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    Beginner; looking for AI learning buddy
    Hey there! Just started getting into machine learning. Currently learning linear regression. Would love to find a buddy to talk AI with and learn it together with (bonus if you know some stuff about hardware as well). I personally adopt the common philosophy of "if you can't explain it then you don't understand it" so if you believe we have the same goal and mindset, hit me up! submitted by /u/Signal_Warrior [link] [comments]  ( 1 min )
    Olivier Levasseur’s Treasure
    Would it be possible to program an AI that can decipher any hidden code? Or is it not? submitted by /u/I__like_bagels [link] [comments]
  • Open

    TNN7: A Custom Macro Suite for Implementing Highly Optimized Designs of Neuromorphic TNNs. (arXiv:2205.07410v2 [cs.AR] UPDATED)
    Temporal Neural Networks (TNNs), inspired from the mammalian neocortex, exhibit energy-efficient online sensory processing capabilities. Recent works have proposed a microarchitecture framework for implementing TNNs and demonstrated competitive performance on vision and time-series applications. Building on these previous works, this work proposes TNN7, a suite of nine highly optimized custom macros developed using a predictive 7nm Process Design Kit (PDK), to enhance the efficiency, modularity and flexibility of the TNN design framework. TNN prototypes for two applications are used for evaluation of TNN7. An unsupervised time-series clustering TNN delivering competitive performance can be implemented within 40 uW power and 0.05 mm^2 area, while a 4-layer TNN that achieves an MNIST error rate of 1% consumes only 18 mW and 24.63 mm^2. On average, the proposed macros reduce power, delay, area, and energy-delay product by 14%, 16%, 28%, and 45%, respectively. Furthermore, employing TNN7 significantly reduces the synthesis runtime of TNN designs (by more than 3x), allowing for highly-scaled TNN implementations to be realized.
    Adaptive Fairness-Aware Online Meta-Learning for Changing Environments. (arXiv:2205.11264v2 [cs.LG] UPDATED)
    The fairness-aware online learning framework has arisen as a powerful tool for the continual lifelong learning setting. The goal for the learner is to sequentially learn new tasks where they come one after another over time and the learner ensures the statistic parity of the new coming task across different protected sub-populations (e.g. race and gender). A major drawback of existing methods is that they make heavy use of the i.i.d assumption for data and hence provide static regret analysis for the framework. However, low static regret cannot imply a good performance in changing environments where tasks are sampled from heterogeneous distributions. To address the fairness-aware online learning problem in changing environments, in this paper, we first construct a novel regret metric FairSAR by adding long-term fairness constraints onto a strongly adapted loss regret. Furthermore, to determine a good model parameter at each round, we propose a novel adaptive fairness-aware online meta-learning algorithm, namely FairSAOML, which is able to adapt to changing environments in both bias control and model precision. The problem is formulated in the form of a bi-level convex-concave optimization with respect to the model's primal and dual parameters that are associated with the model's accuracy and fairness, respectively. The theoretic analysis provides sub-linear upper bounds for both loss regret and violation of cumulative fairness constraints. Our experimental evaluation on different real-world datasets with settings of changing environments suggests that the proposed FairSAOML significantly outperforms alternatives based on the best prior online learning approaches.
    CNNs are Myopic. (arXiv:2205.10760v2 [cs.CV] UPDATED)
    We claim that Convolutional Neural Networks (CNNs) learn to classify images using only small seemingly unrecognizable tiles. We show experimentally that CNNs trained only using such tiles can match or even surpass the performance of CNNs trained on full images. Conversely, CNNs trained on full images show similar predictions on small tiles. We also propose the first a priori theoretical model for convolutional data sets that seems to explain this behavior. This gives additional support to the long standing suspicion that CNNs do not need to understand the global structure of images to achieve state-of-the-art accuracies. Surprisingly it also suggests that over-fitting is not needed either.
    Representation learning with function call graph transformations for malware open set recognition. (arXiv:2205.06918v2 [cs.CR] UPDATED)
    Open set recognition (OSR) problem has been a challenge in many machine learning (ML) applications, such as security. As new/unknown malware families occur regularly, it is difficult to exhaust samples that cover all the classes for the training process in ML systems. An advanced malware classification system should classify the known classes correctly while sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for the function call graph (FCG) based malware representations to facilitate the pretext task. Also, we present a statistical thresholding approach to find the optimal threshold for the unknown class. Moreover, the experiment results indicate that our proposed pre-training process can improve different performances of different downstream loss functions for the OSR problem.
    Nonparametric likelihood-free inference with Jensen-Shannon divergence for simulator-based models with categorical output. (arXiv:2205.10890v2 [stat.ME] UPDATED)
    Likelihood-free inference for simulator-based statistical models has recently attracted a surge of interest, both in the machine learning and statistics communities. The primary focus of these research fields has been to approximate the posterior distribution of model parameters, either by various types of Monte Carlo sampling algorithms or deep neural network-based surrogate models. Frequentist inference for simulator-based models has been given much less attention to date, despite that it would be particularly amenable to applications with big data where implicit asymptotic approximation of the likelihood is expected to be accurate and can leverage computationally efficient strategies. Here we derive a set of theoretical results to enable estimation, hypothesis testing and construction of confidence intervals for model parameters using asymptotic properties of the Jensen--Shannon divergence. Such asymptotic approximation offers a rapid alternative to more computation-intensive approaches and can be attractive for diverse applications of simulator-based models.
    Domain Adversarial Graph Convolutional Network Based on RSSI and Crowdsensing for Indoor Localization. (arXiv:2204.05184v2 [cs.NI] UPDATED)
    In recent years, due to the wider WiFi coverage and the popularization of mobile communication devices, the technology of indoor positioning using WiFi fingerprints has been rapidly developed. Currently, most supervised methods need to collect a large amount of data to construct fingerprint datasets, which is labor-intensive and time-consuming. In addition, many studies focused on the ideal laboratory environment and lack the consideration in the practical application environment, especially in the scenario of multiple large multi-floor buildings. To solve these problems, we proposed a novel WiDAGCN model which can be trained by a few labeled site survey data and unlabeled crowdsensing WiFi fingerprints. To comprehensively represent the topology structure of the data, we constructed heterogeneous graphs according to the received signal strength indicators (RSSIs) between the waypoints and WiFi access points (APs). Moreover, previous WiFi indoor localization studies rarely involved complete graph feature representation, thus we use graph convolutional network (GCN) to extract graph-level embeddings. There are also some difficult problems, for example, a large amount of unlabeled data that cannot be applied to a supervised model, and the existence of multiple data domains leads to inconsistency in data distribution. Therefore, a semi-supervised domain adversarial training scheme was used to make full use of unlabeled data and align the data distribution of different domains. A public indoor localization dataset containing different buildings was used to evaluate the performance of the model. The experimental results show that our system can achieve a competitive localization accuracy in large buildings such as shopping malls.
    A Multi-Stage Duplex Fusion ConvNet for Aerial Scene Classification. (arXiv:2203.16325v2 [cs.CV] UPDATED)
    Existing deep learning based methods effectively promote the performance of aerial scene classification. However, due to the large number of parameters and the computational cost, it is rather difficult to apply these methods to multiple real-time remote sensing applications such as on-board data perception on drones and satellites. In this paper, we address this task by developing a light-weight ConvNet named multi-stage duplex fusion network (MSDF-Net). The key idea is to use as few parameters as possible while obtaining as strong a scene representation capability as possible. To this end, a residual-dense duplex fusion strategy is developed to enhance the feature propagation while re-using parameters as much as possible, and is realized by our duplex fusion block (DFblock). Specifically, our MSDF-Net consists of multi-stage structures with DFblock. Moreover, a duplex semantic aggregation (DSA) module is developed to mine the remote sensing scene information from extracted convolutional features, which also contains two parallel branches for semantic description. Extensive experiments are conducted on three widely-used aerial scene classification benchmarks, and show that our MSDF-Net can achieve competitive performance against the recent state of the art while reducing the number of parameters by up to 80%. In particular, an accuracy of 92.96% is achieved on AID with only 0.49M parameters.
    Near-Optimal Sparse Allreduce for Distributed Deep Learning. (arXiv:2201.07598v2 [cs.DC] UPDATED)
    Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
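    The idea of selecting top-$k$ gradient values with a threshold instead of a full sort can be sketched as follows. The abstract does not say how O$k$-Top$k$ estimates its threshold, so the sample-based estimate, the function names and the toy gradient below are assumptions for illustration only and are not the paper's algorithm.

```python
import random


def estimate_threshold(grad, k, sample_size, rng):
    """Guess the magnitude of the k-th largest entry from a random sample."""
    sample = rng.sample(grad, min(sample_size, len(grad)))
    magnitudes = sorted((abs(g) for g in sample), reverse=True)
    # Keep the same fraction k / len(grad) of the sample.
    rank = max(1, round(k * len(sample) / len(grad)))
    return magnitudes[rank - 1]


def sparsify(grad, threshold):
    """Return (index, value) pairs whose magnitude reaches the threshold."""
    return [(i, g) for i, g in enumerate(grad) if abs(g) >= threshold]


grad = [0.1, -5.0, 0.2, 3.0, -0.05, 0.3]
rng = random.Random(0)
# Sampling every entry makes the estimate exact for this toy gradient.
threshold = estimate_threshold(grad, k=2, sample_size=len(grad), rng=rng)
sparse_grad = sparsify(grad, threshold)
```

    Once a threshold is available, selection is a single linear pass over the gradient, and only the surviving (index, value) pairs would need to be communicated. With a smaller sample the number of selected entries is only approximately $k$.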
    ChemicalX: A Deep Learning Library for Drug Pair Scoring. (arXiv:2202.05240v3 [cs.LG] UPDATED)
    In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed for providing a range of state of the art models to solve the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework. The design of ChemicalX reuses existing high level model training utilities, geometric deep learning, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies to highlight the characteristics of ChemicalX. A range of experiments on real world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX could be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware.
    Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations. (arXiv:2204.04773v2 [stat.ML] UPDATED)
    Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while study of the richer model of imperfectly observed contextual bandits is immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the above efficiency of Greedy policies is also provided.
    Lorentzian Fully Hyperbolic Generative Adversarial Network. (arXiv:2201.12825v2 [cs.LG] UPDATED)
    With the recent advance of deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. While a variety of hyperbolic neural network structures have been proposed, they mainly focus on discriminative tasks, and generative models in the hyperbolic space have scarcely been studied. In this work, we propose a hyperbolic generative adversarial network (GAN) within the Lorentz model for generating hyperbolic data. In addition to existing hyperbolic operations, we design novel hyperbolic layers to guarantee stable training. We first use synthetic data to show that our network is able to learn simple distribution in the hyperbolic space. Moreover, by virtue of an autoencoder, we construct a neural network model, named HAEGAN, for generating more complex data in the hyperbolic space. HAEGAN contains three parts: first, a hyperbolic autoencoder; second, a hyperbolic GAN for generating the latent embedding of the autoencoder; third, a generator that inherits the decoder from autoencoder and the generator from the GAN. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Incremental Inference on Higher-Order Probabilistic Graphical Models Applied to Constraint Satisfaction Problems. (arXiv:2202.12916v2 [cs.LG] UPDATED)
    Probabilistic graphical models (PGMs) are tools for solving complex probabilistic relationships. However, suboptimal PGM structures are primarily used in practice. This dissertation presents three contributions to the PGM literature. The first is a comparison between factor graphs and cluster graphs on graph colouring problems such as Sudoku, which indicates a significant advantage for cluster graphs. The second is an application of cluster graphs to a practical problem in cartography: land cover classification boosting. The third is a PGM formulation for constraint satisfaction problems, together with an algorithm called purge-and-merge that solves such problems when they are too complex for traditional PGMs.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v2 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients. (arXiv:2201.01601v2 [cs.LG] UPDATED)
    Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training that is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, the conventional FL training protocol, which treats all data equally, wastes local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting privacy and computational capabilities of clients. To better utilize the sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client training data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.20~4.48x while improving the model accuracy by 1.1~5.0%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
    Domain-informed neural networks for interaction localization within astroparticle experiments. (arXiv:2112.07995v2 [hep-ex] UPDATED)
    This work proposes a domain-informed neural network architecture for experimental particle physics, using particle interaction localization with the time-projection chamber (TPC) technology for dark matter research as an example application. A key feature of the signals generated within the TPC is that they allow localization of particle interactions through a process called reconstruction. While multilayer perceptrons (MLPs) have emerged as a leading contender for reconstruction in TPCs, such a black-box approach does not reflect prior knowledge of the underlying scientific processes. This paper looks anew at neural network-based interaction localization and encodes prior detector knowledge, in terms of both signal characteristics and detector geometry, into the feature encoding and the output layers of a multilayer neural network. The resulting Domain-informed Neural Network (DiNN) limits the receptive fields of the neurons in the initial feature encoding layers in order to account for the spatially localized nature of the signals produced within the TPC. This aspect of the DiNN, which has similarities with the emerging area of graph neural networks in that the neurons in the initial layers only connect to a handful of neurons in their succeeding layer, significantly reduces the number of parameters in the network in comparison to an MLP. In addition, in order to account for the detector geometry, the output layers of the network are modified using two geometric transformations to ensure the DiNN produces localizations within the interior of the detector. The end result is a neural network architecture that has 60% fewer parameters than an MLP, but that still achieves similar localization performance and provides a path to future architectural developments with improved performance because of their ability to encode additional domain knowledge into the architecture.
    Nonlinear Transform Source-Channel Coding for Semantic Communications. (arXiv:2112.10961v2 [cs.IT] UPDATED)
    In this paper, we propose a class of high-efficiency deep joint source-channel coding methods that can closely adapt to the source distribution under the nonlinear transform; we collect these methods under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to optimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across test image sources with various resolutions, we find that the proposed NTSCC transmission method generally outperforms both the analog transmission using the standard deep joint source-channel coding and the classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its content-aware ability and perceptual optimization goal.
    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks. (arXiv:2108.11489v2 [stat.ML] UPDATED)
    The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.
    Classification of COVID-19 on chest X-Ray images using Deep Learning model with Histogram Equalization and Lungs Segmentation. (arXiv:2112.02478v2 [eess.IV] UPDATED)
    Background and Objective: Artificial intelligence (AI) methods coupled with biomedical analysis have a critical role during pandemics, as they help to relieve the overwhelming pressure on healthcare systems and physicians. As the ongoing COVID-19 crisis worsens in countries with dense populations and inadequate testing kits, like Brazil and India, radiological imaging can act as an important diagnostic tool to accurately classify COVID-19 patients and prescribe the necessary treatment in due time. With this motivation, we present our study based on a deep learning architecture for detecting COVID-19 infected lungs using chest X-rays. Dataset: We collected a total of 2470 images for three different class labels, namely, healthy lungs, ordinary pneumonia, and COVID-19 infected pneumonia, out of which 470 X-ray images belong to the COVID-19 category. Methods: We first pre-process all the images using histogram equalization techniques and segment them using the U-net architecture. A VGG-16 network is then used for feature extraction from the pre-processed images, and the extracted features are further sampled by the SMOTE oversampling technique to achieve a balanced dataset. Finally, the class-balanced features are classified using a support vector machine (SVM) classifier with 10-fold cross-validation and the accuracy is evaluated. Result and Conclusion: Our novel approach, combining well-known pre-processing techniques, feature extraction methods, and a dataset balancing method, leads us to an outstanding rate of recognition of 98% for COVID-19 images over a dataset of 2470 X-ray images. Our model is therefore fit to be utilized in healthcare facilities for screening purposes.
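    The balancing step mentioned in this abstract relies on SMOTE, which creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation idea on toy 2-D vectors using the Python standard library; it is not the authors' pipeline, and the function name `smote_like`, the default `k`, and the toy data are assumptions for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority samples, sorted from nearest to farthest.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage: four hypothetical minority feature vectors on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6)
```

    Because every synthetic point is a convex combination of two existing samples, the new points stay inside the region spanned by the minority class.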
    Evaluating Generalization in Classical and Quantum Generative Models. (arXiv:2201.08770v2 [cs.LG] UPDATED)
    Defining and accurately measuring generalization in generative models remains an ongoing challenge and a topic of active research within the machine learning community. This is in contrast to discriminative models, where there is a clear definition of generalization, i.e., the model's classification accuracy when faced with unseen data. In this work, we construct a simple and unambiguous approach to evaluate the generalization capabilities of generative models. Using the sample-based generalization metrics proposed here, any generative model, from state-of-the-art classical generative models such as GANs to quantum models such as Quantum Circuit Born Machines, can be evaluated on the same footing within a concrete, well-defined framework. In contrast to other sample-based metrics for probing generalization, we leverage constrained optimization problems (e.g., cardinality constrained problems) and use these discrete datasets to define specific metrics capable of unambiguously measuring the quality of the samples and the model's generalization capabilities for generating data beyond the training set but still within the valid solution space. Additionally, our metrics can diagnose trainability issues such as mode collapse and overfitting, as we illustrate when comparing GANs to quantum-inspired models built out of tensor networks. Our simulation results show that our quantum-inspired models have up to a $68 \times$ enhancement in generating unseen unique and valid samples compared to GANs, and a ratio of 61:2 for generating samples with better quality than those observed in the training set. We foresee these metrics as valuable tools for rigorously defining practical quantum advantage in the domain of generative modeling.
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v3 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.
    Improved Fine-Tuning by Better Leveraging Pre-Training Data. (arXiv:2111.12292v2 [cs.CV] UPDATED)
    As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that, in some vision tasks, training from scratch achieves final performance no worse than this pre-training strategy once the number of training samples is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis by using the excess risk bound, which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. The observation inspires us to leverage pre-training data for fine-tuning, since this data is also available for fine-tuning. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when the appropriate pre-training data is included in fine-tuning. With the theoretical motivation, we propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data selection based fine-tuning pipeline.
    DeepTrack: Lightweight Deep Learning for Vehicle Path Prediction in Highways. (arXiv:2108.00505v2 [cs.LG] UPDATED)
    Vehicle trajectory prediction is essential for enabling safety-critical intelligent transportation systems (ITS) applications used in management and operations. While there have been some promising advances in the field, there is a need for modern deep learning algorithms that allow real-time trajectory prediction on embedded IoT devices. This article presents DeepTrack, a novel deep learning algorithm customized for real-time vehicle trajectory prediction and monitoring applications in arterial management, freeway management, traffic incident management, and work zone management for high-speed incoming traffic. In contrast to previous methods, the vehicle dynamics are encoded using Temporal Convolutional Networks (TCNs) to provide more robust time prediction with less computation. DeepTrack also uses depthwise convolution, which reduces the complexity of models compared to existing approaches in terms of model size and operations. Overall, our experimental results demonstrate that DeepTrack achieves comparable accuracy to state-of-the-art trajectory prediction models but with smaller model sizes and lower computational complexity, making it more suitable for real-world deployment.
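    The model-size reduction attributed to depthwise convolution can be made concrete with a parameter count. The sketch below compares a standard 1-D convolution with a depthwise convolution followed by a pointwise (kernel size 1) convolution. Pairing the depthwise layer with a pointwise layer is an assumption based on the common depthwise-separable design, not something the abstract states. Biases are ignored, and the channel and kernel sizes are illustrative.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1-D convolution (no bias)."""
    return c_in * c_out * kernel


def depthwise_separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise conv followed by a pointwise conv (no bias)."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # kernel-size-1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer: 64 input channels, 64 output channels, kernel size 3.
full = standard_conv1d_params(64, 64, 3)
separable = depthwise_separable_conv1d_params(64, 64, 3)
ratio = separable / full
```

    For this illustrative layer the separable variant needs 4288 weights instead of 12288, roughly a third of the original count.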
    Trustworthy AI: From Principles to Practices. (arXiv:2110.01167v2 [cs.AI] UPDATED)
    The rapid development of Artificial Intelligence (AI) technology has enabled the deployment of various systems based on it. However, many current AI systems are found to be vulnerable to imperceptible attacks, biased against underrepresented groups, and lacking in user privacy protection. These shortcomings degrade user experience and erode people's trust in all AI systems. In this review, we provide AI practitioners with a comprehensive guide for building trustworthy AI systems. We first introduce the theoretical framework of important aspects of AI trustworthiness, including robustness, generalization, explainability, transparency, reproducibility, fairness, privacy preservation, and accountability. To unify currently available but fragmented approaches toward trustworthy AI, we organize them in a systematic approach that considers the entire lifecycle of AI systems, ranging from data acquisition to model development, to system development and deployment, finally to continuous monitoring and governance. In this framework, we offer concrete action items for practitioners and societal stakeholders (e.g., researchers, engineers, and regulators) to improve AI trustworthiness. Finally, we identify key opportunities and challenges for the future development of trustworthy AI systems, where we identify the need for a paradigm shift toward comprehensively trustworthy AI systems.
    FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization. (arXiv:2111.14655v2 [cs.LG] UPDATED)
    One underlying assumption of recent federated learning (FL) paradigms is that all local models share the same network architecture and size, which becomes impractical for devices with different hardware resources. A scalable federated learning framework should address the heterogeneity arising from clients having different computing capacities and communication capabilities. To this end, this paper proposes FedHM, a novel heterogeneous federated model compression framework, distributing the heterogeneous low-rank models to clients and then aggregating them into a full-rank model. Our solution enables the training of heterogeneous models with varying computational complexities and aggregates them into a single global model. Furthermore, FedHM significantly reduces the communication cost by using low-rank models. Extensive experimental results demonstrate that FedHM is superior in the performance and robustness of models of different sizes, compared with state-of-the-art heterogeneous FL methods under various FL settings. Additionally, the convergence guarantee of FL for heterogeneous devices is theoretically analyzed for the first time.
    Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual Observations. (arXiv:2109.14026v2 [cs.RO] UPDATED)
    To proactively navigate and traverse various terrains, active use of visual perception becomes indispensable. We aim to investigate the feasibility and performance of using sparse visual observations to achieve perceptual locomotion over a range of common terrains (steps, ramps, gaps, and stairs) in human-centered environments. We formulate a selection of sparse visual inputs suitable for locomotion over the terrains of interest, and propose a learning framework to integrate exteroceptive and proprioceptive states. We specifically design the state observations and a training curriculum to learn feedback control policies effectively over a range of different terrains. We extensively validate and benchmark the learned policy in various tasks: omnidirectional walking on flat ground, and forward locomotion over various obstacles, showing a high traversal success rate. Furthermore, we study exteroceptive ablations and evaluate policy generalization by adding various levels of noise and testing on new unseen terrains. We demonstrate the capabilities of autonomous perceptual locomotion that can be achieved by only using sparse visual observations from direct depth measurements, which are easily available from a Lidar or RGB-D sensor, showing robust ascent and descent over high stairs of 20 cm height, i.e., 50% leg length, and robustness against noise and unseen terrains.
    RKHS-SHAP: Shapley Values for Kernel Methods. (arXiv:2110.09167v2 [stat.ML] UPDATED)
    Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values (SVs), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose RKHS-SHAP, an attribution method for kernel machines that can efficiently compute both Interventional and Observational Shapley values using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations, a key yet often overlooked desideratum for consistent model interpretation. Further, we propose the Shapley regulariser, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific features' contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the SVs of sensitive features.
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v3 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are indeed able to satisfy the first two factors. Moreover, we empirically verify the third factor by conducting various experiments on real-world data, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
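    InfoNCE, one of the two losses studied in this abstract, can be sketched for a single anchor as follows. This uses one common form of the loss, in which the positive pair's similarity appears in both the numerator and the denominator; the paper's exact definition may differ. The function name `info_nce`, the temperature value, and the similarity scores are assumptions for illustration.

```python
import math


def info_nce(sim_pos: float, sim_negs: list, temperature: float = 0.5) -> float:
    """InfoNCE loss for one anchor.

    sim_pos  : similarity between the anchor and its positive sample
    sim_negs : similarities between the anchor and negative samples
    Returns -log( exp(s+/T) / (exp(s+/T) + sum_i exp(s_i/T)) ).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# A well-aligned positive pair gives a smaller loss than a poorly aligned one.
loss_aligned = info_nce(0.9, [0.1, -0.2])
loss_misaligned = info_nce(0.1, [0.9, -0.2])
```

    The loss decreases as the positive similarity grows relative to the negatives, which is the alignment effect the abstract refers to.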
    Amplitude Mean of Functional Data on $\mathbb{S}^2$. (arXiv:2107.13721v4 [stat.ML] UPDATED)
    Manifold-valued functional data analysis (FDA) has recently become an active area of research, motivated by the rising availability of trajectories or longitudinal data observed on non-linear manifolds. The challenges of analyzing such data come from many aspects, including infinite dimensionality and nonlinearity, as well as time-domain or phase variability. In this paper, we study the amplitude part of manifold-valued functions on $\mathbb{S}^2$, which is invariant to random time warping or re-parameterization. Utilizing the nice geometry of $\mathbb{S}^2$, we develop a set of efficient and accurate tools for temporal alignment of functions, geodesic computing, and sample mean calculation. At their heart, these tools rely on gradient descent algorithms with carefully derived gradients. We show the advantages of these newly developed tools over their competitors with extensive simulations and real data and demonstrate the importance of considering the amplitude part of functions instead of mixing it with phase variability in manifold-valued FDA.
    TIP: Task-Informed Motion Prediction for Intelligent Vehicles. (arXiv:2110.08750v2 [cs.RO] UPDATED)
    When predicting trajectories of road agents, motion predictors usually approximate the future distribution by a limited number of samples. This constraint requires the predictors to generate samples that best support the task given task specifications. However, existing predictors are often optimized and evaluated via task-agnostic measures without accounting for the use of predictions in downstream tasks, and thus could result in sub-optimal task performance. In this paper, we propose a task-informed motion prediction model that better supports the tasks through its predictions, by jointly reasoning about prediction accuracy and the utility of the downstream tasks, which is commonly used to evaluate the task performance. The task utility function does not require the full task information, but rather a specification of the utility of the task, resulting in predictors that serve a wide range of downstream tasks. We demonstrate our approach on two use cases of common decision making tasks and their utility functions, in the context of autonomous driving and parallel autonomy. Experimental results show that our predictor produces accurate predictions that improve the task performance by a large margin in both tasks when compared to task-agnostic baselines on the Waymo Open Motion dataset.
    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts. (arXiv:2111.03394v2 [cs.LG] UPDATED)
    Long-range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts that are coherent in terms of base-level and predicted aggregate statistics. We achieve the coherency between predicted base-level and aggregate statistics using a novel inference method based on KL-divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both base level and unseen aggregates post inference on real datasets spanning three diverse domains. (Project URL: https://github.com/pratham16cse/AggForecaster)
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v3 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; a statistical learning machine (SLM) is the algorithm, function, model, or rule that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Cyberphysical systems such as automatic target recognition (ATR) in military applications, computer aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, stock market prediction, etc., are a few examples of applications of ML: diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of the threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: construction and assessment. By construction we mean designing or inventing an appropriate algorithm that learns from the input data and achieves a good performance according to some optimality criterion. By assessment we mean attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm.
    MixR: Data Mixing Augmentation for Regression. (arXiv:2106.03374v3 [cs.LG] UPDATED)
    Data augmentation is becoming essential for improving regression accuracy in critical applications including manufacturing, climate prediction, and finance. Existing techniques for data augmentation largely focus on classification tasks and do not readily apply to regression tasks. In particular, the recent Mixup techniques for classification have succeeded in improving the model performance, which is reasonable due to the characteristics of the classification task, but they have limitations in regression. We show that mixing examples that have large data distances using linear interpolations may have increasingly negative effects on model performance. Hence, we use the stricter assumption that linearity only holds within certain data distances for regression, where the degree may vary by each example. We then propose MixR, a data augmentation framework for regression that learns for each example how many nearest neighbors it should be mixed with for the best model performance using a validation set. Our experiments conducted both on synthetic and real datasets show that MixR significantly outperforms state-of-the-art data augmentation baselines applicable to regression. MixR can also be integrated with existing Mixup techniques to significantly improve their performance.
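    The idea of restricting linear mixing to nearby examples can be sketched as follows. This is not MixR itself: the abstract says MixR learns the number of neighbours per example from a validation set, whereas this sketch uses one fixed `k` for all examples and a fixed mixing weight `lam`. Both simplifications, the function name `mix_with_neighbours`, and the toy data are assumptions for illustration.

```python
import math


def mix_with_neighbours(xs, ys, k, lam=0.5):
    """Mix each example with each of its k nearest neighbours.

    xs  : list of feature vectors
    ys  : list of scalar regression targets
    lam : weight on the original example (1 - lam on the neighbour)
    Returns a list of (mixed_features, mixed_target) pairs.
    """
    mixed = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Indices of the other examples, nearest first.
        order = sorted((j for j in range(len(xs)) if j != i),
                       key=lambda j: math.dist(x, xs[j]))
        for j in order[:k]:
            x_mix = [lam * a + (1 - lam) * b for a, b in zip(x, xs[j])]
            y_mix = lam * y + (1 - lam) * ys[j]
            mixed.append((x_mix, y_mix))
    return mixed


# Toy data: two close points and one distant outlier.
xs = [[0.0], [1.0], [10.0]]
ys = [0.0, 1.0, 100.0]
augmented = mix_with_neighbours(xs, ys, k=1)
```

    With `k=1`, the two close points are mixed only with each other, so the distant outlier does not distort their interpolated targets.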
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v1 [stat.ML])
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between the algorithm, which requires a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
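    The two thresholds quoted in this abstract can be evaluated numerically to see the size of the gap. The sketch below simply plugs values into the stated expressions. The information-theoretic expression is given only as an approximation in the abstract, so the numbers are indicative; the chosen values of $k$, $\rho$ and $\alpha$ are assumptions for illustration.

```python
import math


def lambda_alg(k: int, alpha: float) -> float:
    """Algorithmic threshold k / sqrt(alpha) from the abstract."""
    return k / math.sqrt(alpha)


def lambda_it(k: int, rho: float, alpha: float) -> float:
    """Approximate information-theoretic threshold
    sqrt(-k * rho * log(rho)) / sqrt(alpha), valid for 0 < rho < 1."""
    return math.sqrt(-k * rho * math.log(rho)) / math.sqrt(alpha)


# Illustrative setting: k = 2 clusters, sparsity rho = 0.1, alpha = 1.
alg = lambda_alg(2, 1.0)
it = lambda_it(2, 0.1, 1.0)
gap = alg - it
```

    For these illustrative values the algorithmic threshold (2.0) sits well above the approximate information-theoretic one (about 0.68), which is the statistical-to-computational gap the abstract describes.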
    Verifying Learning-Based Robotic Navigation Systems. (arXiv:2205.13536v1 [cs.RO])
    Deep reinforcement learning (DRL) has become a dominant deep-learning paradigm for various tasks in which complex policies are learned within reactive systems. In parallel, there has recently been significant research on verifying deep neural networks. However, to date, there has been little work demonstrating the use of modern verification tools on real, DRL-controlled systems. In this case-study paper, we attempt to begin bridging this gap, and focus on the important task of mapless robotic navigation -- a classic robotics problem, in which a robot, usually controlled by a DRL agent, needs to efficiently and safely navigate through an unknown arena towards a desired target. We demonstrate how modern verification engines can be used for effective model selection, i.e., the process of selecting the best available policy for the robot in question from a pool of candidate policies. Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior, such as collisions and infinite loops. We also apply verification to identify models with overly conservative behavior, thus allowing users to choose superior policies that are better at finding an optimal, shorter path to a target. To validate our work, we conducted extensive experiments on an actual robot, and confirmed that the suboptimal policies detected by our method were indeed flawed. We also compared our verification-driven approach to state-of-the-art gradient attacks, and our results demonstrate that gradient-based methods are inadequate in this setting. Our work is the first to demonstrate the use of DNN verification backends for recognizing suboptimal DRL policies in real-world robots, and for filtering out unwanted policies. We believe that the methods presented in this work can be applied to a large range of application domains that incorporate deep-learning-based agents.
    Steerable 3D Spherical Neurons. (arXiv:2106.13863v6 [cs.CV] UPDATED)
    Emerging from low-level vision theory, steerable filters found their counterpart in prior work on steerable convolutional neural networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of neurons with spherical decision surfaces and operates on point clouds. Such spherical neurons are obtained by conformal embedding of Euclidean space and have recently been revisited in the context of learning representations of point sets. Focusing on 3D geometry, we exploit the isometry property of spherical neurons and derive a 3D steerability constraint. After training spherical neurons to classify point clouds in a canonical orientation, we use a tetrahedron basis to quadruplicate the neurons and construct rotation-equivariant spherical filter banks. We then apply the derived constraint to interpolate the filter bank outputs and, thus, obtain a rotation-invariant network. Finally, we use a synthetic point set and real-world 3D skeleton data to verify our theoretical findings.
    Ranking the information content of distance measures. (arXiv:2104.15079v2 [stat.ML] UPDATED)
    Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features but still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide ranging in many branches of science.
    Mitigating barren plateaus of variational quantum eigensolvers. (arXiv:2205.13539v1 [quant-ph])
    Variational quantum algorithms (VQAs) are expected to establish valuable applications on near-term quantum computers. However, recent works have pointed out that the performance of VQAs greatly relies on the capability of the ansatzes and is seriously limited by optimization issues such as barren plateaus (i.e., vanishing gradients). This work proposes the state efficient ansatz (SEA) for accurate quantum dynamics simulations with improved trainability. First, we show that SEA can generate an arbitrary pure state with much fewer parameters than a universal ansatz, making it efficient for tasks like ground state estimation. It also has the flexibility in adjusting the entanglement of the prepared state, which could be applied to further improve the efficiency of simulating weak entanglement. Second, we show that SEA is not a unitary 2-design even if it has universal wavefunction expressibility and thus has great potential to improve the trainability by avoiding the zone of barren plateaus. We further investigate a plethora of examples in ground state estimation and notably obtain significant improvements in the variances of derivatives and the overall optimization behaviors. This result indicates that SEA can mitigate barren plateaus by sacrificing the redundant expressibility for the target problem.
    Epistemic Neural Networks. (arXiv:2107.08924v3 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of \textit{joint} predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the \textit{epistemic neural network} (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the \textit{epinet}: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    TrustyAI Explainability Toolkit. (arXiv:2104.12717v2 [cs.AI] UPDATED)
    Artificial intelligence (AI) is becoming increasingly popular and can be found in workplaces and homes around the world. The decisions made by such "black box" systems are often opaque; that is, so complex as to be functionally impossible to understand. How do we ensure that these systems are behaving as desired? TrustyAI is an initiative which looks into explainable artificial intelligence (XAI) solutions to address this issue of explainability in the context of both AI models and decision services. This paper presents the TrustyAI Explainability Toolkit, a Java and Python library that provides XAI explanations of decision services and predictive models for both enterprise and data science use-cases. We describe the TrustyAI implementations and extensions to techniques such as LIME, SHAP and counterfactuals, which are benchmarked against existing implementations in a variety of experiments.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v4 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is as important as the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap and cross validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of an ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.
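As a rough illustration of why the AUC is a two-sample statistic, the sketch below computes it as the fraction of (positive, negative) score pairs that are ranked correctly, then bootstraps the two samples separately. It is a simplification and not the chapter's procedure: it resamples fixed classifier scores instead of retraining after each resample, and the function names are hypothetical.

```python
import random


def auc(pos_scores, neg_scores):
    """Mann-Whitney form of the AUC: share of correctly ordered pairs, ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bootstrap_auc(pos_scores, neg_scores, n_resamples=200, seed=0):
    """Return (mean, standard deviation) of the AUC over bootstrap resamples.

    Positives and negatives are resampled independently, with replacement,
    which reflects the two-sample nature of the statistic.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        pos = rng.choices(pos_scores, k=len(pos_scores))
        neg = rng.choices(neg_scores, k=len(neg_scores))
        estimates.append(auc(pos, neg))
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
    return mean, var ** 0.5


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.4]  # illustrative scores, not real data
    negatives = [0.1, 0.5, 0.3]
    print(auc(positives, negatives), bootstrap_auc(positives, negatives))
```

A full resampling study in the sense of the chapter would retrain the classifier inside the loop, which is where the computational cost mentioned above comes from.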
    TempoRL: Temporal Priors for Exploration in Off-Policy Reinforcement Learning. (arXiv:2205.13528v1 [cs.LG])
    Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlation between actions and directedness. Therefore, we introduce state-independent temporal priors, which directly model temporal consistency in demonstrated trajectories, and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings.
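Sampling from a probabilistic mixture of a policy and an action prior can be sketched as a two-stage draw: first pick the component, then sample an action from it. The snippet below is a minimal illustration under assumptions. The mixing weight `prior_weight` is a fixed hypothetical parameter, whereas the abstract says the integration is dynamic, and both samplers are placeholder callables.

```python
import random


def sample_action(policy_sample, prior_sample, prior_weight, rng):
    """Draw one action from the mixture prior_weight * prior + (1 - prior_weight) * policy.

    policy_sample and prior_sample are zero-argument callables returning an action.
    """
    if not 0.0 <= prior_weight <= 1.0:
        raise ValueError("prior_weight must be in [0, 1]")
    # rng.random() lies in [0, 1), so weight 0 never picks the prior and weight 1 always does.
    if rng.random() < prior_weight:
        return prior_sample()
    return policy_sample()


if __name__ == "__main__":
    rng = random.Random(0)
    policy = lambda: rng.gauss(0.0, 1.0)  # placeholder for a learned policy
    prior = lambda: rng.gauss(1.0, 0.1)   # placeholder for a temporal prior
    print([round(sample_action(policy, prior, 0.3, rng), 3) for _ in range(5)])
```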
    Spherical Message Passing for 3D Graph Networks. (arXiv:2102.05013v3 [cs.LG] UPDATED)
    We consider representation learning from 3D graphs in which each node is associated with a spatial position in 3D. This is an underexplored area of research, and a principled framework is currently lacking. In this work, we propose a generic framework, known as the 3D graph network (3DGN), to provide a unified interface at different levels of granularity for 3D graphs. Built on 3DGN, we propose the spherical message passing (SMP) as a novel and specific scheme for realizing the 3DGN framework in the spherical coordinate system (SCS). We conduct formal analyses and show that the relative location of each node in 3D graphs is uniquely defined in the SMP scheme. Thus, our SMP represents a complete and accurate architecture for learning from 3D graphs in the SCS. We derive physically-based representations of geometric information and propose the SphereNet for learning representations of 3D graphs. We show that existing 3D deep models can be viewed as special cases of the SphereNet. Experimental results demonstrate that the use of complete and accurate 3D information in 3DGN and SphereNet leads to significant performance improvements in prediction tasks.
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v1 [cs.LG])
    Selective classification is the task of rejecting inputs on which a model would predict incorrectly, trading off input space coverage against model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks. For example, we improve coverage on CIFAR-10/SVHN by 10.1%/1.5% respectively at a fixed target error of 0.5%.
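The idea of tracking when intermediate predictions stop disagreeing with the final label can be sketched as follows. This is an illustrative reading of the abstract, not the authors' scoring rule: the function names and the hard threshold `max_step` are assumptions, and the input is simply the list of labels predicted for one test point by successive training checkpoints.

```python
def last_disagreement(checkpoint_preds):
    """Index of the last checkpoint whose label differs from the final one, or -1 if none."""
    final = checkpoint_preds[-1]
    last = -1
    for i, pred in enumerate(checkpoint_preds):
        if pred != final:
            last = i
    return last


def accept(checkpoint_preds, max_step):
    """Accept the input only if predictions settled on the final label by checkpoint max_step."""
    return last_disagreement(checkpoint_preds) <= max_step


if __name__ == "__main__":
    stable = [0, 1, 1, 1]    # settles early, likely accepted
    unstable = [0, 1, 0, 1]  # still flipping late in training, likely rejected
    print(accept(stable, max_step=1), accept(unstable, max_step=1))
```

Raising `max_step` accepts more inputs (higher coverage) at the cost of admitting points whose predictions stabilized late.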
    Revealing the Dark Secrets of Masked Image Modeling. (arXiv:2205.13543v1 [cs.CV])
    Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, the visualizations and the experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, but supervised models tend to focus locally at lower layers but more globally at higher layers. That may be the reason why MIM helps Vision Transformers that have a very large receptive field to optimize. Using MIM, the model can maintain a large diversity on attention heads in all layers. But for supervised models, the diversity on attention heads almost disappears from the last three layers and less diversity harms the fine-tuning performance. From the experiments, we find that MIM models can perform significantly better on geometric and motion tasks with weak semantics or fine-grained classification tasks, than their supervised counterparts. Without bells and whistles, a standard MIM pre-trained SwinV2-L could achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For the semantic understanding datasets where the categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.
    Memory AMP. (arXiv:2012.10861v6 [cs.IT] UPDATED)
    Approximate message passing (AMP) is a low-cost iterative parameter-estimation technique for certain high-dimensional linear systems with non-Gaussian distributions. However, AMP applies only to independent identically distributed (IID) transform matrices and may become unreliable (e.g., perform poorly or even diverge) for other matrix ensembles, especially for ill-conditioned ones. Orthogonal/vector AMP (OAMP/VAMP) was proposed for general right-unitarily-invariant matrices to handle this difficulty. However, the Bayes-optimal OAMP/VAMP (BO-OAMP/VAMP) requires a high-complexity linear minimum mean square error (MMSE) estimator. This limits the application of OAMP/VAMP to large-scale systems. To overcome the disadvantages of AMP and BO-OAMP/VAMP, this paper proposes a memory AMP (MAMP) framework under an orthogonality principle, which guarantees the asymptotic IID Gaussianity of estimation errors in MAMP. We present an orthogonalization procedure for the local memory estimators to realize the required orthogonality for MAMP. Furthermore, we propose a Bayes-optimal MAMP (BO-MAMP), in which a long-memory matched filter is proposed for interference suppression. The complexity of BO-MAMP is comparable to AMP. A state evolution is derived to asymptotically characterize the performance of BO-MAMP. Based on state evolution, the relaxation parameters and damping vector in BO-MAMP are optimized. For all right-unitarily-invariant matrices, the state evolution of the optimized BO-MAMP converges to the same fixed point as that of the high-complexity BO-OAMP/VAMP and is Bayes-optimal if its state evolution has a unique fixed point. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Semantic Parsing of Interpage Relations. (arXiv:2205.13530v1 [cs.LG])
    Page-level analysis of documents has been a topic of interest in digitization efforts, and multimodal approaches have been applied to both classification and page stream segmentation. In this work, we focus on capturing finer semantic relations between pages of a multi-page document. To this end, we formalize the task as semantic parsing of interpage relations and we propose an end-to-end approach for interpage dependency extraction, inspired by the dependency parsing literature. We further design a multi-task training approach to jointly optimize for page embeddings to be used in segmentation, classification, and parsing of the page dependencies using textual and visual features extracted from the pages. Moreover, we also combine the features from two modalities to obtain multimodal page embeddings. To the best of our knowledge, this is the first study to extract rich semantic interpage relations from multi-page documents. Our experimental results show that the proposed method increased LAS by 41 percentage points for semantic parsing, increased accuracy by 33 percentage points for page stream segmentation, and 45 percentage points for page classification over a naive baseline.
    Transfer learning driven design optimization for inertial confinement fusion. (arXiv:2205.13519v1 [physics.plasm-ph])
    Transfer learning is a promising approach to creating predictive models that incorporate simulation and experimental data into a common framework. In this technique, a neural network is first trained on a large database of simulations, then partially retrained on sparse sets of experimental data to adjust predictions to be more consistent with reality. Previously, this technique has been used to create predictive models of Omega and NIF inertial confinement fusion (ICF) experiments that are more accurate than simulations alone. In this work, we conduct a transfer learning driven hypothetical ICF campaign in which the goal is to maximize experimental neutron yield via Bayesian optimization. The transfer learning model achieves yields within 5% of the maximum achievable yield in a modest-sized design space in fewer than 20 experiments. Furthermore, we demonstrate that this method is more efficient at optimizing designs than traditional model calibration techniques commonly employed in ICF design. Such an approach to ICF design could enable robust optimization of experimental performance under uncertainty.
    Sparse Graph Learning for Spatiotemporal Time Series. (arXiv:2205.13492v1 [cs.LG])
    Outstanding achievements of graph neural networks for spatiotemporal time series prediction show that relational constraints introduce a positive inductive bias into neural forecasting architectures. Often, however, the relational information characterizing the underlying data generating process is unavailable; the practitioner is then left with the problem of inferring from data which relational graph to use in the subsequent processing stages. We propose novel, principled -- yet practical -- probabilistic methods that learn the relational dependencies by modeling distributions over graphs while maximizing, at the same time, end-to-end the forecasting accuracy. Our novel graph learning approach, based on consolidated variance reduction techniques for Monte Carlo score-based gradient estimation, is theoretically grounded and effective. We show that tailoring the gradient estimators to the graph learning problem also allows us to achieve state-of-the-art forecasting performance while controlling, at the same time, both the sparsity of the learned graph and the computational burden. We empirically assess the effectiveness of the proposed method on synthetic and real-world benchmarks, showing that the proposed solution can be used as a stand-alone graph identification procedure as well as a learned component of an end-to-end forecasting architecture.
    AutoTSG: Learning and Synthesis for Incident Troubleshooting. (arXiv:2205.13457v1 [cs.SE])
    Incident management is a key aspect of operating large-scale cloud services. To aid with faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of more than 4K TSGs mapped to 1000s of incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automation of TSGs to executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows the effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
    Pick up the PACE: Fast and Simple Domain Adaptation via Ensemble Pseudo-Labeling. (arXiv:2205.13508v1 [cs.LG])
    Domain Adaptation (DA) has received widespread attention from deep learning researchers in recent years because of its potential to improve test accuracy with out-of-distribution labeled data. Most state-of-the-art DA algorithms require an extensive amount of hyperparameter tuning and are computationally intensive due to the large batch sizes required. In this work, we propose a fast and simple DA method consisting of three stages: (1) domain alignment by covariance matching, (2) pseudo-labeling, and (3) ensembling. We call this method $\textbf{PACE}$, for $\textbf{P}$seudo-labels, $\textbf{A}$lignment of $\textbf{C}$ovariances, and $\textbf{E}$nsembles. PACE is trained on top of fixed features extracted from an ensemble of modern pretrained backbones. PACE exceeds previous state-of-the-art by $\textbf{5 - 10 \%}$ on most benchmark adaptation tasks without training a neural network. PACE reduces training time and hyperparameter tuning time by $82\%$ and $97\%$, respectively, when compared to state-of-the-art DA methods. Code is released here: https://github.com/Chris210634/PACE-Domain-Adaptation
    Training ReLU networks to high uniform accuracy is intractable. (arXiv:2205.13531v1 [cs.LG])
    Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
    Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations. (arXiv:2205.13479v1 [cs.LG])
    Modeling multivariate time series as temporal signals over a (possibly dynamic) graph is an effective representational framework that allows for developing models for time series analysis. In fact, discrete sequences of graphs can be processed by autoregressive graph neural networks to recursively learn representations at each discrete point in time and space. Spatiotemporal graphs are often highly sparse, with time series characterized by multiple, concurrent, and even long sequences of missing data, e.g., due to the unreliable underlying sensor network. In this context, autoregressive models can be brittle and exhibit unstable learning dynamics. The objective of this paper is, then, to tackle the problem of learning effective models to reconstruct, i.e., impute, missing data points by conditioning the reconstruction only on the available observations. In particular, we propose a novel class of attention-based architectures that, given a set of highly sparse discrete observations, learn a representation for points in time and space by exploiting a spatiotemporal diffusion architecture aligned with the imputation task. Representations are trained end-to-end to reconstruct observations w.r.t. the corresponding sensor and its neighboring nodes. Compared to the state of the art, our model handles sparse data without propagating prediction errors or requiring a bidirectional model to encode forward and backward time dependencies. Empirical results on representative benchmarks show the effectiveness of the proposed method.
    Censored Quantile Regression Neural Networks. (arXiv:2205.13496v1 [stat.ML])
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
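For readers unfamiliar with quantile regression, the standard (uncensored) pinball loss and its average over a grid of quantiles are sketched below. This is background only: it does not implement the paper's censoring-aware algorithm, and the function names are hypothetical.

```python
def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss for target y, predicted quantile q_pred, and level tau in (0, 1)."""
    diff = y - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return max(tau * diff, (tau - 1.0) * diff)


def grid_loss(y, q_preds, taus):
    """Average pinball loss over a grid of quantile levels predicted by one model."""
    if len(q_preds) != len(taus):
        raise ValueError("need one prediction per quantile level")
    return sum(pinball_loss(y, q, t) for q, t in zip(q_preds, taus)) / len(taus)


if __name__ == "__main__":
    taus = [0.1, 0.5, 0.9]
    preds = [1.0, 2.0, 3.0]  # illustrative outputs of a single network
    print(grid_loss(2.0, preds, taus))
```

Minimizing the grid loss with a single network is what lets all quantiles be fitted at once, in contrast to training one network per quantile.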
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v1 [q-bio.NC])
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    FedAug: Reducing the Local Learning Bias Improves Federated Learning on Heterogeneous Data. (arXiv:2205.13462v1 [cs.LG])
    Federated Learning (FL) is a machine learning paradigm that learns from data kept locally to safeguard the privacy of clients, whereas local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of the biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedAug, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedAug consists of two components: AugMean and AugCA. AugMean alleviates the bias in the local classifiers by balancing the output distribution of models. AugCA learns client invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedAug consistently outperforms other SOTA FL and domain generalization (DG) baselines, and that both components (i.e., AugMean and AugCA) yield individual performance gains.
    SigMaNet: One Laplacian to Rule Them All. (arXiv:2205.13459v1 [cs.LG])
    This paper introduces SigMaNet, a generalized Graph Convolutional Network (GCN) capable of handling both undirected and directed graphs with weights not restricted in sign and magnitude. The cornerstone of SigMaNet is the introduction of a generalized Laplacian matrix: the Sign-Magnetic Laplacian ($L^\sigma$). The adoption of such a matrix allows us to bridge a gap in the current literature by extending the theory of spectral GCNs to directed graphs with both positive and negative weights. $L^{\sigma}$ exhibits several desirable properties not enjoyed by the traditional Laplacian matrices on which several state-of-the-art architectures are based. In particular, $L^\sigma$ is completely parameter-free, which is not the case of Laplacian operators such as the Magnetic Laplacian $L^{(q)}$, where the calibration of the parameter q is an essential yet problematic component of the operator. $L^\sigma$ simplifies the approach, while also allowing for a natural interpretation of the signs of the edges in terms of their directions. The versatility of the proposed approach is amply demonstrated experimentally; the proposed network SigMaNet turns out to be competitive in all the tasks we considered, regardless of the graph structure.
    SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation. (arXiv:2205.13490v1 [cs.CV])
    Conventional point cloud semantic segmentation methods usually employ an encoder-decoder architecture, where mid-level features are locally aggregated to extract geometric information. However, the over-reliance on these class-agnostic local geometric representations may raise confusion between local parts from different categories that are similar in appearance or spatially adjacent. To address this issue, we argue that mid-level features can be further enhanced with semantic information, and propose semantic-affine transformation that transforms features of mid-level points belonging to different categories with class-specific affine parameters. Based on this technique, we propose SemAffiNet for point cloud semantic segmentation, which utilizes the attention mechanism in the Transformer module to implicitly and explicitly capture global structural knowledge within local parts for overall comprehension of each category. We conduct extensive experiments on the ScanNetV2 and NYUv2 datasets, and evaluate semantic-affine transformation on various 3D point cloud and 2D image segmentation baselines, where both qualitative and quantitative results demonstrate the superiority and generalization ability of our proposed approach. Code is available at https://github.com/wangzy22/SemAffiNet.
    An Analytic Framework for Robust Training of Artificial Neural Networks. (arXiv:2205.13502v1 [cs.LG])
    The reliability of a learning model is key to the successful deployment of machine learning in various industries. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. Consequently, many studies investigate the phenomenon by proposing a simplified model of how adversarial examples occur and validate it by predicting some aspect of the phenomenon. While these studies cover many different characteristics of the adversarial examples, they have not reached a holistic approach to the geometric and analytic modeling of the phenomenon. This paper proposes a formal framework to study the phenomenon in learning theory and makes use of complex analysis and holomorphicity to offer a robust learning rule for artificial neural networks. With the help of complex analysis, we can effortlessly move between geometric and analytic perspectives of the phenomenon and offer further insights on the phenomenon by revealing its connection with harmonic functions. Using our model, we can explain some of the most intriguing characteristics of adversarial examples, including transferability of adversarial examples, and pave the way for novel approaches to mitigate the effects of the phenomenon.
    Mutual Information Divergence: A Unified Metric for Multimodal Generative Models. (arXiv:2205.13445v1 [cs.CV])
    Text-to-image generation and image captioning have recently emerged as a new experimental paradigm to assess machine intelligence. They predict continuous quantities accompanied by sampling techniques during generation, which complicates evaluation and makes the marginal distributions intractable. Based on a recent trend in which multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, which we call Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and the future work based on this novel proposition.
    DeepJoint: Robust Survival Modelling Under Clinical Presence Shift. (arXiv:2205.13481v1 [cs.LG])
    Observational data in medicine arise as a result of the complex interaction between patients and the healthcare system. The sampling process is often highly irregular and itself constitutes an informative process. When using such data to develop prediction models, this phenomenon is often ignored, leading to sub-optimal performance and generalisability of models when practices evolve. We propose a multi-task recurrent neural network which models three clinical presence dimensions -- namely the longitudinal, the inter-observation and the missingness processes -- in parallel to the survival outcome. On a prediction task using MIMIC III laboratory tests, explicit modelling of these three processes showed improved performance in comparison to state-of-the-art predictive models (C-index at 1 day horizon: 0.878). More importantly, the proposed approach was more robust to change in the clinical presence setting, demonstrated by performance comparison between patients admitted on weekdays and weekends. This analysis demonstrates the importance of studying and leveraging clinical presence to improve performance and create more transportable clinical models.
    Machine Learning Models Are Not Necessarily Biased When Constructed Properly: Evidence from Neuroimaging Studies. (arXiv:2205.13421v1 [cs.LG])
    Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide experimental data which support that when properly trained, machine learning models can generalize well across diverse conditions and do not suffer from biases. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that the accuracy of well-trained models is consistent across different subgroups pertaining to attributes such as gender, age, and racial groups, as well as across different clinical studies. We find that models that incorporate multi-source data from demographic, clinical, genetic factors and cognitive scores are also unbiased. These models have better predictive accuracy across subgroups than those trained only with structural measures in some cases, but there are also situations when these additional features do not help.
    A Fair Federated Learning Framework With Reinforcement Learning. (arXiv:2205.13415v1 [cs.LG])
    Federated learning (FL) is a paradigm where many clients collaboratively train a model under the coordination of a central server, while keeping the training data locally stored. However, heterogeneous data distributions over different clients remain a challenge to mainstream FL algorithms, which may cause slow convergence, overall performance degradation and unfairness of performance across clients. To address these problems, in this study we propose a reinforcement learning framework, called PG-FFL, which automatically learns a policy to assign aggregation weights to clients. Additionally, we propose to utilize the Gini coefficient as the measure of fairness for FL. More importantly, we apply the Gini coefficient and validation accuracy of clients in each communication round to construct a reward function for the reinforcement learning. Our PG-FFL is also compatible with many existing FL algorithms. We conduct extensive experiments over diverse datasets to verify the effectiveness of our framework. The experimental results show that our framework can outperform baseline methods in terms of overall performance, fairness and convergence speed.
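    As a rough illustration of the fairness measure named above, the following sketch computes the Gini coefficient of per-client validation accuracies. The function name `gini` and the sample accuracies are assumptions made for illustration; the abstract does not specify how PG-FFL combines this value with accuracy in its reward.

```python
def gini(values):
    """Gini coefficient of non-negative values; 0.0 means all values are equal."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs,
    # normalised by 2 * n^2 * mean, which equals 2 * n * total.
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


# Hypothetical per-client validation accuracies for one communication round.
client_accuracies = [0.91, 0.88, 0.62, 0.95]
unfairness = gini(client_accuracies)  # lower means more uniform performance
```

    A lower value indicates that accuracy is spread more evenly across clients, which is the sense of fairness the abstract refers to.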
    Avoiding Barren Plateaus with Classical Deep Neural Networks. (arXiv:2205.13418v1 [quant-ph])
    Variational quantum algorithms (VQAs) are among the most promising algorithms in the era of Noisy Intermediate Scale Quantum Devices. The VQAs are applied to a variety of tasks, such as in chemistry simulations, optimization problems, and quantum neural networks. Such algorithms are constructed using a parameterization U($\pmb{\theta}$) with a classical optimizer that updates the parameters $\pmb{\theta}$ in order to minimize a cost function $C$. For this task, in general the gradient descent method, or one of its variants, is used. This is a method where the circuit parameters are updated iteratively using the cost function gradient. However, several works in the literature have shown that this method suffers from a phenomenon known as the Barren Plateaus (BP). This phenomenon is characterized by the exponential flattening of the cost function landscape, so that the number of times the function must be evaluated to perform the optimization grows exponentially as the number of qubits and parameterization depth increase. In this article, we report on how the use of classical neural networks in the VQAs input parameters can alleviate the BP phenomenon.
    TransBoost: Improving the Best ImageNet Performance using Deep Transduction. (arXiv:2205.13331v1 [cs.CV])
    This paper deals with deep transductive learning, and proposes TransBoost as a procedure for fine-tuning any deep neural model to improve its performance on any (unlabeled) test set provided at training time. TransBoost is inspired by a large margin principle and is efficient and simple to use. The ImageNet classification performance is consistently and significantly improved with TransBoost on many architectures such as ResNets, MobileNetV3-L, EfficientNetB0, ViT-S, and ConvNext-T. Additionally we show that TransBoost is effective on a wide variety of image classification datasets.
    BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning. (arXiv:2205.13383v1 [cs.CV])
    Deep neural networks are vulnerable to Trojan attacks. Existing attacks use visible patterns (e.g., a patch or image transformations) as triggers, which are vulnerable to human inspection. In this paper, we propose BppAttack, a stealthy and efficient Trojan attack. Based on existing biology literature on human visual systems, we propose to use image quantization and dithering as the Trojan trigger, making imperceptible changes. It is a stealthy and efficient attack without training auxiliary models. Due to the small changes made to images, it is hard to inject such triggers during training. To alleviate this problem, we propose a contrastive learning based approach that leverages adversarial attacks to generate negative sample pairs so that the learned trigger is precise and accurate. The proposed method achieves high attack success rates on four benchmark datasets, including MNIST, CIFAR-10, GTSRB, and CelebA. It also effectively bypasses existing Trojan defenses and human inspection. Our code can be found at https://github.com/RU-System-Software-and-Security/BppAttack.
    Opinion Spam Detection: A New Approach Using Machine Learning and Network-Based Algorithms. (arXiv:2205.13422v1 [cs.LG])
    E-commerce is the fastest-growing segment of the economy. Online reviews play a crucial role in helping consumers evaluate and compare products and services. As a result, fake reviews (opinion spam) are becoming more prevalent and negatively impacting customers and service providers. There are many reasons why it is hard to identify opinion spammers automatically, including the absence of reliable labeled data. This limitation precludes an off-the-shelf application of a machine learning pipeline. We propose a new method for classifying reviewers as spammers or benign, combining machine learning with a message-passing algorithm that capitalizes on the users' graph structure to compensate for the possible scarcity of labeled data. We devise a new way of sampling the labels for the training step (active learning), replacing the typical uniform sampling. Experiments on three large real-world datasets from Yelp.com show that our method outperforms state-of-the-art active learning approaches and also machine learning methods that use a much larger set of labeled data for training.
    Looking for Out-of-Distribution Environments in Critical Care: A case study with the eICU Database. (arXiv:2205.13398v1 [cs.LG])
    Generalizing to new populations and domains in machine learning is still an open problem which has seen increased interest recently. In particular, clinical models show a significant performance drop when tested in settings not seen during training, e.g., new hospitals or population demographics. Recently proposed models for domain generalisation promise to alleviate this problem by learning invariant characteristics across environments; however, there is still scepticism about whether they improve over traditional training. In this work, we take a principled approach to identifying Out of Distribution (OoD) environments, motivated by the problem of cross-hospital generalization in critical care. We propose model-based and heuristic approaches to identify OoD environments and systematically compare models with different levels of held-out information. In particular, based on the assumption that models with access to OoD data should outperform other models, we train models across a range of experimental setups that include leave-one-hospital-out training and cross-sectional feature splits. We find that access to OoD data does not translate to increased performance, pointing to inherent limitations in defining potential OoD environments in the eICU Database, potentially due to data harmonisation and sampling. Echoing similar results with other popular clinical benchmarks in the literature, our findings suggest that new approaches are required to evaluate robust models in critical care.
    Transfer and Share: Semi-Supervised Learning from Long-Tailed Data. (arXiv:2205.13358v1 [cs.LG])
    Long-Tailed Semi-Supervised Learning (LTSSL) aims to learn from class-imbalanced data where only a few samples are annotated. Existing solutions typically require substantial cost to solve complex optimization problems, or class-balanced undersampling which can result in information loss. In this paper, we present the TRAS (TRAnsfer and Share) to effectively utilize long-tailed semi-supervised data. TRAS transforms the imbalanced pseudo-label distribution of a traditional SSL model via a delicate function to enhance the supervisory signals for minority classes. It then transfers the distribution to a target model such that the minority class will receive significant attention. Interestingly, TRAS shows that more balanced pseudo-label distribution can substantially benefit minority-class training, instead of seeking to generate accurate pseudo-labels as in previous works. To simplify the approach, TRAS merges the training of the traditional SSL model and the target model into a single procedure by sharing the feature extractor, where both classifiers help improve the representation learning. According to extensive experiments, TRAS delivers much higher accuracy than state-of-the-art methods in the entire set of classes as well as minority classes.
    Feature Forgetting in Continual Representation Learning. (arXiv:2205.13359v1 [cs.LG])
    In continual and lifelong learning, good representation learning can help increase performance and reduce sample complexity when learning new tasks. There is evidence that representations do not suffer from "catastrophic forgetting" even in plain continual learning, but little else is known about their characteristics. In this paper, we aim to gain more understanding about representation learning in continual learning, especially on the feature forgetting problem. We devise a protocol for evaluating representation in continual learning, and then use it to present an overview of the basic trends of continual representation learning, showing its consistent deficiency and potential issues. To study the feature forgetting problem, we create a synthetic dataset to identify and visualize the prevalence of feature forgetting in neural networks. Finally, we propose a simple technique using gating adapters to mitigate feature forgetting. We conclude by discussing that improving representation learning benefits both old and new tasks in continual learning.
    How Powerful are K-hop Message Passing Graph Neural Networks. (arXiv:2205.13328v1 [cs.LG])
    The most popular design paradigm for Graph Neural Networks (GNNs) is 1-hop message passing -- aggregating features from 1-hop neighbors repeatedly. However, the expressive power of 1-hop message passing is bounded by the Weisfeiler-Lehman (1-WL) test. Recently, researchers extended 1-hop message passing to K-hop message passing by aggregating information from K-hop neighbors of nodes simultaneously. However, there is no work on analyzing the expressive power of K-hop message passing. In this work, we theoretically characterize the expressive power of K-hop message passing. Specifically, we first formally differentiate two kinds of kernels of K-hop message passing which are often misused in previous works. We then characterize the expressive power of K-hop message passing by showing that it is more powerful than 1-hop message passing. Despite the higher expressive power, we show that K-hop message passing still cannot distinguish some simple regular graphs. To further enhance its expressive power, we introduce a KP-GNN framework, which improves K-hop message passing by leveraging the peripheral subgraph information in each hop. We prove that KP-GNN can distinguish almost all regular graphs including some distance regular graphs which could not be distinguished by previous distance encoding methods. Experimental results verify the expressive power and effectiveness of KP-GNN. KP-GNN achieves competitive results across all benchmark datasets.
    Deep Active Learning with Noise Stability. (arXiv:2205.13340v1 [cs.LG])
    Uncertainty estimation for unlabeled data is crucial to active learning. With a deep neural network employed as the backbone model, the data selection process is highly challenging due to the potential over-confidence of the model inference. Existing methods resort to special learning fashions (e.g. adversarial) or auxiliary models to address this challenge. This tends to result in complex and inefficient pipelines, which would render the methods impractical. In this work, we propose a novel algorithm that leverages noise stability to estimate data uncertainty in a Single-Training Multi-Inference fashion. The key idea is to measure the output deviation from the original observation when the model parameters are randomly perturbed by noise. We provide theoretical analyses by leveraging the small Gaussian noise theory and demonstrate that our method favors a subset with large and diverse gradients. Despite its simplicity, our method outperforms the state-of-the-art active learning baselines in various tasks, including computer vision, natural language processing, and structural data analysis.
    QUICK-FL: Quick Unbiased Compression for Federated Learning. (arXiv:2205.13341v1 [cs.LG])
    Distributed Mean Estimation (DME) is a fundamental building block in communication efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State of the art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decodes each gradient individually, which markedly slows the aggregation time. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation time, and is competitive with the most accurate (slow aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.
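    To make the notion of unbiased quantization in DME concrete, here is a minimal sketch using generic stochastic rounding to the integer grid. This is not the QUIC-FL scheme, whose construction the abstract does not detail; the function names and the scalar per-client values are assumptions for illustration only.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x to floor(x) or floor(x) + 1 so that the expected output equals x."""
    low = math.floor(x)
    frac = x - low
    # Round up with probability equal to the fractional part.
    return low + (1 if rng.random() < frac else 0)


def estimate_mean(client_values, rng):
    """Average the independently quantized client values (illustrative server step)."""
    quantized = [stochastic_round(v, rng) for v in client_values]
    return sum(quantized) / len(quantized)


rng = random.Random(0)
# Hypothetical scalar "gradients" from four clients.
approx = estimate_mean([0.2, 1.7, -0.4, 2.5], rng)
```

    Each quantized value is unbiased on its own, so the server can average them directly; the price is the quantization variance, which corresponds to the estimation error the abstract attributes to unbiased methods.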
    Acute Lymphoblastic Leukemia Detection Using Hypercomplex-Valued Convolutional Neural Networks. (arXiv:2205.13273v1 [cs.CV])
    This paper features convolutional neural networks defined on hypercomplex algebras applied to classify lymphocytes in blood smear digital microscopic images. Such classification is helpful for the diagnosis of acute lymphoblastic leukemia (ALL), a type of blood cancer. We perform the classification task using eight hypercomplex-valued convolutional neural networks (HvCNNs) along with real-valued convolutional networks. Our results show that HvCNNs perform better than the real-valued model, showcasing higher accuracy with a much smaller number of parameters. Moreover, we found that HvCNNs based on Clifford algebras processing HSV-encoded images attained the highest observed accuracies. Precisely, our HvCNN yielded an average accuracy rate of 96.6% using the ALL-IDB2 dataset with a 50% train-test split, a value extremely close to the state-of-the-art models but using a much simpler architecture with significantly fewer parameters.
    On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition. (arXiv:2205.13282v1 [cs.CV])
    Fine-Grained Visual Categorization (FGVC) is challenging because the subtle inter-class variations are difficult to capture. One notable research line uses the Global Covariance Pooling (GCP) layer to learn powerful representations with second-order statistics, which can effectively model inter-class differences. In our previous conference paper, we show that truncating small eigenvalues of the GCP covariance can attain smoother gradient and improve the performance on large-scale benchmarks. However, on fine-grained datasets, truncating the small eigenvalues would make the model fail to converge. This observation contradicts the common assumption that the small eigenvalues merely correspond to the noisy and unimportant information, and that ignoring them should consequently have little influence on the performance. To diagnose this peculiar behavior, we propose two attribution methods whose visualizations demonstrate that the seemingly unimportant small eigenvalues are crucial as they are in charge of extracting the discriminative class-specific features. Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues. Without introducing any additional parameters, this branch simply amplifies the small eigenvalues and achieves state-of-the-art performances of GCP methods on three fine-grained benchmarks. Furthermore, the performance is also competitive against other FGVC approaches on larger datasets. Code is available at https://github.com/KingJamesSong/DifferentiableSVD.
    Fair Representation Learning through Implicit Path Alignment. (arXiv:2205.13316v1 [cs.LG])
    We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant with respect to different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer-loop, and invariant optimal group predictors are updated in the inner-loop. Moreover, the proposed bi-level objective is demonstrated to fulfill the sufficiency rule, which is desirable in various practical scenarios but was not commonly studied in fair learning. Besides, to avoid the high computational and memory cost of differentiating in the inner-loop of the bi-level objective, we propose an implicit path alignment algorithm, which only relies on the solution of inner optimization and the implicit differentiation rather than the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measurement.
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v1 [cs.LG])
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.
    Investigating classification learning curves for automatically generated and labelled plant images. (arXiv:2205.10955v2 [cs.LG] UPDATED)
    In the context of supervised machine learning, a learning curve describes how a model's performance on unseen data relates to the number of samples used to train the model. In this paper we present a dataset of plant images with representatives of crops and weeds common to the Manitoba prairies at different growth stages. We determine the learning curve for a classification task on this data with the ResNet architecture. Our results are in accordance with previous studies and add to the evidence that learning curves are governed by power-law relationships over large scales, applications, and models. We further investigate how label noise and the reduction of trainable parameters impact the learning curve on this dataset. Both effects lead to the model requiring disproportionately larger training sets to achieve the same classification performance as observed without these effects.
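    The power-law relationship mentioned above can be illustrated with a small fitting sketch. It assumes the common functional form error = a * n^(-b), which the abstract does not state explicitly, and uses synthetic numbers rather than results from the paper; `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least squares for the line log(error) = intercept + slope * log(n).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic learning curve generated from a = 2.0, b = 0.5 (not data from the paper).
train_sizes = [100, 1000, 10000]
test_errors = [2.0 * n ** -0.5 for n in train_sizes]
a, b = fit_power_law(train_sizes, test_errors)
```

    Under this form, a smaller exponent b means each reduction in error requires a disproportionately larger training set, which is one way to read the effect of label noise and reduced trainable parameters described in the abstract.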
    SARS-CoV-2 Result Interpretation based on Image Analysis of Lateral Flow Devices. (arXiv:2205.13311v1 [cs.LG])
    The widely used gene quantisation technique, Lateral Flow Device (LFD), is now commonly used to detect the presence of SARS-CoV-2. It is enabling the control and prevention of the spread of the virus. Depending on the viral load, LFDs have different sensitivity, and self-testing by lay users presents an additional challenge in interpreting the result. With the evolution of machine learning algorithms, image processing and analysis has seen unprecedented growth. In this interdisciplinary study, we employ novel image analysis methods from the computer vision and machine learning fields to study visual features of the control region of the LFD. Here, we automatically classify the result for any image containing an LFD as positive, negative or inconclusive. This will reduce the burden of human involvement on health workers and reduce perception bias.
    Triangular Contrastive Learning on Molecular Graphs. (arXiv:2205.13279v1 [cs.LG])
    Recent contrastive learning methods have been shown to be effective in various tasks, learning generalizable representations invariant to data augmentation and thereby leading to state-of-the-art performance. Given the multifaceted nature of the large unlabeled data used in self-supervised learning, while the majority of real-world downstream tasks use a single format of data, a multimodal framework that can train a single modality to learn diverse perspectives from other modalities is an important challenge. In this paper, we propose TriCL (Triangular Contrastive Learning), a universal framework for trimodal contrastive learning. TriCL takes advantage of Triangular Area Loss, a novel intermodal contrastive loss that learns the angular geometry of the embedding space through simultaneously contrasting the area of positive and negative triplets. Systematic observation of the embedding space in terms of alignment and uniformity showed that Triangular Area Loss can address the line-collapsing problem by discriminating modalities by angle. Our experimental results also demonstrate that TriCL outperforms baselines on the downstream task of molecular property prediction, which implies that the advantages of the embedding space indeed benefit the performance on downstream tasks.
    Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension. (arXiv:2205.13303v1 [stat.ML])
    While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high-dimensions have the same minimum training loss as Gaussian data with corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.
    Privacy-Preserving Wavelet Neural Network with Fully Homomorphic Encryption. (arXiv:2205.13265v1 [cs.LG])
    The main aim of Privacy-Preserving Machine Learning (PPML) is to protect the privacy and provide security to the data used in building Machine Learning models. There are various techniques in PPML such as Secure Multi-Party Computation, Differential Privacy, and Homomorphic Encryption (HE). The techniques are combined with various Machine Learning models and even Deep Learning Networks to protect the data privacy as well as the identity of the user. In this paper, we propose a fully homomorphic encrypted wavelet neural network to protect privacy and at the same time not compromise on the efficiency of the model. We tested the effectiveness of the proposed method on seven datasets taken from the finance and healthcare domains. The results show that our proposed model performs similarly to the unencrypted model.
    SymNMF-Net for The Symmetric NMF Problem. (arXiv:2205.13214v1 [cs.LG])
    Recently, many works have demonstrated that Symmetric Non-negative Matrix Factorization (SymNMF) enjoys a great superiority for various clustering tasks. Although the state-of-the-art algorithms for SymNMF perform well on synthetic data, they cannot consistently obtain satisfactory results with desirable properties and may fail on real-world tasks like clustering. Considering the flexibility and strong representation ability of the neural network, in this paper, we propose a neural network called SymNMF-Net for the Symmetric NMF problem to overcome the shortcomings of traditional optimization algorithms. Each block of SymNMF-Net is a differentiable architecture with an inversion layer, a linear layer and ReLU, which are inspired by a traditional update scheme for SymNMF. We show that the inference of each block corresponds to a single iteration of the optimization. Furthermore, we analyze the constraints of the inversion layer to ensure the output stability of the network to a certain extent. Empirical results on real-world datasets demonstrate the superiority of our SymNMF-Net and confirm the sufficiency of our theoretical analysis.
    Penalizing Proposals using Classifiers for Semi-Supervised Object Detection. (arXiv:2205.13219v1 [cs.CV])
    Obtaining gold-standard annotated data for object detection is often costly, involving human-level effort. Semi-supervised object detection algorithms solve the problem with a small amount of gold-standard labels and a large unlabelled dataset used to generate silver-standard labels. But training on the silver-standard labels does not produce good results, because they are machine-generated annotations. In this work, we design a modified loss function to train on large silver-standard annotated sets generated by a weak annotator. We include a confidence metric associated with the annotation as an additional term in the loss function, signifying the quality of the annotation. We test the effectiveness of our approach on various test sets and use numerous variations to compare the results with some of the current approaches to object detection. In comparison with the baseline where no confidence metric is used, we achieved a 4% gain in mAP with 25% labeled data and 10% gain in mAP with 50% labeled data by using the proposed confidence metric.
    Evaluating Multimodal Interactive Agents. (arXiv:2205.13274v1 [cs.LG])
    Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. https://youtu.be/YR1TngGORGQ
    Federated Split BERT for Heterogeneous Text Classification. (arXiv:2205.13299v1 [cs.CL])
    Pre-trained BERT models have achieved impressive performance in many natural language processing (NLP) tasks. However, in many real-world situations, textual data are usually decentralized over many clients and unable to be uploaded to a central server due to privacy protection and regulations. Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping the local data private. A few studies have investigated BERT in the federated learning setting, but the problem of performance loss caused by heterogeneous (e.g., non-IID) data over clients remains under-explored. To address this issue, we propose a framework, FedSplitBERT, which handles heterogeneous data and decreases the communication cost by splitting the BERT encoder layers into a local part and a global part. The local part parameters are trained by the local client only, while the global part parameters are trained by aggregating gradients of multiple clients. Due to the sheer size of BERT, we explore a quantization method to further reduce the communication cost with minimal performance loss. Our framework is ready-to-use and compatible with many existing federated learning algorithms, including FedAvg, FedProx and FedAdam. Our experiments verify the effectiveness of the proposed framework, which outperforms baseline methods by a significant margin, while FedSplitBERT with quantization can reduce the communication cost by $11.9\times$.
    The Effect of Task Ordering in Continual Learning. (arXiv:2205.13323v1 [cs.LG])
    We investigate the effect of task ordering on continual learning performance. We conduct an extensive series of empirical experiments on synthetic and naturalistic datasets and show that reordering tasks significantly affects the amount of catastrophic forgetting. Connecting to the field of curriculum learning, we show that the effect of task ordering can be exploited to modify continual learning performance, and present a simple approach for doing so. Our method computes the distance between all pairs of tasks, where distance is defined as the source task curvature of a gradient step toward the target task. Using statistically rigorous methods and sound experimental design, we show that task ordering is an important aspect of continual learning that can be modified for improved performance.
    DeepTechnome: Mitigating Unknown Bias in Deep Learning Based Assessment of CT Images. (arXiv:2205.13297v1 [eess.IV])
    Reliably detecting diseases using relevant biological information is crucial for real-world applicability of deep learning techniques in medical imaging. We debias deep learning models during training against unknown bias - without preprocessing/filtering the input beforehand or assuming specific knowledge about its distribution or precise nature in the dataset. We use control regions as surrogates that carry information regarding the bias, employ the classifier model to extract features, and suppress biased intermediate features with our custom, modular DecorreLayer. We evaluate our method on a dataset of 952 lung computed tomography scans by introducing simulated biases w.r.t. reconstruction kernel and noise level and propose including an adversarial test set in evaluations of bias reduction techniques. In a moderately sized model architecture, applying the proposed method to learn from data exhibiting a strong bias, it near-perfectly recovers the classification performance observed when training with corresponding unbiased data.
    DT-SV: A Transformer-based Time-domain Approach for Speaker Verification. (arXiv:2205.13249v1 [cs.SD])
    Speaker verification (SV) aims to determine whether the speaker's identity of a test utterance is the same as the reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone mainstream. Recently, different attention mechanisms and Transformer networks have been explored widely in SV fields. However, utilizing the original Transformer in SV directly may have frame-level information waste on output features, which could lead to restrictions on capacity and discrimination of speaker embeddings. Therefore, we propose an approach to derive utterance-level speaker embeddings via a Transformer architecture that uses a novel loss function named diffluence loss to integrate the feature information of different Transformer layers. Therein, the diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it could be integrated into the Transformer expediently. Besides, we also introduce a learnable mel-fbank energy feature extractor named time-domain feature extractor that computes the mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining Diffluence loss and Time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that our proposed model can achieve better performance in comparison with other models.
    Collaborative Distillation Meta Learning for Simulation Intensive Hardware Design. (arXiv:2205.13225v1 [cs.LG])
    This paper proposes a novel collaborative distillation meta learning (CDML) framework for simulation intensive hardware design problems. Deep reinforcement learning (DRL) has shown promising performance in various hardware design problems. However, previous works on DRL-based hardware design only dealt with problems with simplified objectives, which are not practical. In fact, the objective evaluation of real-world electrical performance through simulation is costly in terms of both time and computation, making DRL schemes that involve extensive reward calculations unsuitable. In this paper, we apply the CDML framework to the decoupling capacitor placement problem (DPP), one of the significant simulation intensive hardware design problems. The CDML framework consists of a context-based meta learner and a collaborative distillation scheme to produce a reusable solver. The context-based meta learner captures the location of the probing port (i.e., target circuit block) and improves generalization capability. The collaborative distillation scheme with equivariant label transformation imposes the action-permutation (AP)-equivariant nature of placement problems, which not only improves sample efficiency but also improves generalization capability. Extensive experimental results verified that our CDML outperforms both neural baselines and iterative conventional design methods in terms of real-world objective, power integrity, with zero-shot transfer-ability.
    A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. (arXiv:2205.13218v1 [cs.LG])
    Real-world applications require the classification model to adapt to new classes without forgetting old ones. Correspondingly, Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. Typical CIL methods tend to save representative exemplars from former classes to resist forgetting, while recent works find that storing models from history can substantially boost the performance. However, the stored models are not counted into the memory budget, which implicitly results in unfair comparisons. We find that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work, especially for the case with limited memory budgets. As a result, we need to holistically evaluate different CIL methods at different memory scales and simultaneously consider accuracy and memory size for measurement. On the other hand, we dive deeply into the construction of the memory buffer for memory efficiency. By analyzing the effect of different layers in the network, we find that shallow and deep layers have different characteristics in CIL. Motivated by this, we propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel. MEMO extends specialized layers based on the shared generalized representations, efficiently extracting diverse representations with modest cost and maintaining representative exemplars. Extensive experiments on benchmark datasets validate MEMO's competitive performance.
    Continual Feature Selection: Spurious Features in Continual Learning. (arXiv:2203.01012v2 [cs.LG] UPDATED)
    Continual Learning (CL) is the research field addressing learning without forgetting when the data distribution is not static. This paper studies spurious features' influence on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features and (2) local spurious features. The first one is due to a covariate shift between training and testing data, while the second is due to the limited access to data at each training step. We study (1) through a consistent set of continual learning experiments varying spurious correlation amount and data distribution support. We show that (2) is a major cause of performance decrease in continual learning along with catastrophic forgetting. This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features on algorithms' capabilities.
    QSpeech: Low-Qubit Quantum Speech Application Toolkit. (arXiv:2205.13221v1 [quant-ph])
    Quantum devices with few qubits are common in the Noisy Intermediate-Scale Quantum (NISQ) era. However, running a Quantum Neural Network (QNN) on low-qubit quantum devices is difficult, since it is based on a Variational Quantum Circuit (VQC), which requires many qubits. Therefore, it is critical to make QNNs with VQC run on low-qubit quantum devices. In this study, we propose a novel VQC called the low-qubit VQC. A VQC requires a number of qubits that grows with the input dimension; however, the low-qubit VQC with linear transformation can relax this condition. Thus, it allows the QNN to run on low-qubit quantum devices for speech applications. Furthermore, compared to the VQC, our proposed low-qubit VQC makes the training process more stable. Based on the low-qubit VQC, we implement QSpeech, a library for quick prototyping of hybrid quantum-classical neural networks in the speech field. It has numerous quantum neural layers and QNN models for speech applications. Experiments on Speech Command Recognition and Text-to-Speech show that our proposed low-qubit VQC outperforms VQC and is more stable.
    Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. (arXiv:2205.11508v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone: outperforming supervised methods in many modalities\dots the theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
    Mask-based Latent Reconstruction for Reinforcement Learning. (arXiv:2201.12096v2 [cs.LG] UPDATED)
    For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional input prevent effective representation learning. To address this, motivated by the success of masked modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict the complete state representations in the latent space from the observations with spatially and temporally masked pixels. MLR enables the better use of context information when learning state representations to make them more informative, which facilitates RL agent training. Extensive experiments show that our MLR significantly improves the sample efficiency in RL and outperforms the state-of-the-art sample-efficient RL methods on multiple continuous and discrete control benchmarks. The code will be released soon.
    The Shapley Value in Machine Learning. (arXiv:2202.05594v2 [cs.LG] UPDATED)
    Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.
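    As a concrete reminder of the underlying definition, the Shapley value of player $i$ in a game with value function $v$ over $n$ players is $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,[v(S \cup \{i\}) - v(S)]$. The sketch below evaluates this formula exactly by enumerating coalitions. The three-player "glove game" used as input is a standard textbook example chosen here for illustration, not one taken from the paper, and the enumeration is exponential in $n$, so it only suits very small games.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the others."""
    n = len(players)
    phi = {}
    for player in players:
        others = [q for q in players if q != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value(coalition | {player}) - value(coalition)
                total += weight * marginal
        phi[player] = total
    return phi


def glove_game(coalition):
    """Worth 1 if the coalition holds the left glove and a right glove."""
    has_left = "L" in coalition
    has_right = "R1" in coalition or "R2" in coalition
    return 1.0 if has_left and has_right else 0.0


phi = shapley_values(["L", "R1", "R2"], glove_game)
```

    The values sum to the worth of the grand coalition (the efficiency axiom), and the two interchangeable right-glove holders receive equal shares (the symmetry axiom).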
    Denial-of-Service Attacks on Learned Image Compression. (arXiv:2205.13253v1 [cs.CV])
    Deep learning techniques have shown promising results in image compression, with competitive bitrate and image reconstruction quality from compressed latent. However, while image compression has progressed towards higher peak signal-to-noise ratio (PSNR) and fewer bits per pixel (bpp), their robustness to corner-case images has never received deliberation. In this work, we, for the first time, investigate the robustness of image compression systems where imperceptible perturbation of input images can precipitate a significant increase in the bitrate of their compressed latent. To characterize the robustness of state-of-the-art learned image compression, we mount white and black-box attacks. Our results on several image compression models with various bitrate qualities show that they are surprisingly fragile, where the white-box attack achieves up to 56.326x and black-box 1.947x bpp change. To improve robustness, we propose a novel model which incorporates attention modules and a basic factorized entropy model, resulting in a promising trade-off between the PSNR/bpp ratio and robustness to adversarial attacks that surpasses existing learned image compressors.
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v1 [cs.LG])
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which generalizes active learning based on partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over number of samples. We illustrate our technique in depth for robust regression.
    Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain. (arXiv:2112.04684v3 [cs.RO] UPDATED)
    We present a reward-predictive, model-based deep learning method featuring trajectory-constrained visual attention for local planning in visual navigation tasks. Our method learns to place visual attention at locations in latent image space which follow trajectories caused by vehicle control actions to enhance predictive accuracy during planning. The attention model is jointly optimized by the task-specific loss and an additional trajectory-constraint loss, allowing adaptability yet encouraging a regularized structure for improved generalization and reliability. Importantly, visual attention is applied in latent feature map space instead of raw image space to promote efficient planning. We validated our model in visual navigation tasks of planning low turbulence, collision-free trajectories in off-road settings and hill climbing with locking differentials in the presence of slippery terrain. Experiments involved randomized procedural generated simulation and real-world environments. We found our method improved generalization and learning efficiency when compared to no-attention and self-attention alternatives.
    Orthogonal Stochastic Configuration Networks with Adaptive Construction Parameter for Data Analytics. (arXiv:2205.13191v1 [cs.LG])
    As a randomized learner model, SCNs are remarkable in that the random weights and biases are assigned using a supervisory mechanism to ensure universal approximation and fast learning. However, the randomness makes SCNs more likely to generate approximately linearly correlated nodes that are redundant and of low quality, thereby resulting in a non-compact network structure. In light of a fundamental principle in machine learning, namely that a model with fewer parameters generalizes better, this paper proposes orthogonal SCN, termed OSCN, to filter out the low-quality hidden nodes for network structure reduction by incorporating the Gram-Schmidt orthogonalization technique. The universal approximation property of OSCN and an adaptive setting for the key construction parameters are presented in detail. In addition, an incremental updating scheme is developed to dynamically determine the output weights, contributing to improved computational efficiency. Finally, experimental results on two numerical examples and several real-world regression and classification datasets substantiate the effectiveness and feasibility of the proposed approach.
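    The idea of using Gram-Schmidt orthogonalization to discard nearly redundant hidden nodes can be illustrated with a small sketch. It treats each candidate node as the vector of its outputs on the training set and keeps a node only if a non-negligible component remains after projecting out the nodes already kept. The fixed tolerance `tol` and the function name are assumptions for illustration; the paper's supervisory mechanism and adaptive construction parameter are not reproduced here.

```python
import math


def filter_nodes(columns, tol=1e-8):
    """Return indices of node output vectors that add a new direction.

    Uses modified Gram-Schmidt: each candidate is projected against the
    orthonormal basis built from the nodes kept so far, and is dropped
    when the norm of the remaining residual is at most `tol`.
    """
    basis = []
    kept = []
    for index, column in enumerate(columns):
        residual = list(column)
        for q in basis:
            coeff = sum(r * qi for r, qi in zip(residual, q))
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        norm = math.sqrt(sum(r * r for r in residual))
        if norm > tol:
            basis.append([r / norm for r in residual])
            kept.append(index)
    return kept


# Hypothetical hidden-node outputs on three samples; the third node is a
# linear combination of the first two and is therefore redundant.
node_outputs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept = filter_nodes(node_outputs)
```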
    Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers. (arXiv:2107.03996v3 [cs.LG] UPDATED)
    We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot. Our project page with videos is at https://rchalyang.github.io/LocoTransformer/ .
    $O(N^2)$ Universal Antisymmetry in Fermionic Neural Networks. (arXiv:2205.13205v1 [cs.LG])
    Fermionic neural network (FermiNet) is a recently proposed wavefunction Ansatz, which is used in variational Monte Carlo (VMC) methods to solve the many-electron Schr\"odinger equation. FermiNet proposes permutation-equivariant architectures, on which a Slater determinant is applied to induce antisymmetry. FermiNet is proved to have universal approximation capability with a single determinant, namely, it suffices to represent any antisymmetric function given sufficient parameters. However, the asymptotic computational bottleneck comes from the Slater determinant, which scales with $O(N^3)$ for $N$ electrons. In this paper, we substitute the Slater determinant with a pairwise antisymmetry construction, which is easy to implement and can reduce the computational cost to $O(N^2)$. Furthermore, we formally prove that the pairwise construction built upon permutation-equivariant architectures can universally represent any antisymmetric function.
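    To see why a pairwise construction can be antisymmetric at $O(N^2)$ cost, consider the generic product $\psi(x_1,\dots,x_N) = \prod_{i<j} F(x_i, x_j)$ with $F(a,b) = -F(b,a)$: exchanging two particles flips the sign of exactly an odd number of factors. The sketch below checks this numerically. The specific pair function and the use of scalar coordinates are assumptions made for illustration; this is not claimed to be the construction used in the paper.

```python
import math


def pair_function(a, b):
    """Hypothetical antisymmetric pair function: F(a, b) == -F(b, a)."""
    return (a - b) * math.exp(-(a * a + b * b))


def pairwise_wavefunction(xs):
    """Product over all pairs i < j, i.e. N * (N - 1) / 2 factors."""
    value = 1.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            value *= pair_function(xs[i], xs[j])
    return value


coords = [0.3, -1.2, 0.7, 2.0]
swapped = [0.7, -1.2, 0.3, 2.0]  # particles 0 and 2 exchanged

psi = pairwise_wavefunction(coords)
psi_swapped = pairwise_wavefunction(swapped)
```

    A determinant of an $N \times N$ matrix costs $O(N^3)$ with standard elimination, whereas the double loop above touches each pair once.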
    On Learning Mixture of Linear Regressions in the Non-Realizable Setting. (arXiv:2205.13166v1 [stat.ML])
    While mixture of linear regressions (MLR) is a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, {\em prediction} and {\em loss} are not well-defined in the context of mixtures. In this paper, we first show that MLR can be used for prediction where instead of predicting a label, the model predicts a list of values (also known as {\em list-decoding}). The list size is equal to the number of components in the mixture, and the loss function is defined to be the minimum among the losses incurred by all the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves small probability of prediction error. This begs for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works in MLR focus on the {\em realizable} setting, i.e., recovery of parameters when data is probabilistically generated by a mixed linear (noisy) model. In this paper we show that a version of the popular alternating minimization (AM) algorithm finds the best fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in polynomial time in the number of datapoints, and recovers a good approximation of the best fit lines. The two algorithms are experimentally compared.
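    The min-over-components loss and a basic alternating minimization loop can be sketched as follows for scalar inputs. This is a textbook-style version (assign each point to its best-fitting line, then refit each line by least squares), written under the assumption of two noiseless lines and a reasonable initialization; it is not claimed to match the exact variant or the regularity conditions analysed in the paper.

```python
def min_loss(x, y, lines):
    """List-decoding loss: smallest squared error over component lines."""
    return min((y - (a * x + b)) ** 2 for a, b in lines)


def fit_line(points):
    """Ordinary least squares for y = a * x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        return 0.0, mean_y  # degenerate group: fall back to a flat line
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def alternating_minimization(data, lines, iters=20):
    """Alternate between assigning points to lines and refitting lines."""
    for _ in range(iters):
        groups = [[] for _ in lines]
        for x, y in data:
            errors = [(y - (a * x + b)) ** 2 for a, b in lines]
            groups[errors.index(min(errors))].append((x, y))
        # Keep the previous line when no point was assigned to it.
        lines = [fit_line(g) if g else old for g, old in zip(groups, lines)]
    return lines


# Synthetic data from two lines: y = 2x + 1 and y = -x.
data = [(float(x), 2.0 * x + 1.0) for x in range(5)]
data += [(float(x), -1.0 * x) for x in range(5)]

fitted = alternating_minimization(data, lines=[(1.5, 0.5), (-0.5, 0.0)])
empirical_risk = sum(min_loss(x, y, fitted) for x, y in data) / len(data)
```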
    AI for Porosity and Permeability Prediction from Geologic Core X-Ray Micro-Tomography. (arXiv:2205.13189v1 [cs.LG])
    Geologic cores are rock samples that are extracted from deep under the ground during the well drilling process. They are used for petroleum reservoirs' performance characterization. Traditionally, physical studies of cores are carried out by means of time-consuming manual experiments. With the development of deep learning, scientists actively started working on developing machine-learning-based approaches to identify physical properties without any manual experiments. Several previous works used machine learning to determine the porosity and permeability of the rocks, but these methods were either inaccurate or computationally expensive. We propose to use self-supervised pretraining of a very small CNN-transformer-based model to predict the physical properties of the rocks with high accuracy in a time-efficient manner. We show that this technique prevents overfitting even for extremely small datasets.
    RENs: Relevance Encoding Networks. (arXiv:2205.13061v1 [cs.LG])
    The manifold assumption for high-dimensional data assumes that the data is generated by varying a set of parameters obtained from a low-dimensional latent space. Deep generative models (DGMs) are widely used to learn data representations in an unsupervised way. DGMs parameterize the underlying low-dimensional manifold in the data space using bottleneck architectures such as variational autoencoders (VAEs). The bottleneck dimension for VAEs is treated as a hyperparameter that depends on the dataset and is fixed at design time after extensive tuning. As the intrinsic dimensionality of most real-world datasets is unknown, often, there is a mismatch between the intrinsic dimensionality and the latent dimensionality chosen as a hyperparameter. This mismatch can negatively contribute to the model performance for representation learning and sample generation tasks. This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality. The relevance of each latent dimension is directly learned from the data along with the other model parameters using stochastic gradient descent and a reparameterization trick adapted to non-Gaussian priors. We leverage the concept of DeepSets to capture permutation invariant statistical properties in both data and latent spaces for relevance determination. The proposed framework is general and flexible and can be used for the state-of-the-art VAE models that leverage regularizers to impose specific characteristics in the latent space (e.g., disentanglement). With extensive experimentation on synthetic and public image datasets, we show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
    Evolutionary scheduling of university activities based on consumption forecasts to minimise electricity costs. (arXiv:2202.12595v2 [cs.LG] UPDATED)
    This paper presents a solution to a predict-then-optimise problem whose goal is to reduce the electricity cost of a university campus. The proposed methodology combines a multi-dimensional time series forecast and a novel approach to large-scale optimization. A gradient-boosting method is applied to forecast both generation and consumption time-series of the Monash University campus for the month of November 2020. For the consumption forecasts we employ log transformation to model trend and stabilize variance. Additional seasonality and trend features are added to the model inputs when applicable. The forecasts obtained are used as the base load for the schedule optimisation of university activities and battery usage. The goal of the optimisation is to minimize the electricity cost consisting of the price of electricity and the peak electricity tariff both altered by the load from class activities and battery use as well as the penalty of not scheduling some optional activities. The schedule of the class activities is obtained through evolutionary optimisation using the covariance matrix adaptation evolution strategy and the genetic algorithm. This schedule is then improved through local search by testing possible times for each activity one-by-one. The battery schedule is formulated as a mixed-integer programming problem and solved by the Gurobi solver. This method obtains the second lowest cost when evaluated against 6 other methods presented at an IEEE competition that all used mixed-integer programming and the Gurobi solver to schedule both the activities and the battery use. The code and data used for the paper are publicly available.
    Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. (arXiv:2205.13134v1 [cs.AI])
    Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, featuring the one that maps to the arithmetic expression of underlying physics. Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated by numerical examples, compared with state-of-the-art baselines.
    On the Evolution of A.I. and Machine Learning: Towards Measuring and Understanding Impact, Influence, and Leadership at Premier A.I. Conferences. (arXiv:2205.13131v1 [cs.AI])
    Artificial Intelligence is now recognized as a general-purpose technology with ample impact on human life. In this work, we aim to understand the evolution of AI and Machine learning over the years by analyzing researchers' impact, influence, and leadership over the last decades. This work also intends to shed new light on the history and evolution of AI by exploring the dynamics involved in the field's evolution through the lens of the papers published at AI conferences since the first International Joint Conference on Artificial Intelligence (IJCAI) in 1969. AI development and evolution have led to increasing research output, reflected in the number of articles published over the last sixty years. We construct comprehensive citation-collaboration and paper-author datasets and compute corresponding centrality measures to carry out our analyses. These analyses allow a better understanding of how AI has reached its current state of affairs in research. Throughout the process, we correlate these datasets with the work of the ACM Turing Award winners and the so-called two AI winters the field has gone through. We also look at self-citation trends and new authors' behaviors. Finally, we present a novel way to infer the country of affiliation of a paper from its organization. Therefore, this work provides a deep analysis of Artificial Intelligence history from information gathered and analyzed from large technical venues datasets and suggests novel insights that can contribute to understanding and measuring AI's evolution.
    Cost-efficient Gaussian Tensor Network Embeddings for Tensor-structured Inputs. (arXiv:2205.13163v1 [math.NA])
    This work discusses tensor network embeddings, which are random matrices ($S$) with tensor network structure. These embeddings have been used to perform dimensionality reduction of tensor network structured inputs $x$ and accelerate applications such as tensor decomposition and kernel regression. Existing works have designed embeddings for inputs $x$ with specific structures, such that the computational cost for calculating $Sx$ is efficient. We provide a systematic way to design tensor network embeddings consisting of Gaussian random tensors, such that for inputs with more general tensor network structures, both the sketch size (row size of $S$) and the sketching computational cost are low. We analyze general tensor network embeddings that can be reduced to a sequence of sketching matrices. We provide a sufficient condition to quantify the accuracy of such embeddings and derive sketching asymptotic cost lower bounds using embeddings that satisfy this condition and have a sketch size lower than any input dimension. We then provide an algorithm to efficiently sketch input data using such embeddings. The sketch size of the embedding used in the algorithm has a linear dependence on the number of sketching dimensions of the input. Assuming tensor contractions are performed with classical dense matrix multiplication algorithms, this algorithm achieves asymptotic cost within a factor of $O(\sqrt{m})$ of our cost lower bound, where $m$ is the sketch size. Further, when each tensor in the input has a dimension that needs to be sketched, this algorithm yields the optimal sketching asymptotic cost. We apply our sketching analysis to inexact tensor decomposition optimization algorithms. We provide a sketching algorithm for CP decomposition that is asymptotically faster than existing work in multiple regimes, and show optimality of an existing algorithm for tensor train rounding.
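    One of the simplest structured cases helps show why tensor-structured sketches can be cheap: when $S = S_1 \otimes S_2$ and $x = u \otimes v$, the identity $(S_1 \otimes S_2)(u \otimes v) = (S_1 u) \otimes (S_2 v)$ lets $Sx$ be computed without ever forming $S$ or $x$. The sketch below checks this identity numerically with Gaussian factors. The Kronecker structure, the sizes, and the omission of any variance normalization are assumptions for illustration; the paper treats more general tensor network structures.

```python
import random


def matvec(matrix, vec):
    """Dense matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def kron_vec(u, v):
    """Kronecker product of two vectors."""
    return [ui * vj for ui in u for vj in v]


def kron_mat(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def gaussian_matrix(rows, cols, rng):
    """Matrix with independent standard normal entries (unnormalized)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
s1 = gaussian_matrix(2, 4, rng)
s2 = gaussian_matrix(3, 5, rng)
u = [rng.random() for _ in range(4)]
v = [rng.random() for _ in range(5)]

# Naive route: build the 6 x 20 embedding and the length-20 input.
dense = matvec(kron_mat(s1, s2), kron_vec(u, v))
# Structured route: sketch each factor separately, then combine.
structured = kron_vec(matvec(s1, u), matvec(s2, v))
```

    The structured route multiplies a 2 x 4 and a 3 x 5 matrix by short vectors instead of a 6 x 20 matrix by a length-20 vector, and the gap widens quickly as more factors are added.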
    Reliably-stabilizing piecewise-affine neural network controllers. (arXiv:2111.07183v3 [eess.SY] UPDATED)
    A common problem affecting neural network (NN) approximations of model predictive control (MPC) policies is the lack of analytical tools to assess the stability of the closed-loop system under the action of the NN-based controller. We present a general procedure to quantify the performance of such a controller, or to design minimum complexity NNs with rectified linear units (ReLUs) that preserve the desirable properties of a given MPC scheme. By quantifying the approximation error between NN-based and MPC-based state-to-input mappings, we first establish suitable conditions involving two key quantities, the worst-case error and the Lipschitz constant, guaranteeing the stability of the closed-loop system. We then develop an offline, mixed-integer optimization-based method to compute those quantities exactly. Together these techniques provide conditions sufficient to certify the stability and performance of a ReLU-based approximation of an MPC control law.
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v3 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces \textit{The Neural Testbed}: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    Deep Generative Modeling for Volume Reconstruction in Cryo-Electron Microscopy. (arXiv:2201.02867v3 [eess.IV] UPDATED)
    Recent breakthroughs in high-resolution imaging of biomolecules in solution with cryo-electron microscopy (cryo-EM) have unlocked new doors for the reconstruction of molecular volumes, thereby promising further advances in biology, chemistry, and pharmacological research. Recent next-generation volume reconstruction algorithms that combine generative modeling with end-to-end unsupervised deep learning techniques have shown promising preliminary results, but still face considerable technical and theoretical hurdles when applied to experimental cryo-EM images. In light of the proliferation of such methods, we propose here a critical review of recent advances in the field of deep generative modeling for cryo-EM volume reconstruction. The present review aims to (i) unify and compare these new methods using a consistent statistical framework, (ii) present them using a terminology familiar to machine learning researchers and computational biologists with no specific background in cryo-EM, and (iii) provide the necessary perspective on current advances to highlight their relative strengths and weaknesses, along with outstanding bottlenecks and avenues for improvements in the field. This review might also raise the interest of computer vision practitioners, as it highlights significant limits of deep generative models in low signal-to-noise regimes -- therefore emphasizing a need for new theoretical and methodological developments.
    Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions. (arXiv:2205.13056v1 [stat.ML])
    Due to the drastic gap in complexity between sequential and batch statistical learning, recent work has studied a smoothed sequential learning setting, where Nature is constrained to select contexts with density bounded by 1/{\sigma} with respect to a known measure {\mu}. Unfortunately, for some function classes, there is an exponential gap between the statistically optimal regret and that which can be achieved efficiently. In this paper, we give a computationally efficient algorithm that is the first to enjoy the statistically optimal log(T/{\sigma}) regret for realizable K-wise linear classification. We extend our results to settings where the true classifier is linear in an over-parameterized polynomial featurization of the contexts, as well as to a realizable piecewise-regression setting assuming access to an appropriate ERM oracle. Somewhat surprisingly, standard disagreement-based analyses are insufficient to achieve regret logarithmic in 1/{\sigma}. Instead, we develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers. Along the way, we develop numerous technical tools of independent interest, including a general anti-concentration bound for the determinant of certain matrix averages.
    Matryoshka Representations for Adaptive Deployment. (arXiv:2205.13147v1 [cs.LG])
    Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context, rigid, fixed-capacity representations can be either over- or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL), which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility within the learned Matryoshka Representations offers: (a) up to 14x smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet), vision + language (ALIGN) and language (BERT). MRL code and pretrained models are open-sourced at https://github.com/RAIVNLab/MRL.
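    The idea of a single embedding that serves several compute budgets can be illustrated with a minimal sketch. It assumes, as an illustration only, that the coarse granularities live in the leading coordinates of the vector, so a cheaper representation is obtained by keeping a prefix and renormalizing. The vectors and the helper names (`truncate_and_normalize`, `nearest`) are hypothetical and are not taken from the MRL code base.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(v * v for v in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [v / norm for v in prefix]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def nearest(query, database, dim):
    """Index of the database vector closest to `query` using only `dim` coordinates."""
    q = truncate_and_normalize(query, dim)
    scores = [
        cosine_similarity(q, truncate_and_normalize(v, dim)) for v in database
    ]
    return max(range(len(database)), key=scores.__getitem__)


# Hypothetical 8-dimensional embeddings, made up for this example.
database = [
    [0.9, 0.1, 0.0, 0.0, 0.05, 0.01, 0.0, 0.02],
    [0.1, 0.9, 0.0, 0.1, 0.00, 0.03, 0.01, 0.0],
    [0.0, 0.1, 0.9, 0.0, 0.02, 0.00, 0.04, 0.01],
]
query = [0.8, 0.2, 0.1, 0.0, 0.0, 0.02, 0.0, 0.01]

coarse_match = nearest(query, database, dim=2)  # cheap, low-dimensional search
full_match = nearest(query, database, dim=8)    # full-dimensional search
print(coarse_match, full_match)
```

    In this toy setting the cheap 2-dimensional search and the full 8-dimensional search can be compared directly; whether they agree on real data depends on how the representation was trained.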
    Mitigating Memorization of Noisy Labels via Regularization between Representations. (arXiv:2110.09022v3 [cs.LG] UPDATED)
    Designing robust loss functions is popular in learning with noisy labels while existing designs did not explicitly consider the overfitting property of deep neural networks (DNNs). As a result, applying these losses may still suffer from overfitting/memorizing noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, it is non-trivial to design a neural network with the best capacity given an arbitrary task. To circumvent this dilemma, instead of changing the model architecture, we decouple DNNs into an encoder followed by a linear classifier and propose to restrict the function space of a DNN by a representation regularizer. Particularly, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims. Code is available at github.com/UCSC-REAL/SelfSup_NoisyLabel.
    Joint Synthesis of Safety Certificate and Safe Control Policy using Constrained Reinforcement Learning. (arXiv:2111.07695v3 [cs.LG] UPDATED)
    Safety is the major consideration in controlling complex dynamical systems using reinforcement learning (RL), where the safety certificate can provide a provable safety guarantee. A valid safety certificate is an energy function indicating that safe states have low energy, and there exists a corresponding safe control policy that allows the energy function to always dissipate. The safety certificate and the safe control policy are closely related to each other and both challenging to synthesize. Therefore, existing learning-based studies treat either of them as prior knowledge to learn the other, which limits their applicability with general unknown dynamics. This paper proposes a novel approach that simultaneously synthesizes the energy-function-based safety certificate and learns the safe control policy with constrained reinforcement learning (CRL). We do not rely on prior knowledge about either an available model-based controller or a perfect safety certificate. In particular, we formulate a loss function to optimize the safety certificate parameters by minimizing the occurrence of energy increases. By adding this optimization procedure as an outer loop to the Lagrangian-based CRL, we jointly update the policy and safety certificate parameters and prove that they will converge to their respective local optima, the optimal safe policy and a valid safety certificate. We evaluate our algorithms on multiple safety-critical benchmark environments. The results show that the proposed algorithm learns provably safe policies with no constraint violation. The validity or feasibility of the synthesized safety certificate is also verified numerically.
    GraphPMU: Event Clustering via Graph Representation Learning Using Locationally-Scarce Distribution-Level Fundamental and Harmonic PMU Measurements. (arXiv:2205.13116v1 [cs.LG])
    This paper is concerned with the complex task of identifying the type and cause of the events that are captured by distribution-level phasor measurement units (D-PMUs) in order to enhance situational awareness in power distribution systems. Our goal is to address two fundamental challenges in this field: a) scarcity in measurement locations due to the high cost of purchasing, installing, and streaming data from D-PMUs; b) limited prior knowledge about the event signatures due to the fact that the events are diverse, infrequent, and inherently unscheduled. To tackle these challenges, we propose an unsupervised graph-representation learning method, called GraphPMU, to significantly improve the performance in event clustering under locationally-scarce data availability by proposing the following two new directions: 1) using the topological information about the relative location of the few available phasor measurement units on the graph of the power distribution network; 2) utilizing not only the commonly used fundamental phasor measurements, but also the less explored harmonic phasor measurements in the process of analyzing the signatures of various events. Through a detailed analysis of several case studies, we show that GraphPMU can substantially outperform the prevalent methods in the literature.
    Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments. (arXiv:2205.13044v1 [cs.LG])
    We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amount of changes of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
    Understanding Metrics for Paraphrasing. (arXiv:2205.13119v1 [cs.CL])
    Paraphrase generation is a difficult problem. This is not only because of the limitations in text generation capabilities but also due to the lack of a proper definition of what qualifies as a paraphrase and of corresponding metrics to measure how good it is. Metrics for the evaluation of paraphrasing quality are an ongoing research problem. Most of the existing metrics in use, having been borrowed from other tasks, do not capture the complete essence of a good paraphrase and often fail in borderline cases. In this work, we propose a novel metric $ROUGE_P$ to measure the quality of paraphrases along the dimensions of adequacy, novelty and fluency. We also provide empirical evidence to show that the current natural language generation metrics are insufficient to measure these desired properties of a good paraphrase. We look at paraphrase model fine-tuning and generation through the lens of metrics to gain a deeper understanding of what it takes to generate and evaluate a good paraphrase.
    A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens. (arXiv:2107.07875v2 [stat.ML] UPDATED)
    A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
    Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model. (arXiv:2205.13085v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.
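    The heteroscedastic noise model above can be made concrete with a small simulation. The sketch below is not GRCI: it assumes the functions $m$ and $\sigma$ are known, whereas the method in the abstract has to work without them. The particular choices of `m`, `sigma`, and the noise distribution are illustrative assumptions. It only shows that, once $m$ and $\sigma$ are available, the error term follows from $\varepsilon = (Y - m(X)) / \sigma(X)$.

```python
import math
import random


def m(x):
    """Assumed conditional mean (an arbitrary non-linear choice for illustration)."""
    return math.sin(x)


def sigma(x):
    """Assumed strictly positive scale function (also an illustrative choice)."""
    return 0.5 + x * x


def simulate(n, seed=0):
    """Draw (x, y, eps) triples from Y = m(X) + eps * sigma(X)."""
    rng = random.Random(seed)
    # Scaling a standard normal by sqrt(pi/2) gives E|eps| = 1, so that
    # sigma(X) is the conditional mean absolute deviation of Y around m(X).
    scale = math.sqrt(math.pi / 2.0)
    samples = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        eps = rng.gauss(0.0, 1.0) * scale
        y = m(x) + eps * sigma(x)
        samples.append((x, y, eps))
    return samples


def recover_error(x, y):
    """Invert the model for the error term, given known m and sigma."""
    return (y - m(x)) / sigma(x)


samples = simulate(1000)
max_gap = max(abs(recover_error(x, y) - eps) for x, y, eps in samples)
print(f"largest recovery error: {max_gap:.2e}")
```

    The recovered values match the simulated error terms up to floating-point rounding, which is the quantity a root-cause analysis of this kind would then inspect per sample.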
    Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (arXiv:2205.13098v1 [cs.LG])
    The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.
    Cali3F: Calibrated Fast Fair Federated Recommendation System. (arXiv:2205.13121v1 [cs.IR])
    The increasingly stringent regulations on privacy protection have sparked interest in federated learning. As a distributed machine learning framework, it bridges isolated data islands by training a global model over devices while keeping data localized. Specific to recommendation systems, many federated recommendation algorithms have been proposed to realize privacy-preserving collaborative recommendation. However, several constraints remain largely unexplored. One big concern is how to ensure fairness between participants of federated learning, that is, to maintain the uniformity of recommendation performance across devices. On the other hand, due to data heterogeneity and limited networks, additional challenges occur in the convergence speed. To address these problems, in this paper, we first propose a personalized federated recommendation system training algorithm to improve the recommendation performance fairness. Then we adopt a clustering-based aggregation method to accelerate the training process. Combining the two components, we propose Cali3F, a calibrated fast and fair federated recommendation framework. Cali3F not only addresses the convergence problem by a within-cluster parameter sharing approach but also significantly boosts fairness by calibrating local models with the global model. We demonstrate the performance of Cali3F across standard benchmark datasets and explore the efficacy in comparison to traditional aggregation approaches.
    QGNN: Value Function Factorisation with Graph Neural Networks. (arXiv:2205.13005v1 [cs.LG])
    In multi-agent reinforcement learning, the use of a global objective is a powerful tool for incentivising cooperation. Unfortunately, it is not sample-efficient to train individual agents with a global reward, because it does not necessarily correlate with an agent's individual actions. This problem can be solved by factorising the global value function into local value functions. Early work in this domain performed factorisation by conditioning local value functions purely on local information. Recently, it has been shown that providing both local information and an encoding of the global state can promote cooperative behaviour. In this paper we propose QGNN, the first value factorisation method to use a graph neural network (GNN) based model. The multi-layer message passing architecture of QGNN provides more representational complexity than models in prior work, allowing it to produce a more effective factorisation. QGNN also introduces a permutation invariant mixer which is able to match the performance of other methods, even with significantly fewer parameters. We evaluate our method against several baselines, including QMIX-Att, GraphMIX, QMIX, VDN, and hybrid architectures. Our experiments include Starcraft, the standard benchmark for credit assignment; Estimate Game, a custom environment that explicitly models inter-agent dependencies; and Coalition Structure Generation, a foundational problem with real-world applications. The results show that QGNN outperforms state-of-the-art value factorisation baselines consistently.
    Urban Rhapsody: Large-scale exploration of urban soundscapes. (arXiv:2205.13064v1 [cs.CY])
    Noise is one of the primary quality-of-life issues in urban environments. In addition to annoyance, noise negatively impacts public health and educational performance. While low-cost sensors can be deployed to monitor ambient noise levels at high temporal resolutions, the amount of data they produce and the complexity of these data pose significant analytical challenges. One way to address these challenges is through machine listening techniques, which are used to extract features in attempts to classify the source of noise and understand temporal patterns of a city's noise situation. However, the overwhelming number of noise sources in the urban environment and the scarcity of labeled data make it nearly impossible to create classification models with large enough vocabularies that capture the true dynamism of urban soundscapes. In this paper, we first identify a set of requirements in the yet unexplored domain of urban soundscape exploration. To satisfy the requirements and tackle the identified challenges, we propose Urban Rhapsody, a framework that combines state-of-the-art audio representation, machine learning, and visual analytics to allow users to interactively create classification models, understand noise patterns of a city, and quickly retrieve and label audio excerpts in order to create a large high-precision annotated database of urban sound recordings. We demonstrate the tool's utility through case studies performed by domain experts using data generated over the five-year deployment of a one-of-a-kind sensor network in New York City.
    Semi-supervised Drifted Stream Learning with Short Lookback. (arXiv:2205.13066v1 [cs.LG])
    In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available in the beginning; 3) real-world data are not always i.i.d. and drift gradually over time; 4) the storage of historical streams is limited and model updating can only be achieved based on a very short lookback window. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under such a setting as a semi-supervised drifted stream learning with short lookback problem (SDSL). SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning, continuous learning, and domain adaptation: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. The framework is able to accomplish: 1) robust pseudo-labeling in the generation step; 2) anti-forgetting adaptation in the replay step. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model to leverage supervised knowledge of previously labeled data, unsupervised knowledge of new data, and structure knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat region search problem. We propose a novel minimax game-based replay objective function to solve the flat region search problem and develop an effective optimization solver. Finally, we present extensive experiments to demonstrate that our framework can effectively address the task of anti-forgetting learning in drifted streams with short lookback.
    Factorized Structured Regression for Large-Scale Varying Coefficient Models. (arXiv:2205.13080v1 [stat.ML])
    Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.
    Transferable Adversarial Attack based on Integrated Gradients. (arXiv:2205.13152v1 [cs.LG])
    The vulnerability of deep neural networks to adversarial examples has drawn tremendous attention from the community. Three approaches, optimizing standard objective functions, exploiting attention maps, and smoothing decision surfaces, are commonly used to craft adversarial examples. By tightly integrating the three approaches, we propose a new and simple algorithm named Transferable Attack based on Integrated Gradients (TAIG) in this paper, which can find highly transferable adversarial examples for black-box attacks. Unlike previous methods using multiple computational terms or combining with other methods, TAIG integrates the three approaches into one single term. Two versions of TAIG that compute their integrated gradients on a straight-line path and a random piecewise linear path are studied. Both versions offer strong transferability and can seamlessly work together with the previous methods. Experimental results demonstrate that TAIG outperforms the state-of-the-art methods. The code will be available at https://github.com/yihuang2016/TAIG
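    For readers unfamiliar with the building block named in the abstract, the sketch below approximates integrated gradients along a straight-line path from a baseline to an input. It is not the TAIG attack itself: the toy function `f`, its hand-written gradient, the all-zero baseline, and the step count are assumptions chosen for illustration. The final check uses the standard completeness property, under which the attributions sum to the difference between the function value at the input and at the baseline.

```python
def f(x):
    """Toy differentiable scalar function: f(x) = x0^2 + 3 * x0 * x1."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_f(x):
    """Analytic gradient of the toy function f."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


def integrated_gradients(x, baseline, grad, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients
    along the straight line from `baseline` to `x`."""
    n = len(x)
    total = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            total[i] += g[i]
    # Average gradient along the path, scaled by the displacement per coordinate.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
ig = integrated_gradients(x, baseline, grad_f)
completeness_gap = abs(sum(ig) - (f(x) - f(baseline)))
print(ig, completeness_gap)
```

    A random piecewise linear path, the second variant mentioned in the abstract, would replace the single segment with several segments and sum the per-segment contributions; the details of how TAIG builds that path are not given here.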
    Learning to Query Internet Text for Informing Reinforcement Learning Agents. (arXiv:2205.13079v1 [cs.LG])
    Generalization to out of distribution tasks in reinforcement learning is a challenging problem. One successful approach improves generalization by conditioning policies on task or environment descriptions that provide information about the current transition or reward functions. Previously, these descriptions were often expressed as generated or crowd sourced text. In this work, we begin to tackle the problem of extracting useful information from natural language found in the wild (e.g. internet forums, documentation, and wikis). These natural, pre-existing sources are especially challenging, noisy, and large and present novel challenges compared to previous approaches. We propose to address these challenges by training reinforcement learning agents to learn to query these sources as a human would, and we experiment with how and when an agent should query. To address the \textit{how}, we demonstrate that pretrained QA models perform well at executing zero-shot queries in our target domain. Using information retrieved by a QA model, we train an agent to learn \textit{when} it should execute queries. We show that our method correctly learns to execute queries to maximize reward in a reinforcement learning setting.
    TSEM: Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series. (arXiv:2205.13012v1 [cs.LG])
    Deep learning has become a one-size-fits-all solution for technical and business domains thanks to its flexibility and adaptability. It is implemented using opaque models, which unfortunately undermines the trustworthiness of its outcomes. To better understand the behavior of a system, particularly one driven by time series, it is important to look inside a deep learning model using so-called post hoc eXplainable Artificial Intelligence (XAI) approaches. There are two major types of XAI for time series data, namely model-agnostic and model-specific. A model-specific approach is considered in this work. While other approaches employ either Class Activation Mapping (CAM) or an Attention Mechanism, we merge the two strategies into a single system, simply called the Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series (TSEM). TSEM combines the capabilities of RNN and CNN models in such a way that RNN hidden units are employed as attention weights for the temporal axis of the CNN feature maps. The results show that TSEM outperforms XCM. It is similar to STAM in terms of accuracy, while also satisfying a number of interpretability criteria, including causality, fidelity, and spatiotemporality.
    Preference Dynamics Under Personalized Recommendations. (arXiv:2205.13026v1 [cs.LG])
    Many projects (both practical and academic) have designed algorithms to match users to content they will enjoy under the assumption that users' preferences and opinions do not change with the content they see. Evidence suggests that individuals' preferences are directly shaped by what content they see -- radicalization, rabbit holes, polarization, and boredom are all example phenomena of preferences affected by content. Polarization in particular can occur even in ecosystems with "mass media," where no personalization takes place, as recently explored in a natural model of preference dynamics by~\citet{hkazla2019geometric} and~\citet{gaitonde2021polarization}. If all users' preferences are drawn towards content they already like, or are repelled from content they already dislike, uniform consumption of media leads to a population of heterogeneous preferences converging towards only two poles. In this work, we explore whether some phenomenon akin to polarization occurs when users receive \emph{personalized} content recommendations. We use a similar model of preference dynamics, where an individual's preferences move towards content they consume and enjoy, and away from content they consume and dislike. We show that standard user reward maximization is an almost trivial goal in such an environment (a large class of simple algorithms will achieve only constant regret). A more interesting objective, then, is to understand under what conditions a recommendation algorithm can ensure stationarity of a user's preferences. We show how to design content recommendations which can achieve approximate stationarity, under mild conditions on the set of available content, when a user's preferences are known, and how one can learn enough about a user's preferences to implement such a strategy even when user preferences are initially unknown.
    Trainable Weight Averaging for Fast Convergence and Better Generalization. (arXiv:2205.13104v1 [cs.LG])
    Stochastic gradient descent (SGD) and its variants are commonly considered as the de-facto methods to train deep neural networks (DNNs). While recent improvements to SGD mainly focus on the descent algorithm itself, few works pay attention to utilizing the historical solutions -- as an iterative method, SGD has actually gone through substantial explorations before its final convergence. Recently, an interesting attempt is stochastic weight averaging (SWA), which significantly improves the generalization by simply averaging the solutions at the tail stage of training. In this paper, we propose to optimize the averaging coefficients, leading to our Trainable Weight Averaging (TWA), essentially a novel training method in a reduced subspace spanned by historical solutions. TWA is quite efficient and has good generalization capability as the degree of freedom for training is small. It largely reduces the estimation error from SWA, making it not only further improve the SWA solutions but also take full advantage of the solutions generated in the head of training where SWA fails. In the extensive numerical experiments, (i) TWA achieves consistent improvements over SWA with less sensitivity to learning rate; (ii) applying TWA in the head stage of training largely speeds up the convergence, resulting in over 40% time saving on CIFAR and 30% on ImageNet with improved generalization compared with regular training. The code is released at https://github.com/nblt/TWA.
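    The contrast between uniform averaging and trained averaging coefficients can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the released TWA implementation: the "checkpoints" are made-up two-dimensional weight vectors, the quadratic `loss` against a fixed `target` stands in for a real training loss, and plain gradient descent is used on the coefficients. It starts from the uniform average (the SWA-style combination) and then optimizes the coefficients within the span of the checkpoints.

```python
def combine(checkpoints, coeffs):
    """Weighted combination sum_i coeffs[i] * checkpoints[i]."""
    dim = len(checkpoints[0])
    return [sum(c * w[j] for c, w in zip(coeffs, checkpoints)) for j in range(dim)]


def loss(weights, target):
    """Toy stand-in for a training loss: squared distance to a fixed target."""
    return sum((w - t) ** 2 for w, t in zip(weights, target))


def train_coeffs(checkpoints, target, lr=0.05, steps=500):
    """Gradient descent on the averaging coefficients only."""
    k = len(checkpoints)
    coeffs = [1.0 / k] * k  # initialise at the uniform average
    for _ in range(steps):
        w = combine(checkpoints, coeffs)
        residual = [wj - tj for wj, tj in zip(w, target)]
        # d(loss)/d(coeffs[i]) = 2 * <residual, checkpoints[i]>
        grads = [
            2.0 * sum(r * wi for r, wi in zip(residual, ckpt))
            for ckpt in checkpoints
        ]
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs


# Hypothetical historical solutions and target, chosen only for this example.
checkpoints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.2, 0.7]

uniform = [1.0 / len(checkpoints)] * len(checkpoints)
uniform_loss = loss(combine(checkpoints, uniform), target)

trained = train_coeffs(checkpoints, target)
trained_loss = loss(combine(checkpoints, trained), target)
print(uniform_loss, trained_loss)
```

    Only as many parameters as there are checkpoints are trained here, which mirrors the abstract's point that the degree of freedom for training is small.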
    Designing an Efficient End-to-end Machine Learning Pipeline for Real-time Empty-shelf Detection. (arXiv:2205.13060v1 [cs.LG])
    On-Shelf Availability (OSA) of products in retail stores is a critical business criterion in the fast-moving consumer goods and retail sector. When a product is out-of-stock (OOS) and a customer cannot find it on its designated shelf, this has a negative impact on the customer's behavior and future demand. Several methods are being adopted by retailers today to detect empty shelves and ensure high OSA of products; however, such methods are generally ineffective and infeasible since they are either manual, expensive or less accurate. Recently, machine-learning-based solutions have been proposed, but they suffer from high computation cost and low accuracy due to the lack of large annotated datasets of on-shelf products. Here, we present an elegant approach for designing an end-to-end machine learning (ML) pipeline for real-time empty shelf detection. Considering the strong dependency between the quality of ML models and the quality of data, we focus on the importance of proper data collection, cleaning and correct data annotation before delving into modeling. Since an empty-shelf detection solution should be computationally efficient for real-time predictions, we explore different run-time optimizations to improve the model performance. Our dataset contains 1000 images, collected and annotated by following well-defined guidelines. Our low-latency model achieves a mean average F1-score of 68.5%, and can process up to 67 images/s on Intel Xeon Gold and up to 860 images/s on an A100 GPU. Our annotated dataset is publicly available along with our optimized models.
    BRIGHT -- Graph Neural Networks in Real-Time Fraud Detection. (arXiv:2205.13084v1 [cs.LG])
    Detecting fraudulent transactions is an essential component of controlling risk in e-commerce marketplaces. Apart from rule-based and machine learning filters that are already deployed in production, we want to enable efficient real-time inference with graph neural networks (GNNs), which is useful to catch multihop risk propagation in a transaction graph. However, two challenges arise in the implementation of GNNs in production. First, future information in a dynamic graph should not be considered in message passing to predict the past. Second, the latency of graph query and GNN model inference is usually up to hundreds of milliseconds, which is costly for some critical online services. To tackle these challenges, we propose a Batch and Real-time Inception GrapH Topology (BRIGHT) framework to conduct end-to-end GNN learning that allows efficient online real-time inference. The BRIGHT framework consists of a graph transformation module (Two-Stage Directed Graph) and a corresponding GNN architecture (Lambda Neural Network). The Two-Stage Directed Graph guarantees that the information passed through neighbors is only from the historical payment transactions. It consists of two subgraphs representing historical relationships and real-time links, respectively. The Lambda Neural Network decouples inference into two stages: batch inference of entity embeddings and real-time inference of transaction prediction. Our experiments show that BRIGHT outperforms the baseline models by more than 2% on average w.r.t. precision. Furthermore, BRIGHT is computationally efficient for real-time fraud detection. Regarding end-to-end performance (including neighbor query and inference), BRIGHT can reduce the P99 latency by more than 75%. For the inference stage, our speedup is on average 7.8$\times$ compared to the traditional GNN.
    EvoVGM: A Deep Variational Generative Model for Evolutionary Parameter Estimation. (arXiv:2205.13034v1 [cs.LG])
    Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as it is performed within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69 and GTR. We train the model via a low-variance variational objective function and a gradient ascent algorithm. Here, we show the consistency and effectiveness of the method on synthetic sequence alignments simulated with several evolutionary scenarios and on a real virus sequence alignment.
    Green Hierarchical Vision Transformer for Masked Image Modeling. (arXiv:2205.13515v1 [cs.CV])
    We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. As a result, MIM now can work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
    BiT: Robustly Binarized Multi-distilled Transformer. (arXiv:2205.13016v1 [cs.LG])
    Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, but it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than what was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow, for the first time, fully binarized transformer models that are at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9%.
    People counting system for retail analytics using edge AI. (arXiv:2205.13020v1 [cs.LG])
    Developments in IoT applications play an important role in our day-to-day life, from business predictions to self-driving cars. One of the areas most influenced by AI and IoT is retail analytics. In retail analytics, the conversion rate is a metric most often used by retail stores to relate how many people have visited the store to how many purchases have happened. This retail conversion rate is used to assess marketing operations, increasing stock, store outlets, running promotions, etc. Our project intends to build a cost-effective people counting system with AI at the edge, which calculates conversion rates using the total number of people counted by the system and the number of transactions for the day, and which helps provide analytical insights for retail store optimization with minimal hardware requirements.
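    The conversion-rate calculation described above reduces to one division. The sketch below assumes the common definition (transactions divided by people counted); the function name and the daily figures are hypothetical and are not taken from the project.

```python
def conversion_rate(people_counted: int, transactions: int) -> float:
    """Share of counted visitors whose visit ended in a transaction."""
    if people_counted <= 0:
        # No visitors were counted: report 0 instead of dividing by zero.
        return 0.0
    return transactions / people_counted


# Hypothetical daily figures, for illustration only.
rate = conversion_rate(people_counted=400, transactions=60)
print(f"Conversion rate: {rate:.1%}")
```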
    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification. (arXiv:2205.13094v1 [cs.LG])
    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.
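    The undersampling baseline itself is simple to state in code. The sketch below is one common variant (downsample every group to the size of the smallest group); the function name and the `group_of` callback are assumptions made for illustration, and the abstract does not specify which undersampling rule the analysis uses.

```python
import random
from collections import defaultdict


def undersample(samples, group_of, seed=0):
    """Keep an equal number of samples per group by discarding excess data.

    `group_of` maps a sample to its group (for label shift, the label).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample in samples:
        by_group[group_of(sample)].append(sample)
    # Every group is cut down to the size of the smallest (minority) group.
    n_min = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced dataset: 90 majority and 10 minority samples.
data = [("majority", i) for i in range(90)] + [("minority", i) for i in range(10)]
balanced = undersample(data, group_of=lambda s: s[0])
print(len(balanced))
```

    After balancing, the usable training set size is governed by the minority group, which is the quantity the abstract identifies as the bottleneck.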
    Improving Subgraph Representation Learning via Multi-View Augmentation. (arXiv:2205.13038v1 [cs.LG])
    Subgraph representation learning based on Graph Neural Network (GNN) has broad applications in chemistry and biology, such as molecule property prediction and gene collaborative function prediction. On the other hand, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks but are rarely explored in the GNN-based subgraph representation learning literature. In this work, we developed a novel multiview augmentation mechanism to improve subgraph representation learning and thus the accuracy of downstream prediction tasks. The augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve high training efficiency, scalability, and improved accuracy. Experiments on several real-world subgraph benchmarks demonstrate the superiority of our proposed multi-view augmentation techniques.
    Online Deep Equilibrium Learning for Regularization by Denoising. (arXiv:2205.13051v1 [eess.IV])
    Plug-and-Play Priors (PnP) and Regularization by Denoising (RED) are widely-used frameworks for solving imaging inverse problems by computing fixed-points of operators combining physical measurement models and learned image priors. While traditional PnP/RED formulations have focused on priors specified using image denoisers, there is a growing interest in learning PnP/RED priors that are end-to-end optimal. The recent Deep Equilibrium Models (DEQ) framework has enabled memory-efficient end-to-end learning of PnP/RED priors by implicitly differentiating through the fixed-point equations without storing intermediate activation values. However, the dependence of the computational/memory complexity of the measurement models in PnP/RED on the total number of measurements leaves DEQ impractical for many imaging applications. We propose ODER as a new strategy for improving the efficiency of DEQ through stochastic approximations of the measurement models. We theoretically analyze ODER giving insights into its convergence and ability to approximate the traditional DEQ approach. Our numerical results suggest the potential improvements in training/testing complexity due to ODER on three distinct imaging applications.
    Scalable and Low-Latency Federated Learning with Cooperative Mobile Edge Networking. (arXiv:2205.13054v1 [cs.DC])
    Federated learning (FL) enables collaborative model training without centralizing data. However, the traditional FL framework is cloud-based and suffers from high communication latency. On the other hand, the edge-based FL framework that relies on an edge server co-located with an access point for model aggregation has low communication latency but suffers from degraded model accuracy due to the limited coverage of the edge server. In light of high-accuracy but high-latency cloud-based FL and low-latency but low-accuracy edge-based FL, this paper proposes a new FL framework based on cooperative mobile edge networking called cooperative federated edge learning (CFEL) to enable both high-accuracy and low-latency distributed intelligence at mobile edge networks. Considering the unique two-tier network architecture of CFEL, a novel federated optimization method dubbed cooperative edge-based federated averaging (CE-FedAvg) is further developed, wherein each edge server both coordinates collaborative model training among the devices within its own coverage and cooperates with other edge servers to learn a shared global model through decentralized consensus. Experimental results based on benchmark datasets show that CFEL can largely speed up convergence and reduce the training time needed to achieve a target model accuracy compared with prior FL frameworks.
    Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality. (arXiv:2205.13521v1 [cs.AI])
    Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
    Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation. (arXiv:2103.11802v4 [cs.LG] UPDATED)
    In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
    Comparison of Traditional and Hybrid Time Series Models for Forecasting COVID-19 Cases. (arXiv:2105.03266v2 [cs.SI] UPDATED)
    Time series forecasting methods play a critical role in estimating the spread of an epidemic. The coronavirus outbreak of December 2019 has already infected millions all over the world and continues to spread. Just when the curve of the outbreak had started to flatten, many countries again started to witness a rise in cases, which is now being referred to as the 2nd wave of the pandemic. A thorough analysis of time-series forecasting models is therefore required to equip state authorities and health officials with immediate strategies for future times. The aims of the study are three-fold: (a) to model the overall trend of the spread; (b) to generate a short-term forecast of 10 days in countries with the highest incidence of confirmed cases (USA, India and Brazil); (c) to quantitatively determine the algorithm that is best suited for precise modelling of the linear and non-linear features of the time series. The comparison of forecasting models for the total cumulative cases of each country is carried out by comparing the reported data and the predicted value, and then ranking the algorithms (Prophet, Holt-Winters, LSTM, ARIMA, and ARIMA-NARNN) based on their RMSE, MAE and MAPE values. The hybrid combination of ARIMA and NARNN (Nonlinear Auto-Regression Neural Network) gave the best result among the selected models with a reduced RMSE, which proved to be almost 35.3% better than one of the most prevalent methods of time-series prediction (ARIMA). The results demonstrated the efficacy of the hybrid implementation of the ARIMA-NARNN model over other forecasting methods such as Prophet, Holt-Winters, LSTM, and the ARIMA model in encapsulating the linear as well as non-linear patterns of the epidemical datasets.
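    The three ranking metrics named above have standard definitions, sketched below in plain Python. The `relative_improvement` helper is an assumption about how a figure such as "35.3% better" could be computed (relative reduction in RMSE against a baseline); the abstract does not state the exact formula, and the sample numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, candidate_error):
    """Percentage by which the candidate's error is lower than the baseline's."""
    return 100.0 * (baseline_error - candidate_error) / baseline_error


# Hypothetical reported vs. predicted cumulative case counts.
reported = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
print(rmse(reported, forecast), mae(reported, forecast), mape(reported, forecast))
```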
    Kernel Ridgeless Regression is Inconsistent for Low Dimensions. (arXiv:2205.13525v1 [cs.LG])
    We show that kernel interpolation for a large class of shift-invariant kernels is inconsistent in fixed dimension, even with bandwidth adaptive to the training set.
    Continual evaluation for lifelong learning: Identifying the stability gap. (arXiv:2205.13452v1 [cs.LG])
    Introducing a time dependency on the data generating distribution has proven to be difficult for gradient-based training of neural networks, as the greedy updates result in catastrophic forgetting of previous timesteps. Continual learning aims to overcome the greedy optimization to enable continuous accumulation of knowledge over time. The data stream is typically divided into locally stationary distributions, called tasks, allowing task-based evaluation on held-out data from the training tasks. Contemporary evaluation protocols and metrics in continual learning are task-based and quantify the trade-off between stability and plasticity only at task transitions. However, our empirical evidence suggests that between task transitions significant, temporary forgetting can occur, remaining unidentified in task-based evaluation. Therefore, we propose a framework for continual evaluation that establishes per-iteration evaluation and define a new set of metrics that enables identifying the worst-case performance of the learner over its lifetime. Performing continual evaluation, we empirically identify that replay suffers from a stability gap: upon learning a new task, there is a substantial but transient decrease in performance on past tasks. Further conceptual and empirical analysis suggests not only replay-based, but also regularization-based continual learning methods are prone to the stability gap.
    Are Transformers Effective for Time Series Forecasting?. (arXiv:2205.13504v1 [cs.AI])
    Recently, there has been a surge of Transformer-based solutions for the time series forecasting (TSF) task, especially for the challenging long-term TSF problem. Transformer architecture relies on self-attention mechanisms to effectively extract the semantic correlations between paired elements in a long sequence, which is permutation-invariant and anti-ordering to some extent. However, in time series modeling, we are to extract the temporal relations among an ordered set of continuous points. Consequently, whether Transformer-based techniques are the right solutions for long-term time series forecasting is an interesting problem to investigate, despite the performance improvements shown in these studies. In this work, we question the validity of Transformer-based TSF solutions. In their experiments, the compared (non-Transformer) baselines are mainly autoregressive forecasting solutions, which usually have a poor long-term prediction capability due to inevitable error accumulation effects. In contrast, we use an embarrassingly simple architecture named DLinear that conducts direct multi-step (DMS) forecasting for comparison. DLinear decomposes the time series into a trend and a remainder series and employs two one-layer linear networks to model these two series for the forecasting task. Surprisingly, it outperforms existing complex Transformer-based models in most cases by a large margin. Therefore, we conclude that the relatively higher long-term forecasting accuracy of Transformer-based TSF solutions shown in existing works has little to do with the temporal relation extraction capabilities of the Transformer architecture. Instead, it is mainly due to the non-autoregressive DMS forecasting strategy used in them. We hope this study also advocates revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future.
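    The DLinear structure described above (decompose, apply one linear layer per component, sum) can be sketched without any framework. The moving-average trend extractor with edge padding, the window size and all function names are assumptions made for this illustration; the weights would normally be learned, whereas here they are supplied by hand.

```python
def moving_average(series, window):
    """Trend estimate with the endpoints replicated so the length is preserved."""
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * (window - 1 - half)
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window=5):
    """Split a series into a trend and the remainder (series minus trend)."""
    trend = moving_average(series, window)
    remainder = [x - t for x, t in zip(series, trend)]
    return trend, remainder


def linear(x, weights, bias):
    """One-layer linear map from a lookback window to all horizon steps at once."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def dlinear_forecast(series, w_trend, b_trend, w_rem, b_rem, window=5):
    """Direct multi-step forecast: sum of the two per-component linear outputs."""
    trend, remainder = decompose(series, window)
    y_trend = linear(trend, w_trend, b_trend)
    y_rem = linear(remainder, w_rem, b_rem)
    return [t + r for t, r in zip(y_trend, y_rem)]


# Hand-set weights that copy the last input value to each of 2 horizon steps.
history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
copy_last = [[0.0] * 5 + [1.0] for _ in range(2)]
print(dlinear_forecast(history, copy_last, [0.0, 0.0], copy_last, [0.0, 0.0]))
```

    All horizon steps are produced in a single pass rather than by feeding predictions back in, which is the direct multi-step strategy the abstract contrasts with autoregressive baselines.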
    Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency. (arXiv:2205.13476v1 [cs.LG])
    Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v1 [cs.LG])
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason lies in that most RPEs are placed in the softmax attention that always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.
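    The right-stochastic property mentioned above is easy to check numerically. The sketch below is an illustration under assumptions, not the paper's construction: the additive bias indexed by the offset `j - i`, the function names and the toy inputs are all chosen here. It shows that every attention row sums to 1, and one elementary consequence: when all value vectors are identical, the output is that same vector regardless of the RPE bias.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rpe_attention_matrix(scores, rpe_bias):
    """Softmax attention with an additive bias that depends only on j - i.

    `rpe_bias` is a dict mapping each relative offset to a float.
    """
    n = len(scores)
    return [softmax([scores[i][j] + rpe_bias[j - i] for j in range(n)]) for i in range(n)]


def attend(attn, values):
    """Multiply the attention matrix by the matrix of value vectors."""
    dim = len(values[0])
    return [[sum(a * v[d] for a, v in zip(row, values)) for d in range(dim)] for row in attn]


n = 3
scores = [[0.0] * n for _ in range(n)]
bias = {k: 0.5 * k for k in range(-(n - 1), n)}
attn = rpe_attention_matrix(scores, bias)
print([sum(row) for row in attn])

# Identical value vectors: each output row is a convex combination of equal rows.
same_values = [[1.0, 2.0]] * n
print(attend(attn, same_values))
```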
    Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback. (arXiv:2205.13451v1 [cs.LG])
    We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
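    For readers unfamiliar with FTPL, the basic principle can be sketched in a much simpler setting than the one studied here. The code below is a deliberately simplified illustration: it uses a finite set of actions with full-information feedback instead of an MDP with bandit feedback, and the exponential perturbation, the `rate` parameter and the function names are assumptions, not the paper's algorithm. The `min` over perturbed losses plays the role of the offline optimization step.

```python
import random


def ftpl_choose(cumulative_losses, rate, rng):
    """Pick the action with the smallest randomly perturbed cumulative loss."""
    perturbed = [loss - rng.expovariate(rate) for loss in cumulative_losses]
    return min(range(len(perturbed)), key=perturbed.__getitem__)


def run_ftpl(loss_rounds, rate=1.0, seed=0):
    """Play FTPL over a sequence of loss vectors and return the total loss incurred."""
    rng = random.Random(seed)
    cumulative = [0.0] * len(loss_rounds[0])
    total = 0.0
    for losses in loss_rounds:
        action = ftpl_choose(cumulative, rate, rng)
        total += losses[action]
        # Full-information update: the losses of all actions are observed.
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return total


# Hypothetical loss sequence in which action 0 is always the better one.
rounds = [[0.0, 1.0]] * 100
print(run_ftpl(rounds))
```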
    The Neuro-Symbolic Brain. (arXiv:2205.13440v1 [cs.NE])
    Neural networks promote a distributed representation with no clear place for symbols. Despite this, we propose that symbols are manufactured simply by training a sparse random noise as a self-sustaining attractor in a feedback spiking neural network. This way, we can generate many of what we shall call prime attractors, and the networks that support them are like registers holding a symbolic value, and we call them registers. Like symbols, prime attractors are atomic and devoid of any internal structure. Moreover, the winner-take-all mechanism naturally implemented by spiking neurons enables registers to recover a prime attractor within a noisy signal. Using this faculty, when considering two connected registers, an input one and an output one, it is possible to bind in one shot using a Hebbian rule the attractor active on the output to the attractor active on the input. Thus, whenever an attractor is active on the input, it induces its bound attractor on the output; even though the signal gets blurrier with more bindings, the winner-take-all filtering faculty can recover the bound prime attractor. However, the capacity is still limited. It is also possible to unbind in one shot, restoring the capacity taken by that binding. This mechanism serves as a basis for working memory, turning prime attractors into variables. Also, we use a random second-order network to amalgamate the prime attractors held by two registers to bind the prime attractor held by a third register to them in one shot, de facto implementing a hash table. Furthermore, we introduce the register switch box composed of registers to move the content of one register to another. Then, we use spiking neurons to build a toy symbolic computer based on the above. The techniques used suggest ways to design extrapolating, reusable, sample-efficient deep learning networks at the cost of structural priors.
    Principled Knowledge Extrapolation with GANs. (arXiv:2205.13444v1 [cs.LG])
    Humans can extrapolate well, generalize daily knowledge to unseen scenarios, and raise and answer counterfactual questions. To imitate this ability via generative models, previous works have extensively studied explicitly encoding Structural Causal Models (SCMs) into architectures of generator networks. This methodology, however, limits the flexibility of generators, as they must be carefully crafted to follow the causal graph, and demands a ground truth SCM with a strong ignorability assumption as prior, which is a nontrivial assumption in many real scenarios. Thus, many current causal GAN methods fail to generate high fidelity counterfactual results as they cannot easily leverage state-of-the-art generative models. In this paper, we propose to study counterfactual synthesis from a new perspective of knowledge extrapolation, where a given knowledge dimension of the data distribution is extrapolated, but the remaining knowledge is kept indistinguishable from the original distribution. We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem, and a novel principal knowledge descent method can efficiently estimate the extrapolated distribution through the adversarial game. Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.
    A framework for overparameterized learning. (arXiv:2205.13507v1 [cs.LG])
    An explanation for the success of deep neural networks is a central question in theoretical machine learning. According to classical statistical learning, the overparameterized nature of such models should imply a failure to generalize. Many argue that good empirical performance is due to the implicit regularization of first order optimization methods. In particular, the Polyak-Łojasiewicz condition leads to gradient descent finding a global optimum that is close to initialization. In this work, we propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and infinite data. We then perform an analysis from the perspective of the Polyak-Łojasiewicz condition. We obtain theoretical results of independent interest, concerning gradient descent on a composition $(f \circ F): G \to \mathbb{R}$ of functions $F: G \to H$ and $f: H \to \mathbb{R}$ with $G, H$ being Hilbert spaces. Building on these results, we determine the properties that have to be satisfied by the components of the prototype problem for gradient descent to find a global optimum that is close to initialization. We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem. Finally, we lay out a number of directions for future research.
    A Rotated Hyperbolic Wrapped Normal Distribution for Hierarchical Representation Learning. (arXiv:2205.13371v1 [cs.LG])
    We present a rotated hyperbolic wrapped normal distribution (RoWN), a simple yet effective alteration of a hyperbolic wrapped normal distribution (HWN). The HWN expands the domain of probabilistic modeling from Euclidean to hyperbolic space, where a tree can be embedded with arbitrarily low distortion in theory. In this work, we analyze the geometric properties of the diagonal HWN, a standard choice of distribution in probabilistic modeling. The analysis shows that the distribution is inappropriate to represent the data points at the same hierarchy level through their angular distance with the same norm in the Poincaré disk model. We then empirically verify the presence of limitations of HWN, and show how RoWN, the newly proposed distribution, can alleviate the limitations on various hierarchical datasets, including noisy synthetic binary tree, WordNet, and Atari 2600 Breakout.
    Multi-fidelity power flow solver. (arXiv:2205.13362v1 [cs.LG])
    We propose a multi-fidelity neural network (MFNN) tailored for rapid high-dimensional grid power flow simulations and contingency analysis with scarce high-fidelity contingency data. The proposed model comprises two networks -- the first one trained on DC approximation as low-fidelity data and coupled to a high-fidelity neural net trained on both low- and high-fidelity power flow data. Each network features a latent module which parametrizes the model by a discrete grid topology vector for generalization (e.g., $n$ power lines with $k$ disconnections or contingencies, if any), and the targeted high-fidelity output is a weighted sum of linear and nonlinear functions. We tested the model on 14- and 118-bus test cases and evaluated its performance based on the $n-k$ power flow prediction accuracy with respect to imbalanced contingency data and high-to-low-fidelity sample ratio. The results presented herein demonstrate MFNN's potential and its limits with up to two orders of magnitude faster and more accurate power flow solutions than DC approximation.
    Constrained Reinforcement Learning for Short Video Recommendation. (arXiv:2205.13248v1 [cs.LG])
    The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users provide complex and multi-faceted responses towards recommendations, including watch time and various types of interactions with videos. As a result, established recommendation algorithms that concern a single objective are not adequate to meet this new demand of optimizing comprehensive user experiences. In this paper, we formulate the problem of short video recommendation as a constrained Markov Decision Process (MDP), where platforms want to optimize the main goal of user watch time in the long term, with the constraint of accommodating the auxiliary responses of user interactions such as sharing/downloading videos. To solve the constrained MDP, we propose a two-stage reinforcement learning approach based on the actor-critic framework. At stage one, we learn individual policies to optimize each auxiliary response. At stage two, we learn a policy to (i) optimize the main response and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive simulations, we demonstrate the effectiveness of our approach over alternatives both in optimizing the main goal and in balancing the others. We further show the advantage of our approach in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of watch time and interactions from video views. Our approach has been fully launched in the production system to optimize user experiences on the platform.
    Learning to Accelerate by the Methods of Step-size Planning. (arXiv:2204.01705v4 [cs.LG] UPDATED)
    Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods for accelerating gradient descent that are distinct from existing techniques. The new methods, which we call step-size planning, use the update experience to learn an improved way of updating the parameters. The methods organize the experience into $K$ steps away from each other to facilitate planning. From the past experience, our planning algorithm, Csawg, learns a step-size model which is a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning over multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large scale applications. We show that, for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu, L$ are the strong convexity factor of the loss function $F$ and the Lipschitz constant of $F'$, which is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error below 500 gradient evaluations, while gradient descent takes about 10000 gradient evaluations to reach a $10^{-3}$ accuracy. We discuss the connection of step-size planning to planning in reinforcement learning, in particular, Dyna architectures. (This is a shorter abstract than in the paper because of length requirement)
    DT+GNN: A Fully Explainable Graph Neural Network using Decision Trees. (arXiv:2205.13234v1 [cs.LG])
    We propose the fully explainable Decision Tree Graph Neural Network (DT+GNN) architecture. In contrast to existing black-box GNNs and post-hoc explanation methods, the reasoning of DT+GNN can be inspected at every step. To achieve this, we first construct a differentiable GNN layer, which uses a categorical state space for nodes and messages. This allows us to convert the trained MLPs in the GNN into decision trees. These trees are pruned using our newly proposed method to ensure they are small and easy to interpret. We can also use the decision trees to compute traditional explanations. We demonstrate on both real-world datasets and synthetic GNN explainability benchmarks that this architecture works as well as traditional GNNs. Furthermore, we leverage the explainability of DT+GNNs to find interesting insights into many of these datasets, with some surprising results. We also provide an interactive web tool to inspect DT+GNN's decision making.
    Friends to Help: Saving Federated Learning from Client Dropout. (arXiv:2205.13222v1 [cs.LG])
    Federated learning (FL) is an outstanding distributed machine learning framework due to its benefits on data privacy and communication efficiency. Since full client participation in many cases is infeasible due to constrained resources, partial participation FL algorithms have been investigated that proactively select/sample a subset of clients, aiming to achieve learning performance close to the full participation case. This paper studies a passive partial client participation scenario that is much less well understood, where partial participation is a result of external events, namely client dropout, rather than a decision of the FL algorithm. We cast FL with client dropout as a special case of a larger class of FL problems where clients can submit substitute (possibly inaccurate) local model updates. Based on our convergence analysis, we develop a new algorithm FL-FDMS that discovers friends of clients (i.e., clients whose data distributions are similar) on-the-fly and uses friends' local updates as substitutes for the dropout clients, thereby reducing the substitution error and improving the convergence performance. A complexity reduction mechanism is also incorporated into FL-FDMS, making it both theoretically sound and practically useful. Experiments on MNIST and CIFAR-10 confirmed the superior performance of FL-FDMS in handling client dropout in FL.
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v1 [cs.LG])
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over the conventional CO solvers as DRL-NCO is capable of learning CO solvers without supervised labels attained from the verified solver. This paper presents a novel training scheme, Sym-NCO, that achieves significant performance increments to existing DRL-NCO methods. Sym-NCO is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Imposing symmetricities such as rotational and reflectional invariance can greatly improve generalization capability of DRL-NCO as symmetricities are invariant features shared by certain CO tasks. Our experimental results verify that our Sym-NCO greatly improves the performance of DRL-NCO methods in four CO tasks, including traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without employing problem-specific techniques. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, the iterative local search (ILS), in PCTSP at 240 times faster speed.
    Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks. (arXiv:2205.13283v1 [cs.LG])
    Unraveling the general structure underlying the loss landscapes of deep neural networks (DNNs) is important for the theoretical study of deep learning. Inspired by the embedding principle of DNN loss landscape, we prove in this work an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes for shallower NNs. Specifically, we propose a critical lifting operator by which any critical point of a shallower network can be lifted to a critical manifold of the target network while preserving the outputs. Through lifting, a local minimum of an NN can become a strict saddle point of a deeper NN, which can be easily escaped by first-order methods. The embedding principle in depth reveals a large family of critical points in which layer linearization happens, i.e., computation of certain layers is effectively linear for the training inputs. We empirically demonstrate that, through suppressing layer linearization, batch normalization helps avoid the lifted critical manifolds, resulting in a faster decay of loss. We also demonstrate that increasing training data reduces the lifted critical manifold and thus could accelerate the training. Overall, the embedding principle in depth well complements the embedding principle (in width), resulting in a complete characterization of the hierarchical structure of critical points/manifolds of a DNN loss landscape.
    Aggregating Gradients in Encoded Domain for Federated Learning. (arXiv:2205.13216v1 [cs.CR])
    Malicious attackers and an honest-but-curious server can steal private client data from uploaded gradients in federated learning. Although current protection methods (e.g., additive homomorphic cryptosystem) can guarantee the security of the federated learning system, they bring additional computation and communication costs. To mitigate the cost, we propose the FedAGE framework, which enables the server to aggregate gradients in an encoded domain without accessing raw gradients of any single client. Thus, FedAGE can prevent the curious server from gradient stealing while maintaining the same prediction performance without additional communication costs. Furthermore, we theoretically prove that the proposed encoding-decoding framework is a Gaussian mechanism for differential privacy. Finally, we evaluate FedAGE under several federated settings, and the results have demonstrated the efficacy of the proposed framework.
    Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information. (arXiv:2205.13300v1 [cs.CL])
    Non-negative matrix factorization (NMF) based topic modeling is widely used in natural language processing (NLP) to uncover hidden topics of short text documents. Usually, training a high-quality topic model requires a large amount of textual data. In many real-world scenarios, customer textual data is private and sensitive, precluding uploading to data centers. This paper proposes a Federated NMF (FedNMF) framework, which allows multiple clients to collaboratively train a high-quality NMF based topic model with locally stored data. However, standard federated learning will significantly undermine the performance of topic models in downstream tasks (e.g., text classification) when the data distribution over clients is heterogeneous. To alleviate this issue, we further propose FedNMF+MI, which simultaneously maximizes the mutual information (MI) between the count features of local texts and their topic weight vectors to mitigate the performance degradation. Experimental results show that our FedNMF+MI methods outperform Federated Latent Dirichlet Allocation (FedLDA) and the FedNMF without MI methods for short texts by a significant margin on both coherence score and classification F1 score.
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v1 [cs.LG])
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over $T$ rounds ($NT$ actions in total) the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is at most $\tilde{\mathcal{O}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of both regret and communication cost. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their immediate neighbors through a carefully designed consensus procedure.
    Entropy Maximization with Depth: A Variational Principle for Random Neural Networks. (arXiv:2205.13076v1 [cs.LG])
    To understand the essential role of depth in neural networks, we investigate a variational principle for depth: Does increasing depth perform an implicit optimization for the representations in neural networks? We prove that random neural networks equipped with batch normalization maximize the differential entropy of representations with depth up to constant factors, assuming that the representations are contractive. Thus, representations inherently obey the principle of maximum entropy at initialization, in the absence of information about the learning task. Our variational formulation for neural representations characterizes the interplay between representation entropy and architectural components, including depth, width, and non-linear activations, thereby potentially inspiring the design of neural architectures.
    Fast Vision Transformers with HiLo Attention. (arXiv:2205.13213v1 [cs.CV])
    Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with direct metrics such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group performs the attention to model the global relationship between the average-pooled low-frequency keys from each window and each query position in the input feature map. Benefiting from the efficient design of both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking on FLOPs, speed and memory consumption on GPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/zip-group/LITv2.
    More Recent Advances in (Hyper)Graph Partitioning. (arXiv:2205.13202v1 [cs.DS])
    In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.
    Unsupervised Learning From Incomplete Measurements for Inverse Problems. (arXiv:2201.12151v3 [stat.ML] UPDATED)
    In many real-world inverse problems, only incomplete measurement data are available for training which can pose a problem for learning a reconstruction function. Indeed, unsupervised learning using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the underlying signal model needed for reconstruction which indicate the interplay between the number of distinct measurement operators, the number of measurements per operator, the dimension of the model and the dimension of the signals. Furthermore, we propose a novel and conceptually simple unsupervised learning loss which only requires access to incomplete measurement data and achieves a performance on par with supervised learning when the sufficient condition is verified. We validate our theoretical bounds and demonstrate the advantages of the proposed unsupervised loss compared to previous methods via a series of experiments on various imaging inverse problems, such as accelerated magnetic resonance imaging, compressed sensing and image inpainting.
    Forecasting Patient Demand at Urgent Care Clinics using Machine Learning. (arXiv:2205.13067v1 [cs.LG])
    Urgent care clinics and emergency departments around the world periodically suffer from extended wait times beyond patient expectations due to inadequate staffing levels. These delays have been linked with adverse clinical outcomes. Previous research into forecasting demand in this domain has mostly used a collection of statistical techniques, with machine learning approaches only now beginning to emerge in recent literature. The forecasting problem for this domain is difficult and has also been complicated by the COVID-19 pandemic, which has introduced an additional complexity to this estimation due to typical demand patterns being disrupted. This study explores the ability of machine learning methods to generate accurate forecasts of patient presentations at two large urgent care clinics located in Auckland, New Zealand. A number of machine learning algorithms were explored in order to determine the most effective technique for this problem domain, with the task of making forecasts of daily patient demand three months in advance. The study also performed an in-depth analysis of the model behaviour, exploring which features are most effective at predicting demand and which features are capable of adaptation to the volatility caused by the COVID-19 pandemic lockdowns. The results showed that ensemble-based methods delivered the most accurate and consistent solutions on average, generating improvements in the range of 23%-27% over the existing in-house methods for estimating the daily demand.
    Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes. (arXiv:2205.13068v1 [cs.LG])
    We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound on the worst-case error of the best map from attributes to classes for this setting, even with perfect attribute detectors. The lower bound characterizes the theoretical intrinsic difficulty of the zero-shot problem based on the available information -- the class-attribute matrix -- and the bound is practically computable from it. Our lower bound is tight, as we show that we can always find a randomized map from attributes to classes whose expected error is upper bounded by the value of the lower bound. We show that our analysis can be predictive of how standard zero-shot methods behave in practice, including which classes will likely be confused with others.
    Grammar Detection for Sentiment Analysis through Improved Viterbi Algorithm. (arXiv:2205.13148v1 [cs.CL])
    Grammar Detection, also referred to as Parts of Speech Tagging of raw text, is considered an underlying building block of various Natural Language Processing pipelines such as named entity recognition, question answering, and sentiment analysis. In short, given a sentence, Parts of Speech tagging is the task of tagging each word of the sentence as a noun, verb, adjective, adverb, and so on. Sentiment Analysis is a procedure used to determine whether a given sentence's emotional tone is neutral, positive or negative. To assign polarity scores to the thesis or entities within a phrase, text analysis and analytics, machine learning, and natural language processing approaches are incorporated. Sentiment Analysis using a POS tagger helps us obtain a summary of the broader public's opinion on a specific topic. For this, we use the Viterbi algorithm, Hidden Markov Model, and Constraint based Viterbi algorithm for POS tagging. By comparing the accuracies, we select the most accurate result of the model for Sentiment Analysis for determining the character of the sentence.
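    As an illustration of the POS-tagging step described above, here is a minimal sketch of the textbook Viterbi algorithm for a Hidden Markov Model. It is not the paper's improved or constraint-based variant. The two-tag model, its probabilities (`TAGS`, `START_P`, `TRANS_P`, `EMIT_P`) and the `unk` fallback for unseen words are all made-up assumptions for the example.

```python
from typing import Dict, List

# Hypothetical toy HMM (all numbers invented for illustration).
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {
    "NOUN": {"dogs": 0.5, "cats": 0.4, "bark": 0.1},
    "VERB": {"bark": 0.6, "chase": 0.4},
}


def viterbi(
    words: List[str],
    tags: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    unk: float = 1e-6,  # assumed emission probability for unseen words
) -> List[str]:
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []

    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unk), None) for t in tags}]

    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = emit_p[t].get(words[i], unk)
            # Pick the predecessor tag that maximizes the path probability.
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit, p) for p in tags
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


tagged = viterbi(["dogs", "chase", "cats"], TAGS, START_P, TRANS_P, EMIT_P)
```

    For long sentences a practical tagger would work with log-probabilities to avoid numerical underflow; plain products are used here only to keep the sketch short.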
    Learning to segment with limited annotations: Self-supervised pretraining with regression and contrastive loss in MRI. (arXiv:2205.13109v1 [cs.CV])
    Obtaining manual annotations for large datasets for supervised training of deep learning (DL) models is challenging. The availability of large unlabeled datasets compared to labeled ones motivates the use of self-supervised pretraining to initialize DL models for subsequent segmentation tasks. In this work, we consider two pre-training approaches for driving a DL model to learn different representations using: a) regression loss that exploits spatial dependencies within an image and b) contrastive loss that exploits semantic similarity between pairs of images. The effect of pretraining techniques is evaluated in two downstream segmentation applications using Magnetic Resonance (MR) images: a) liver segmentation in abdominal T2-weighted MR images and b) prostate segmentation in T2-weighted MR images of the prostate. We observed that DL models pretrained using self-supervision can be finetuned for comparable performance with fewer labeled datasets. Additionally, we observed that initializing the DL model using contrastive loss based pretraining performed better than the regression loss.
    Independent Asymmetric Embedding for Information Diffusion Prediction on Social Networks. (arXiv:2105.08291v6 [cs.LG] UPDATED)
    The prediction of information diffusion on social networks has great practical significance in marketing and public opinion control. It aims to predict the individuals who will potentially repost the message on the social network. One type of method is based on demographics, complex networks and other prior knowledge to establish an interpretable model to simulate and predict the propagation process, while the other type of method is completely data-driven and maps the nodes to a latent space for propagation prediction. Existing latent space design and embedding methods lack consideration for the interference among users. In this paper, we propose an independent asymmetric embedding method to embed each individual into one latent influence space and multiple latent susceptibility spaces. Based on the similarity between information diffusion and the heat diffusion phenomenon, the heat diffusion kernel is exploited in our model and establishes the embedding rules. Furthermore, our method captures the co-occurrence regulation of user combinations in cascades to improve computational efficiency. The results of extensive experiments conducted on real-world datasets verify both the predictive accuracy and cost-effectiveness of our approach.
    Deep-XFCT: Deep learning 3D-mineral liberation analysis with micro X-ray fluorescence and computed tomography. (arXiv:2205.13102v1 [cs.LG])
    The rapid development of X-ray micro-computed tomography (micro-CT) opens new opportunities for 3D analysis of particle and grain-size characterisation, determination of particle densities and shape factors, estimation of mineral associations and liberation and locking. Current practices in mineral liberation analysis are based on 2D representations leading to systematic errors in the extrapolation to volumetric properties. New quantitative methods based on tomographic data are therefore urgently required for characterisation of mineral deposits, mineral processing, characterisation of tailings, rock typing, stratigraphic refinement, reservoir characterisation for applications in the resource industry, environmental and material sciences. To date, no simple non-destructive method exists for 3D mineral liberation analysis. We present a new development based on combining micro-CT with micro-X-ray fluorescence (micro-XRF) using deep learning. We demonstrate successful semi-automated multi-modal analysis of a crystalline magmatic rock where the new technique overcomes the difficult task of differentiating feldspar from quartz in a micro-CT data set. The approach is universal and can be extended to any multi-modal and multi-instrument analysis for further refinement. We conclude that the combination of micro-CT and micro-XRF already provides a new opportunity for robust 3D mineral liberation analysis in both field and laboratory applications.
    Contextual Pandora's Box. (arXiv:2205.13114v1 [cs.LG])
    Pandora's Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate priors are given for the values of all the alternatives, while recent work studies the online variant of Pandora's Box where priors are originally unknown. In this work, we extend Pandora's Box to the online setting, while incorporating context. At every round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown prior distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well to the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to the reservation value of the corresponding distribution rather than its mean.
    Unsupervised Reinforcement Adaptation for Class-Imbalanced Text Classification. (arXiv:2205.13139v1 [cs.CL])
    Class imbalance naturally exists when training and testing models in different domains. Unsupervised domain adaptation (UDA) augments model performance with only accessible annotations from the source domain and unlabeled data from the target domain. However, existing state-of-the-art UDA models learn domain-invariant representations and evaluate primarily on class-balanced data across domains. In this work, we propose an unsupervised domain adaptation approach via reinforcement learning that jointly leverages feature variants and imbalanced labels across domains. We experiment with the text classification task for its easily accessible datasets and compare the proposed method with five baselines. Experiments on three datasets prove that our proposed method can effectively learn robust domain-invariant representations and successfully adapt text classifiers on imbalanced classes over domains. The code is available at https://github.com/woqingdoua/ImbalanceClass.
    Leveraging Dependency Grammar for Fine-Grained Offensive Language Detection using Graph Convolutional Networks. (arXiv:2205.13164v1 [cs.CL])
    The last few years have witnessed an exponential rise in the propagation of offensive text on social media. Identification of this text with high precision is crucial for the well-being of society. Most of the existing approaches tend to give high toxicity scores to innocuous statements (e.g., "I am a gay man"). These false positives result from over-generalization on the training data where specific terms in the statement may have been used in a pejorative sense (e.g., "gay"). Emphasis on such words alone can lead to discrimination against the classes these systems are designed to protect. In this paper, we address the problem of offensive language detection on Twitter, while also detecting the type and the target of the offence. We propose a novel approach called SyLSTM, which integrates syntactic features in the form of the dependency parse tree of a sentence and semantic features in the form of word embeddings into a deep learning architecture using a Graph Convolutional Network. Results show that the proposed approach significantly outperforms the state-of-the-art BERT model with orders of magnitude fewer parameters.
    Towards Green AI with tensor networks -- Sustainability and innovation enabled by efficient algorithms. (arXiv:2205.12961v1 [cs.LG])
    The current standard to compare the performance of AI algorithms is mainly based on one criterion: the model's accuracy. In this context, algorithms with a higher accuracy (or similar measures) are considered as better. To achieve new state-of-the-art results, algorithmic development is accompanied by an exponentially increasing amount of compute. While this has enabled AI research to achieve remarkable results, AI progress comes at a cost: it is unsustainable. In this paper, we present a promising tool for sustainable and thus Green AI: tensor networks (TNs). Being an established tool from multilinear algebra, TNs have the capability to improve efficiency without compromising accuracy. Since they can reduce compute significantly, we would like to highlight their potential for Green AI. We elaborate in both a kernel machine and deep learning setting how efficiency gains can be achieved with TNs. Furthermore, we argue that better algorithms should be evaluated in terms of both accuracy and efficiency. To that end, we discuss different efficiency criteria and analyze efficiency in an exemplifying experimental setting for kernel ridge regression. With this paper, we want to raise awareness about Green AI and showcase its positive impact on sustainability and AI research. Our key contribution is to demonstrate that TNs enable efficient algorithms and therefore contribute towards Green AI. In this sense, TNs pave the way for better algorithms in AI.
    Formalizing Preferences Over Runtime Distributions. (arXiv:2205.13028v1 [cs.AI])
    When trying to solve a computational problem we are often faced with a choice among algorithms that are all guaranteed to return the right answer but that differ in their runtime distributions (e.g., SAT solvers, sorting algorithms). This paper aims to lay theoretical foundations for such choices by formalizing preferences over runtime distributions. It might seem that we should simply prefer the algorithm that minimizes expected runtime. However, such preferences would be driven by exactly how slow our algorithm is on bad inputs, whereas in practice we are typically willing to cut off occasional, sufficiently long runs before they finish. We propose a principled alternative, taking a utility-theoretic approach to characterize the scoring functions that describe preferences over algorithms. These functions depend on the way our value for solving our problem decreases with time and on the distribution from which captimes are drawn. We describe examples of realistic utility functions and show how to leverage a maximum-entropy approach for modeling underspecified captime distributions. Finally, we show how to efficiently estimate an algorithm's expected utility from runtime samples.
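    The contrast between mean runtime and a captime-aware score can be illustrated with a small Monte Carlo estimate. This is only a sketch under stated assumptions, not the paper's method: the function names, the linear utility, the fixed captime, and the rule that a capped run earns zero utility are all hypothetical choices made for illustration.

```python
def linear_utility(t, horizon=10.0):
    """Hypothetical utility: value decays linearly to 0 at `horizon` seconds."""
    return max(0.0, 1.0 - t / horizon)


def expected_utility(runtimes, captime, utility=linear_utility):
    """Average utility over runtime samples with a fixed captime.

    Assumption: a run longer than `captime` is cut off and earns utility 0.
    """
    total = 0.0
    for t in runtimes:
        if t <= captime:
            total += utility(t)
    return total / len(runtimes)


# Algorithm A is usually fast but occasionally very slow; B is steady.
runs_a = [1.0, 1.0, 1.0, 1000.0]
runs_b = [5.0, 5.0, 5.0, 5.0]

mean_a = sum(runs_a) / len(runs_a)
mean_b = sum(runs_b) / len(runs_b)
score_a = expected_utility(runs_a, captime=10.0)
score_b = expected_utility(runs_b, captime=10.0)
```

    With these made-up samples, expected runtime favors B while the capped utility score favors A, which is the kind of disagreement the abstract describes.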
    Uniform Generalization Bound on Time and Inverse Temperature for Gradient Descent Algorithm and its Application to Analysis of Simulated Annealing. (arXiv:2205.12959v1 [cs.LG])
    In this paper, we propose a novel uniform generalization bound on the time and inverse temperature for stochastic gradient Langevin dynamics (SGLD) in a non-convex setting. While previous works derive their generalization bounds by uniform stability, we use Rademacher complexity to make our generalization bound independent of the time and inverse temperature. Using Rademacher complexity, we can reduce the problem of deriving a generalization bound on the whole space to that of deriving one on a bounded region, and can therefore remove the effect of the time and inverse temperature from our generalization bound. As an application of our generalization bound, an evaluation of the effectiveness of simulated annealing in a non-convex setting is also described. For the sample size $n$ and time $s$, we derive evaluations with orders $\sqrt{n^{-1} \log (n+1)}$ and $|(\log)^4(s)|^{-1}$, respectively. Here, $(\log)^4$ denotes the $4$-fold composition of the logarithmic function.
    Concurrent Neural Tree and Data Preprocessing AutoML for Image Classification. (arXiv:2205.13033v1 [cs.LG])
    Deep Neural Networks (DNNs) are a widely-used solution for a variety of machine learning problems. However, it is often necessary to invest a significant amount of a data scientist's time to pre-process input data, test different neural network architectures, and tune hyper-parameters for optimal performance. Automated machine learning (autoML) methods automatically search the architecture and hyper-parameter space for optimal neural networks. However, current state-of-the-art (SOTA) methods do not include traditional methods for manipulating input data as part of the algorithmic search space. We adapt the Evolutionary Multi-objective Algorithm Design Engine (EMADE), a multi-objective evolutionary search framework for traditional machine learning methods, to perform neural architecture search. We also integrate EMADE's signal processing and image processing primitives. These primitives allow EMADE to manipulate input data before ingestion into the simultaneously evolved DNN. We show that including these methods as part of the search space has the potential to improve performance on the CIFAR-10 image classification benchmark dataset.
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v1 [cs.LG])
    It is well-known that the worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of time steps (ignoring the dependency on sparsity). On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th time step. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime ($\sigma_t = \Omega(1)$) and the benign deterministic regimes ($\sigma_t = 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a ``black-box'' manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.
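    The interpolation claim can be checked by substituting the two extreme noise regimes into the stated bound. The constant $\sigma$ below is notation introduced here only for the check:

```latex
\begin{align*}
\sigma_t \equiv \sigma \ \text{(constant variance)}:&\quad
\widetilde{\mathcal O}\Big(\sqrt{d \textstyle\sum_{t=1}^T \sigma^2} + 1\Big)
 = \widetilde{\mathcal O}\big(\sigma\sqrt{dT} + 1\big), \\
\sigma_t \equiv 0 \ \text{(deterministic)}:&\quad
\widetilde{\mathcal O}\big(\sqrt{d \cdot 0} + 1\big)
 = \widetilde{\mathcal O}(1).
\end{align*}
```

    For $\sigma = \Theta(1)$ the first line matches the worst-case $\sqrt{dT}$ rate, and the second line matches the noiseless regime.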
    Learning the spatio-temporal relationship between wind and significant wave height using deep learning. (arXiv:2205.13325v1 [stat.ML])
    Ocean wave climate has a significant impact on near-shore and off-shore human activities, and its characterisation can help in the design of ocean structures such as wave energy converters and sea dikes. Therefore, engineers need long time series of ocean wave parameters. Numerical models are a valuable source of ocean wave data; however, they are computationally expensive. Consequently, statistical and data-driven approaches have gained increasing interest in recent decades. This work investigates the spatio-temporal relationship between North Atlantic wind and significant wave height (Hs) at an off-shore location in the Bay of Biscay, using a two-stage deep learning model. The first step uses convolutional neural networks (CNNs) to extract the spatial features that contribute to Hs. Then, long short-term memory (LSTM) is used to learn the long-term temporal dependencies between wind and waves.
    Towards Symbolic Time Series Representation Improved by Kernel Density Estimators. (arXiv:2205.12960v1 [cs.LG])
    This paper deals with symbolic time series representation. It builds upon the popular mapping technique Symbolic Aggregate approXimation (SAX), which is extensively used in sequence classification, pattern mining, anomaly detection, time series indexing, and other data mining tasks. However, the disadvantage of this method is that it works reliably only for time series with a Gaussian-like distribution. In our previous work we proposed an improvement of SAX, called dwSAX, which can deal with Gaussian as well as non-Gaussian data distributions. Recently we have made further progress with our solution, edwSAX. Our goal was to optimally cover the information space by means of sufficient alphabet utilization, and to satisfy the lower-bounding criterion as tightly as possible. We describe our approach here, including an evaluation on commonly employed tasks such as time series reconstruction error and Euclidean distance lower bounding, with promising improvements over SAX.
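    For context, the baseline SAX mapping can be sketched in a few lines. This is a minimal illustration of classic SAX only (z-normalization, piecewise aggregate approximation, equiprobable Gaussian breakpoints), not of dwSAX or edwSAX, whose details are not given in the abstract. The function name and default alphabet are assumptions.

```python
from statistics import NormalDist

import numpy as np


def sax(series, n_segments, alphabet="abcd"):
    """Classic SAX: z-normalize, reduce with PAA, discretize with Gaussian breakpoints."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    # z-normalize; a constant series maps to all zeros
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    # Piecewise Aggregate Approximation: mean of each (near-)equal-length segment
    paa = np.array([seg.mean() for seg in np.array_split(z, n_segments)])
    # Breakpoints split the standard normal into len(alphabet) equiprobable regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    # searchsorted gives the region index of every PAA value
    idx = np.searchsorted(breakpoints, paa)
    return "".join(alphabet[int(i)] for i in idx)


word = sax([0, 0, 0, 0, 10, 10, 10, 10], n_segments=2)
```

    The Gaussian breakpoints are the step that assumes normally distributed values, which is the limitation the abstract says the authors' variants address.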
    QADAM: Quantization-Aware DNN Accelerator Modeling for Pareto-Optimality. (arXiv:2205.13045v1 [cs.AR])
    As the machine learning and systems communities strive to achieve higher energy-efficiency through custom deep neural network (DNN) accelerators and varied bit precision or quantization levels, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements (PE) into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QADAM, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration and Pareto-efficiency of DNN accelerators for various design choices such as bit precision, PE type, scratchpad sizes of PEs, global buffer size, number of total PEs, and DNN configurations. Our results show that different bit precisions and PE types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy vary by more than 5x and 35x, respectively. We also show that the proposed lightweight processing elements (LightPEs) consistently achieve Pareto-optimal results in terms of accuracy and hardware-efficiency. With the proposed framework, we show that LightPEs achieve on-par accuracy and up to 5.7x more performance per area and energy improvement when compared to the best INT16 based design.
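    As a reminder of what Pareto-optimality means for a set of design points, here is a minimal sketch of a Pareto filter. It is not QADAM's implementation; the function name and the example (accuracy, performance-per-area) tuples are invented for illustration, and both objectives are assumed to be maximized.

```python
def pareto_front(points):
    """Return the points that no other point dominates (all objectives maximized)."""
    front = []
    for p in points:
        dominated = any(
            all(q_i >= p_i for q_i, p_i in zip(q, p))
            and any(q_i > p_i for q_i, p_i in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (accuracy, performance-per-area) design points
designs = [(0.9, 1.0), (0.8, 2.0), (0.7, 1.5), (0.9, 0.5)]
front = pareto_front(designs)
```

    A design stays on the front only if improving one objective would require giving up some of the other.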
    RACE: A Reinforcement Learning Framework for Improved Adaptive Control of NoC Channel Buffers. (arXiv:2205.13130v1 [cs.AR])
    Network-on-chip (NoC) architectures rely on buffers to store flits to cope with contention for router resources during packet switching. Recently, reversible multi-function channel (RMC) buffers have been proposed to simultaneously reduce power and enable adaptive NoC buffering between adjacent routers. While adaptive buffering can improve NoC performance by maximizing buffer utilization, controlling the RMC buffer allocations requires a congestion-aware, scalable, and proactive policy. In this work, we present RACE, a novel reinforcement learning (RL) framework that utilizes better awareness of network congestion and a new reward metric ("falsefulls") to help guide the RL agent towards better RMC buffer control decisions. We show that RACE reduces NoC latency by up to 48.9%, and energy consumption by up to 47.1% against state-of-the-art NoC buffer control policies.
    Ranking the information content of distance measures. (arXiv:2104.15079v2 [stat.ML] UPDATED)
    Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using the fewest features but still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This in turn allows finding the most informative distance measure out of a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications are wide ranging in many branches of science.
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v3 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; a statistical learning machine (SLM) is the algorithm, function, model, or rule that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Cyberphysical systems such as automatic target recognition (ATR) in military applications, computer aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, stock market prediction, etc., are a few examples and applications of ML: diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of the threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. By \textit{construction} we mean designing or inventing an appropriate algorithm that learns from the input data and achieves a good performance according to some optimality criterion. By \textit{assessment} we mean attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v2 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    A Penalized Shared-parameter Algorithm for Estimating Optimal Dynamic Treatment Regimens. (arXiv:2107.07875v2 [stat.ML] UPDATED)
    A dynamic treatment regimen (DTR) is a set of decision rules to personalize treatments for an individual using their medical history. The Q-learning based Q-shared algorithm has been used to develop DTRs that involve decision rules shared across multiple stages of intervention. We show that the existing Q-shared algorithm can suffer from non-convergence due to the use of linear models in the Q-learning setup, and identify the condition in which Q-shared fails. Leveraging properties from expansion-constrained ordinary least-squares, we give a penalized Q-shared algorithm that not only converges in settings that violate the condition, but can outperform the original Q-shared algorithm even when the condition is satisfied. We give evidence for the proposed method in a real-world application and several synthetic simulations.
    Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations. (arXiv:2204.04773v2 [stat.ML] UPDATED)
    Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm consists of the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while study of the richer model of imperfectly observed contextual bandits is immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the above efficiency of Greedy policies is also provided.
    Improved Fine-Tuning by Better Leveraging Pre-Training Data. (arXiv:2111.12292v2 [cs.CV] UPDATED)
    As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that, in some vision tasks, training from scratch achieves final performance no worse than this pre-training strategy once the number of training samples is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis, using excess risk bounds, which are popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. The observation inspires us to leverage pre-training data for fine-tuning, since this data is also available for fine-tuning. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when the appropriate pre-training data is included in fine-tuning. With the theoretical motivation, we propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data selection based fine-tuning pipeline.
    Amplitude Mean of Functional Data on $\mathbb{S}^2$. (arXiv:2107.13721v4 [stat.ML] UPDATED)
    Manifold-valued functional data analysis (FDA) has recently become an active area of research, motivated by the rising availability of trajectories or longitudinal data observed on non-linear manifolds. The challenges of analyzing such data come from many aspects, including infinite dimensionality and nonlinearity, as well as time-domain or phase variability. In this paper, we study the amplitude part of manifold-valued functions on $\mathbb{S}^2$, which is invariant to random time warping or re-parameterization. Utilizing the nice geometry of $\mathbb{S}^2$, we develop a set of efficient and accurate tools for temporal alignment of functions, geodesic computing, and sample mean calculation. At their heart, these tools rely on gradient descent algorithms with carefully derived gradients. We show the advantages of these newly developed tools over their competitors with extensive simulations and real data, and demonstrate the importance of considering the amplitude part of functions instead of mixing it with phase variability in manifold-valued FDA.
    Lorentzian Fully Hyperbolic Generative Adversarial Network. (arXiv:2201.12825v2 [cs.LG] UPDATED)
    With the recent advance of deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. While a variety of hyperbolic neural network structures have been proposed, they mainly focus on discriminative tasks, and generative models in the hyperbolic space have scarcely been studied. In this work, we propose a hyperbolic generative adversarial network (GAN) within the Lorentz model for generating hyperbolic data. In addition to existing hyperbolic operations, we design novel hyperbolic layers to guarantee stable training. We first use synthetic data to show that our network is able to learn simple distributions in the hyperbolic space. Moreover, by virtue of an autoencoder, we construct a neural network model, named HAEGAN, for generating more complex data in the hyperbolic space. HAEGAN contains three parts: first, a hyperbolic autoencoder; second, a hyperbolic GAN for generating the latent embedding of the autoencoder; third, a generator that inherits the decoder from the autoencoder and the generator from the GAN. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v1 [q-bio.NC])
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v1 [stat.ML])
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension, are fixed while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. We identify in particular the existence of a statistical-to-computational gap between the algorithm, which requires a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha} $ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
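    The size of the gap can be read off by evaluating the two quoted thresholds numerically. The sketch below simply plugs values into the formulas from the abstract; the function name and the example parameters are hypothetical, the logarithm is assumed to be natural, and the information-theoretic expression is only stated as an approximation in the source.

```python
import math


def snr_thresholds(k, rho, alpha):
    """Evaluate the algorithmic and (approximate) information-theoretic thresholds.

    k: number of clusters, rho: fraction of non-zero components (0 < rho < 1),
    alpha: ratio of number of samples to dimension. Natural log is assumed.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    lam_alg = k / math.sqrt(alpha)
    lam_it = math.sqrt(-k * rho * math.log(rho)) / math.sqrt(alpha)
    return lam_alg, lam_it


lam_alg, lam_it = snr_thresholds(k=2, rho=0.1, alpha=1.0)
gap_ratio = lam_alg / lam_it  # equals sqrt(k / (-rho * log(rho))), independent of alpha
```

    Under these formulas the ratio does not depend on $\alpha$, and it grows as $\rho$ shrinks, since $-\rho \log \rho \to 0$.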
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v3 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.
    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts. (arXiv:2111.03394v2 [cs.LG] UPDATED)
    Long-range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State-of-the-art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic forecasting method that produces forecasts that are coherent in terms of base level and predicted aggregate statistics. We achieve the coherency between predicted base-level and aggregate statistics using a novel inference method based on KL-divergence that can be solved efficiently in closed form. We show that our method improves forecast performance across both base level and unseen aggregates post inference on real datasets spanning three diverse domains. (Project URL: https://github.com/pratham16cse/AggForecaster)
    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks. (arXiv:2108.11489v2 [stat.ML] UPDATED)
    The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.
    Factor selection in screening experiments by aggregation over random models. (arXiv:2205.13497v1 [stat.ME])
    Screening experiments are useful for screening out a small number of truly important factors from a large number of potentially important factors. The Gauss-Dantzig Selector (GDS) is often the preferred analysis method for screening experiments. Just considering main-effects models can result in erroneous conclusions, but including interaction terms, even if restricted to two-factor interactions, increases the number of model terms dramatically and challenges the GDS analysis. We propose a new analysis method, called Gauss-Dantzig Selector Aggregation over Random Models (GDS-ARM), which performs a GDS analysis on multiple models that include only some randomly selected interactions. Results from these different analyses are then aggregated to identify the important factors. We discuss the proposed method, suggest choices for the tuning parameters, and study its performance on real and simulated data.
    Unsupervised Learning From Incomplete Measurements for Inverse Problems. (arXiv:2201.12151v3 [stat.ML] UPDATED)
    In many real-world inverse problems, only incomplete measurement data are available for training which can pose a problem for learning a reconstruction function. Indeed, unsupervised learning using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the underlying signal model needed for reconstruction which indicate the interplay between the number of distinct measurement operators, the number of measurements per operator, the dimension of the model and the dimension of the signals. Furthermore, we propose a novel and conceptually simple unsupervised learning loss which only requires access to incomplete measurement data and achieves a performance on par with supervised learning when the sufficient condition is verified. We validate our theoretical bounds and demonstrate the advantages of the proposed unsupervised loss compared to previous methods via a series of experiments on various imaging inverse problems, such as accelerated magnetic resonance imaging, compressed sensing and image inpainting.
    When Doubly Robust Methods Meet Machine Learning for Estimating Treatment Effects from Real-World Data: A Comparative Study. (arXiv:2204.10969v2 [stat.ME] UPDATED)
    Observational cohort studies are increasingly being used for comparative effectiveness research to assess the safety of therapeutics. Recently, various doubly robust methods have been proposed for average treatment effect estimation by combining the treatment model and the outcome model via different vehicles, such as matching, weighting, and regression. The key advantage of doubly robust estimators is that they require only that either the treatment model or the outcome model be correctly specified to obtain a consistent estimator of average treatment effects, and therefore lead to a more accurate and often more precise inference. However, little work has been done to understand how doubly robust estimators differ due to their unique strategies of using the treatment and outcome models and how machine learning techniques can be combined with these estimators to boost their performance. Also, little has been understood about the effects of covariate selection, overlap of covariate distributions, and treatment effect heterogeneity on the performance of these doubly robust estimators. Here we examine multiple popular doubly robust methods in the categories of matching, weighting, or regression, and compare their performance using different treatment and outcome modeling via extensive simulations and a real-world application. We found that incorporating machine learning with doubly robust estimators such as the targeted maximum likelihood estimator gives the best overall performance. Practical guidance on how to apply doubly robust estimators is provided.
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v1 [cs.LG])
    Selective classification is the task of rejecting inputs a model would predict incorrectly on through a trade-off between input space coverage and model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks. For example, we improve coverage on CIFAR-10/SVHN by 10.1%/1.5% respectively at a fixed target error of 0.5%.
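    The idea of tracking when intermediate predictions stop disagreeing with the final label can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the array layout (one row of predicted labels per saved checkpoint, last row being the final model), and the hard threshold rule are assumptions.

```python
import numpy as np


def last_disagreement(checkpoint_preds):
    """For each input, return the 1-based index of the last checkpoint whose
    predicted label differs from the final model's label (0 if none ever differs).

    checkpoint_preds: array of shape (n_checkpoints, n_inputs); last row = final model.
    """
    preds = np.asarray(checkpoint_preds)
    disagree = preds != preds[-1]                      # (T, N) boolean mask
    steps = np.arange(1, preds.shape[0] + 1)[:, None]  # 1-based checkpoint indices
    return (disagree * steps).max(axis=0)


def accept_mask(checkpoint_preds, threshold):
    """Accept inputs whose last disagreement happened no later than `threshold`."""
    return last_disagreement(checkpoint_preds) <= threshold


# Five checkpoints, three test inputs (labels are made up)
preds = [
    [0, 1, 2],
    [1, 1, 0],
    [1, 1, 2],
    [1, 0, 2],
    [1, 1, 2],  # final model
]
scores = last_disagreement(preds)
accepted = accept_mask(preds, threshold=2)
```

    Inputs whose label still flips late in training receive a high score and are rejected first; sweeping the threshold trades coverage against accuracy.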
    Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation. (arXiv:2103.11802v4 [cs.LG] UPDATED)
    In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
    Training ReLU networks to high uniform accuracy is intractable. (arXiv:2205.13531v1 [cs.LG])
    Statistical learning theory provides bounds on the number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. As a corollary we conclude that the training of ReLU neural networks to high uniform accuracy is intractable. In a security-critical context this points to the fact that deep learning based systems are prone to being fooled by a possible adversary. We corroborate our theoretical findings by numerical results.
    Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. (arXiv:2205.11508v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, the theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v3 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    Censored Quantile Regression Neural Networks. (arXiv:2205.13496v1 [stat.ML])
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable 'self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
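    As background for the "grid of quantiles output by a single NN", the sketch below shows the standard pinball (quantile) loss averaged over a grid of quantile levels. It is an assumption-laden illustration only: it ignores censoring entirely and is not the algorithm proposed in the paper. The function name and array layout are hypothetical.

```python
import numpy as np

def pinball_loss(y, q_pred, taus):
    """Mean pinball loss over samples and quantile levels (no censoring).

    y      : (n,)   observed targets
    q_pred : (n, k) predicted quantiles, one column per level in `taus`
    taus   : (k,)   quantile levels in (0, 1)
    """
    y = np.asarray(y, dtype=float)[:, None]        # (n, 1)
    q = np.asarray(q_pred, dtype=float)            # (n, k)
    taus = np.asarray(taus, dtype=float)[None, :]  # (1, k)
    u = y - q                                      # positive when under-predicting
    # rho_tau(u) = max(tau * u, (tau - 1) * u)
    return float(np.maximum(taus * u, (taus - 1.0) * u).mean())

# Under-prediction is penalised more heavily at a high quantile level.
under = pinball_loss([1.0], [[0.0]], [0.9])  # u = +1
over = pinball_loss([1.0], [[2.0]], [0.9])   # u = -1
```

    Minimising this loss for each column pushes that output towards the corresponding conditional quantile; handling censored targets requires the additional machinery the abstract describes.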
    A framework for overparameterized learning. (arXiv:2205.13507v1 [cs.LG])
    An explanation for the success of deep neural networks is a central question in theoretical machine learning. According to classical statistical learning, the overparameterized nature of such models should imply a failure to generalize. Many argue that good empirical performance is due to the implicit regularization of first order optimization methods. In particular, the Polyak-Łojasiewicz condition leads to gradient descent finding a global optimum that is close to initialization. In this work, we propose a framework consisting of a prototype learning problem, which is general enough to cover many popular problems and even the cases of infinitely wide neural networks and infinite data. We then perform an analysis from the perspective of the Polyak-Łojasiewicz condition. We obtain theoretical results of independent interest, concerning gradient descent on a composition $(f \circ F): G \to \mathbb{R}$ of functions $F: G \to H$ and $f: H \to \mathbb{R}$ with $G, H$ being Hilbert spaces. Building on these results, we determine the properties that have to be satisfied by the components of the prototype problem for gradient descent to find a global optimum that is close to initialization. We then demonstrate that supervised learning, variational autoencoders and training with gradient penalty can be translated to the prototype problem. Finally, we lay out a number of directions for future research.
    RKHS-SHAP: Shapley Values for Kernel Methods. (arXiv:2110.09167v2 [stat.ML] UPDATED)
    Feature attribution for kernel methods is often heuristic and not individualised for each prediction. To address this, we turn to the concept of Shapley values (SV), a coalition game theoretical framework that has previously been applied to different machine learning model interpretation tasks, such as linear models, tree ensembles and deep networks. By analysing SVs from a functional perspective, we propose RKHS-SHAP, an attribution method for kernel machines that can efficiently compute both Interventional and Observational Shapley values using kernel mean embeddings of distributions. We show theoretically that our method is robust with respect to local perturbations - a key yet often overlooked desideratum for consistent model interpretation. Further, we propose a Shapley regulariser, applicable to a general empirical risk minimisation framework, allowing learning while controlling the level of specific features' contributions to the model. We demonstrate that the Shapley regulariser enables learning which is robust to covariate shift of a given feature and fair learning which controls the SVs of sensitive features.
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v3 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are indeed able to satisfy the first two factors. Moreover, we empirically verify the third factor by conducting various experiments on the real-world dataset, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v4 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap and cross validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of an ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.
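    To make the "two-sample statistic" remark concrete, here is a minimal sketch of the AUC computed as a Mann-Whitney statistic over positive and negative scores, followed by a simple bootstrap of that statistic. It is a simplification under stated assumptions: the classifier is held fixed and only its test scores are resampled, whereas the resampling schemes reviewed in the chapter also repeat training. Function names and defaults are hypothetical.

```python
import numpy as np

def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]   # (n_pos, 1)
    neg = np.asarray(neg_scores, dtype=float)[None, :]   # (1, n_neg)
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def bootstrap_auc(pos_scores, neg_scores, n_boot=200, seed=0):
    """Bootstrap replicates of the AUC, resampling each class separately
    because the AUC is a two-sample statistic."""
    rng = np.random.default_rng(seed)
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        p = rng.choice(pos, size=pos.size, replace=True)
        n = rng.choice(neg, size=neg.size, replace=True)
        reps[b] = auc_mann_whitney(p, n)
    return reps

auc = auc_mann_whitney([0.9, 0.8], [0.1, 0.85])   # 3 of 4 pairs ordered correctly
reps = bootstrap_auc([0.9, 0.8], [0.1, 0.85])
spread = reps.std()                               # rough variability estimate
```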
    Epistemic Neural Networks. (arXiv:2107.08924v3 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the epistemic neural network (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension. (arXiv:2205.13303v1 [stat.ML])
    While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high dimensions have the same minimum training loss as Gaussian data with corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v1 [cs.LG])
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over $T$ rounds ($NT$ actions in total) the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is at most $\tilde{\mathcal{O}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of both regret and communication cost. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their immediate neighbors through a carefully designed consensus procedure.
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v1 [cs.LG])
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild. Our extensive experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v1 [cs.LG])
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason is that most RPEs are placed in the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v1 [cs.LG])
    It is well-known that the worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of time steps (ignoring the dependency on sparsity). On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th time step. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime ($\sigma_t = \Omega(1)$) and the benign deterministic regime ($\sigma_t = 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.
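    The interpolation claim can be checked by substituting the two extreme cases into the stated bound. The constant $\sigma$ below is introduced only for this illustration and does not appear in the abstract.

$$
\sigma_t = \sigma \ \text{for all } t
\;\Longrightarrow\;
\sqrt{d\sum_{t=1}^{T}\sigma_t^2} + 1 = \sigma\sqrt{dT} + 1,
\qquad
\sigma_t = 0 \ \text{for all } t
\;\Longrightarrow\;
\sqrt{d\sum_{t=1}^{T}\sigma_t^2} + 1 = 1 .
$$

    With $\sigma$ a fixed positive constant, the first case matches the worst-case $\widetilde{\Theta}\left(\sqrt{dT}\right)$ rate up to logarithmic factors, and the second matches the $\widetilde{\mathcal O}(1)$ regret of the noiseless setting.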
    Uniform Generalization Bound on Time and Inverse Temperature for Gradient Descent Algorithm and its Application to Analysis of Simulated Annealing. (arXiv:2205.12959v1 [cs.LG])
    In this paper, we propose a novel uniform generalization bound on the time and inverse temperature for stochastic gradient Langevin dynamics (SGLD) in a non-convex setting. While previous works derive their generalization bounds by uniform stability, we use Rademacher complexity to make our generalization bound independent of the time and inverse temperature. Using Rademacher complexity, we can reduce the problem of deriving a generalization bound on the whole space to that of deriving one on a bounded region, and can therefore remove the effect of the time and inverse temperature from our generalization bound. As an application of our generalization bound, we also evaluate the effectiveness of simulated annealing in a non-convex setting. For the sample size $n$ and time $s$, we derive evaluations with orders $\sqrt{n^{-1} \log (n+1)}$ and $|(\log)^4(s)|^{-1}$, respectively. Here, $(\log)^4$ denotes the four-fold composition of the logarithmic function.
    Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback. (arXiv:2205.13451v1 [cs.LG])
    We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
    Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency. (arXiv:2205.13476v1 [cs.LG])
    Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
    Fair Representation Learning through Implicit Path Alignment. (arXiv:2205.13316v1 [cs.LG])
    We consider a fair representation learning perspective, where optimal predictors, on top of the data representation, are ensured to be invariant with respect to different sub-groups. Specifically, we formulate this intuition as a bi-level optimization, where the representation is learned in the outer-loop, and invariant optimal group predictors are updated in the inner-loop. Moreover, the proposed bi-level objective is demonstrated to fulfill the sufficiency rule, which is desirable in various practical scenarios but has not been commonly studied in fair learning. Besides, to avoid the high computational and memory cost of differentiating in the inner-loop of the bi-level objective, we propose an implicit path alignment algorithm, which only relies on the solution of the inner optimization and the implicit differentiation rather than the exact optimization path. We further analyze the error gap of the implicit approach and empirically validate the proposed method in both classification and regression settings. Experimental results show a consistently better trade-off between prediction performance and fairness measurement.
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v1 [cs.LG])
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over the conventional CO solvers as DRL-NCO is capable of learning CO solvers without supervised labels obtained from a verified solver. This paper presents a novel training scheme, Sym-NCO, that achieves significant performance improvements over existing DRL-NCO methods. Sym-NCO is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Imposing symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO as symmetricities are invariant features shared by certain CO tasks. Our experimental results verify that our Sym-NCO greatly improves the performance of DRL-NCO methods in four CO tasks, including traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without employing problem-specific techniques. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, the iterative local search (ILS), in PCTSP at 240 times faster speed.
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v1 [cs.LG])
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to iterate over the input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which generalizes active learning based on partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over number of samples. We illustrate our technique in depth for robust regression.
    Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model. (arXiv:2205.13085v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.
    Factorized Structured Regression for Large-Scale Varying Coefficient Models. (arXiv:2205.13080v1 [stat.ML])
    Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.
    On Learning Mixture of Linear Regressions in the Non-Realizable Setting. (arXiv:2205.13166v1 [stat.ML])
    While mixture of linear regressions (MLR) is a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, prediction and loss are not well-defined in the context of mixtures. In this paper, first we show that MLR can be used for prediction where instead of predicting a label, the model predicts a list of values (also known as list-decoding). The list size is equal to the number of components in the mixture, and the loss function is defined to be the minimum among the losses incurred by all the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves small probability of prediction error. This calls for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works in MLR focus on the realizable setting, i.e., recovery of parameters when data is probabilistically generated by a mixed linear (noisy) model. In this paper we show that a version of the popular alternating minimization (AM) algorithm finds the best fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in polynomial time in the number of datapoints, and recovers a good approximation of the best fit lines. The two algorithms are experimentally compared.
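    The min-over-components loss and one round of alternating minimization can be sketched as follows. This is a textbook-style AM step under stated assumptions (squared loss, no intercept, ordinary least squares refit); the specific AM variant, regularity conditions, and initialization analysed in the paper are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def list_loss(X, y, W):
    """Empirical risk where each point is charged the smallest squared
    error among the k component predictions (one row of W per component)."""
    resid = y[:, None] - X @ W.T                 # (n, k) residuals
    return float((resid ** 2).min(axis=1).mean())

def am_step(X, y, W):
    """One alternating-minimization round: assign every point to its
    best-fitting component, then refit each component by least squares."""
    resid = y[:, None] - X @ W.T
    assign = (resid ** 2).argmin(axis=1)         # best component per point
    W_new = W.copy()
    for j in range(W.shape[0]):
        mask = assign == j
        if mask.any():                           # keep old weights if no points
            W_new[j] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    return W_new

# Toy data drawn from two lines, y = 2x and y = -x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, -3.0, -4.0])
W0 = np.array([[1.5], [-0.5]])                   # rough initial guesses
loss_before = list_loss(X, y, W0)
W1 = am_step(X, y, W0)
loss_after = list_loss(X, y, W1)
```

    On this toy example the initial guesses already separate the two groups, so a single round recovers slopes close to 2 and -1; in general the outcome depends on the initialization.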
    Learning the spatio-temporal relationship between wind and significant wave height using deep learning. (arXiv:2205.13325v1 [stat.ML])
    Ocean wave climate has a significant impact on near-shore and off-shore human activities, and its characterisation can help in the design of ocean structures such as wave energy converters and sea dikes. Therefore, engineers need long time series of ocean wave parameters. Numerical models are a valuable source of ocean wave data; however, they are computationally expensive. Consequently, statistical and data-driven approaches have gained increasing interest in recent decades. This work investigates the spatio-temporal relationship between North Atlantic wind and significant wave height (Hs) at an off-shore location in the Bay of Biscay, using a two-stage deep learning model. The first step uses convolutional neural networks (CNNs) to extract the spatial features that contribute to Hs. Then, long short-term memory (LSTM) is used to learn the long-term temporal dependencies between wind and waves.
    Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (arXiv:2205.13098v1 [cs.LG])
    The computation of Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.  ( 2 min )
    Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions. (arXiv:2205.13056v1 [stat.ML])
    Due to the drastic gap in complexity between sequential and batch statistical learning, recent work has studied a smoothed sequential learning setting, where Nature is constrained to select contexts with density bounded by 1/{\sigma} with respect to a known measure {\mu}. Unfortunately, for some function classes, there is an exponential gap between the statistically optimal regret and that which can be achieved efficiently. In this paper, we give a computationally efficient algorithm that is the first to enjoy the statistically optimal log(T/{\sigma}) regret for realizable K-wise linear classification. We extend our results to settings where the true classifier is linear in an over-parameterized polynomial featurization of the contexts, as well as to a realizable piecewise-regression setting assuming access to an appropriate ERM oracle. Somewhat surprisingly, standard disagreement-based analyses are insufficient to achieve regret logarithmic in 1/{\sigma}. Instead, we develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers. Along the way, we develop numerous technical tools of independent interest, including a general anti-concentration bound for the determinant of certain matrix averages.  ( 2 min )
    Classification ensembles for multivariate functional data with application to mouse movements in web surveys. (arXiv:2205.13380v1 [stat.ME])
    We propose new ensemble models for multivariate functional data classification as combinations of semi-metric-based weak learners. Our models extend current semi-metric-type methods from the univariate to the multivariate case, propose new semi-metrics to compute distances between functions, and consider more flexible options for combining weak learners using stacked generalisation methods. We apply these ensemble models to identify respondents' difficulty with survey questions, with the aim to improve survey data quality. As predictors of difficulty, we use mouse movement trajectories from the respondents' interaction with a web survey, in which several questions were manipulated to create two scenarios with different levels of difficulty.  ( 2 min )
    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification. (arXiv:2205.13094v1 [cs.LG])
    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.  ( 2 min )
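    The undersampling baseline itself is simple to state in code. The sketch below assumes the most common variant, in which every group is randomly cut down to the size of the smallest one; the function name, the `group_of` callback, and the toy dataset are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def undersample(examples, group_of, seed=0):
    """Randomly drop examples so every group is as small as the smallest one."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for ex in examples:
        by_group[group_of(ex)].append(ex)
    target = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, target))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Toy label-shift data: 90 majority examples (label 0), 10 minority (label 1).
dataset = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]
balanced = undersample(dataset, group_of=lambda ex: ex[1])
```

    Here 80 of the 90 majority examples are discarded, which is the behavior the abstract calls surprising: the balanced set is limited by the minority count.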
    Preference Dynamics Under Personalized Recommendations. (arXiv:2205.13026v1 [cs.LG])
    Many projects (both practical and academic) have designed algorithms to match users to content they will enjoy under the assumption that users' preferences and opinions do not change with the content they see. Evidence suggests that individuals' preferences are directly shaped by what content they see -- radicalization, rabbit holes, polarization, and boredom are all example phenomena of preferences affected by content. Polarization in particular can occur even in ecosystems with "mass media," where no personalization takes place, as recently explored in a natural model of preference dynamics by~\citet{hkazla2019geometric} and~\citet{gaitonde2021polarization}. If all users' preferences are drawn towards content they already like, or are repelled from content they already dislike, uniform consumption of media leads to a population of heterogeneous preferences converging towards only two poles. In this work, we explore whether some phenomenon akin to polarization occurs when users receive \emph{personalized} content recommendations. We use a similar model of preference dynamics, where an individual's preferences move towards content they consume and enjoy, and away from content they consume and dislike. We show that standard user reward maximization is an almost trivial goal in such an environment (a large class of simple algorithms will achieve only constant regret). A more interesting objective, then, is to understand under what conditions a recommendation algorithm can ensure stationarity of a user's preferences. We show how to design content recommendations that can achieve approximate stationarity, under mild conditions on the set of available content, when a user's preferences are known, and how one can learn enough about a user's preferences to implement such a strategy even when user preferences are initially unknown.  ( 2 min )

  • Open

    [P] Looking for datasets containing slide decks / powerpoint presentations
    I'm currently working on a project that aims to generate summaries of slide decks / powerpoint presentations. I'm looking to augment the dataset that I have with some additional slides (hopefully but not necessarily paired with summaries), and wondered if anyone here has come across a dataset I could use. My initial thoughts were to use research paper abstracts paired with conference presentation slides. I was able to collect such a dataset from past ICML publications; however, many of the slides are overly technical and lack natural language (e.g. presentations consisting of nothing but equations). Additionally, these presentations are quite a bit out of distribution with respect to the domain I'd be applying it to (business). So, my question: does anyone know of any publicly available datasets that are composed of slides/presentations, or of any other resources like conferences (particularly in business oriented fields, e.g., finance, marketing, business administration, operations management) that have easily accessible pairs of slide presentations and summaries (either paper abstracts or other forms)? Thanks in advance! submitted by /u/mamossa [link] [comments]  ( 1 min )
    [D] Understanding the fundamental maths vs just use existing Python libraries?
    I'm interested in developing trading algorithms (not bots) using machine learning techniques. I'm a very experienced software developer, I've worked in trading for many years and I am mathematical (couple of masters degrees in engineering; did Fourier Analysis, Laplace, autocorrelation, statistics etc) but haven't used it for years. Whilst looking for machine learning books there appear to be two approaches: (1) Practical: use existing Python libraries (Hands on Machine Learning with Scikit......). (2) Theory: understand the fundamental statistics (The Elements of Statistical Learning). Do I need to understand the fundamental statistics, and what is the advantage of doing so? Is it possible to simply use existing ML libraries, configure parameters and run simulations? If I need to teach myself it, I will. But if there's very small benefit it's better I invest my time elsewhere. Any additional advice would be greatly appreciated. I'm not sure what holding-period yet (HFT or statistical arbitrage) but please feel free to distinguish if the answer differs. submitted by /u/reddit_faa7777 [link] [comments]  ( 2 min )
    [R] An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems - Google 2022 - Jeff Dean
    Paper: https://arxiv.org/abs/2205.12755 Abstract: "Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. Though, state of the art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, that adds the temporal aspect to multitask, is often focused to the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component to build the next generation artificial intelligence. We propose an evolutionary method that can generate a large scale multitask model, and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as cifar10: 99.43%." https://www.youtube.com/watch?v=Pcin4hPGaOk submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [D] How can we approach this problem ?
    You have a combined dataset consisting of 10 component datasets collected from 10 different sources. Independent models trained separately on each component dataset perform well on hold-out examples from that dataset. However, the aggregated model trained by combining the examples from all component datasets behaves weirdly. On hold-out examples from some component datasets, the aggregated model performs better than the independent models. On others, it performs worse than the independent models. During deployment, you expect to see input examples from these 10 component sources but also from many other sources which the model has not been trained on. What approach will you take to develop a model that will generalize well to examples from the seen and also the yet-unseen sources? submitted by /u/corporatededmeat [link] [comments]  ( 1 min )
    [R] New datasets for StyleGAN
    Hi all, The Author is here. TL;DR: We show how StyleGAN can be adapted to raw unaligned images collected from the Internet. New datasets and models are available. How can we adapt StyleGAN to more complicated datasets? We have witnessed that a data-centric approach is the most effective. Raw image collections downloaded from the internet contain many outlier images and are characterized by a multi-modal distribution. Therefore, we perform automatic self-supervised filtering of the training data to remove the outliers. Our key idea is to use the generator itself for the filtering. In the second step, we employ a multi-modal variant of the StyleGAN truncation trick. This allows high quality generation while preserving the remarkable editing capabilities of StyleGAN. For more details and cool gifs, check our Project Page: https://self-distilled-stylegan.github.io/ Datasets and models: https://github.com/self-distilled-stylegan/self-distilled-internet-photos The datasets also can be directly downloaded: https://github.com/rmokady/SDIP_utils Demo for image generation: https://huggingface.co/spaces/hysts/Self-Distilled-StyleGAN Feel free to ask anything that comes to your mind. submitted by /u/RonMokady [link] [comments]  ( 1 min )
    [R] CNNs are Myopic
    submitted by /u/downtownslim [link] [comments]  ( 3 min )
    [R] Large Language Models are Zero-Shot Reasoners. My summary: Adding text such as "Let’s think step by step" to a prompt "elicits chain of thought from large language models across a variety of reasoning tasks".
    submitted by /u/Wiskkey [link] [comments]  ( 4 min )
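    The prompting trick summarized in the title above amounts to appending a fixed trigger phrase to the question. A minimal sketch follows; the `Q:`/`A:` template and the function name are assumptions for illustration, and no model is called.

```python
TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question):
    """Build a prompt that ends with the reasoning trigger phrase."""
    return f"Q: {question.strip()}\nA: {TRIGGER}"


prompt = zero_shot_cot_prompt("A farmer has 3 pens with 4 sheep each. How many sheep? ")
```

    The resulting string would be sent to a language model as-is; the model's continuation is then expected to contain intermediate reasoning before the answer.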
    [D] Semantic Segmentation/Remote Sensing Challenges
    Does anyone know of any interesting Semantic Segmentation and/or Remote Sensing competition taking place this summer? Most of what I found ends in the next 1-2 weeks. submitted by /u/incognitoacnt [link] [comments]
    [News] New Anomaly Advisor tab on Netdata
    Hello everyone, I work on the development of a host monitoring system called Netdata, a free and open source software. I found it interesting to present here the new anomalies tab which shows high anomaly rates on all its nodes using machine learning. We made a post on our blog about it in more detail. https://www.netdata.cloud/blog/introducing-anomaly-advisor-unsupervised-anomaly-detection-in-netdata/ submitted by /u/roeant [link] [comments]
  • Open

    All ML and AI E-books for $10 Sale!
    submitted by /u/alimhabidi [link] [comments]
    How do leading AIs do on human IQ tests in 2022?
    Could not find an answer to this question online. submitted by /u/ShallowStroker [link] [comments]  ( 1 min )
    In this article, we show how to label PDFs and scanned images using OCR in order to create your training dataset for your NLP application.
    submitted by /u/UBIAI [link] [comments]  ( 1 min )
    Corporate America bets on AI productivity
    submitted by /u/mr_j_b [link] [comments]
    Latest Artificial Intelligence Technologies — AI
    submitted by /u/mr_j_b [link] [comments]
    HPE is building a rapid AI supercomputer powered by the world’s largest CPU
    submitted by /u/mr_j_b [link] [comments]
    FUTURE TECH ASIA China and Europe are leading the push to regulate A.I. — one of them could set the global playbook
    submitted by /u/mr_j_b [link] [comments]  ( 1 min )
    Could AI Image-Generation technology like DALL-E and Imagen be the future of messaging?
    I feel like the application of technologies like DALL-E and Imagen would be perfect for everyday communication with friends and family, in the same way emojis and images and gifs enhance messaging today. Imagine an in-app messaging feature for iMessage or WhatsApp which allows you to type a prompt then scroll through various AI-generated images or gifs that fit the prompt to send to your friends and family rather than the emojis and gifs that most of us use now. Over recent years we have adapted to send emojis and images and gifs to express complex emotions and sentiments that would take lots of text to accomplish, but they’re still approximations because we are relying on an existing set of created images and videos. I imagine made-to-measure images and videos could be a type of casual communication in the future. submitted by /u/Psychadiculous [link] [comments]  ( 1 min )
    Microsoft AI Researchers Introduce (De)ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
    While there are several benefits to using Artificial Intelligence, there are also drawbacks to this cutting-edge technology. One example is the creation of inappropriate language by language models. Because these models are trained on massive amounts of data, inappropriate language may be learned due to its presence in the training data. Content moderation techniques can flag or filter such language only in specific cases; however, the datasets used to train these programs frequently fail to capture the complexity of potentially unsuitable and toxic language, particularly hate speech. Furthermore, the neutral samples in these datasets seldom contain group references. As a result, tools may flag even neutral language that refers to a minority identity group as hate speech. A dataset needs to be created for training content moderation algorithms that can better detect implicitly harmful material, inspired by big language models’ capacity to emulate the tone, style, and vocabulary of cues they receive, whether toxic or benign. Continue Reading | Check out the paper, Microsoft blog and Github codes submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI-art isn't art
    submitted by /u/estasfuera [link] [comments]  ( 14 min )
    New framework for firms to ensure AI services are reliable, safe
    submitted by /u/kuang89 [link] [comments]  ( 1 min )
    AI Dream 53 - Cosmic Birth | 300 SUBS CELEBRATION
    submitted by /u/LordPewPew777 [link] [comments]
  • Open

    Help needed
    Please, does anyone have experience in using Gaussian processes to update q values instead of a deep learning network in an RL setting? I have separate codes for both but I need some help in merging them for a project. I will be happy to pay for the consultation if needed and feel free to give a referral if you know a friend or have access to some helpful resources. This RL is in an off-policy setting and for a static dataset. submitted by /u/Thin-Ad9581 [link] [comments]  ( 1 min )
    Classic control from pixels
    Hey all I'm trying to play around with off-policy learning from pixels/images. To start easy, I thought it would be a good idea to take a classic control environment like the Pendulum-v1 and try to solve it from pixels directly. To do this, I decided to wrap the gym env in a custom wrapper which seems to be working for all intents and purposes: class ImgObservationWrapper(Wrapper): def __init__(self, env): super(ImgObservationWrapper, self).__init__(env) self.reset() dummy_obs = self.process_img(env.render(mode="rgb_array")) self.observation_space = spaces.Box( low=0, high=255, shape=dummy_obs.shape, dtype=dummy_obs.dtype ) def reset(self, **kwargs): obs = self.env.reset(**kwargs) obs = self.process_img(self.env.render(mode="rgb_array")) return obs def process_img(self, observation, crop…  ( 1 min )
    Help with MADDPG on Food Collector (Unity ML-Agents)
    Can somebody help me with MADDPG? I am trying to apply it to Food Collector (Unity ML-Agents), and my model is constantly converging to action extremes (either -1 or 1) after only a couple of steps. There are 5 agents, 40x40x5 observations, 3 continuous actions and 1 discrete action. The buffer size is 10k, and each agent takes random actions for the first 10k steps. https://preview.redd.it/9gon27amsu191.png?width=725&format=png&auto=webp&s=ebe9f3552b9e39b2f480cb01ad35cd6e3ae21092 These are the actions of agent #1, so you can see an example of the model quickly converging to -1 or 1 (in this case -1, -1, 1). submitted by /u/TheGuy839 [link] [comments]  ( 1 min )
  • Open

    Breakthrough Google AI Text to Image Imagen Beats Dalle-2 With Unprecedented Photorealism and Deep Level of Language Understanding
    submitted by /u/getrich_or_diemining [link] [comments]
  • Open

    How AI Impacts the Future of Human Capital Management in 2022
    How AI Impacts the Future of Human Capital Management in 2022  ( 5 min )
  • Open

    A Devotion to Emotion: Hume AI’s Alan Cowen on the Intersection of AI and Empathy
    Can machines experience emotions? They might, according to Hume AI, an AI research lab and technology company that aims to “ensure artificial intelligence is built to serve human goals and emotional well-being.” So how can AI genuinely understand how we are feeling, and respond appropriately? On this episode of NVIDIA’s AI Podcast, host Noah Kravitz Read article > The post A Devotion to Emotion: Hume AI’s Alan Cowen on the Intersection of AI and Empathy appeared first on NVIDIA Blog.  ( 2 min )
    Ready, Set, Game: GFN Thursday Brings 10 New Titles to GeForce NOW
    It’s a beautiful day to play video games. And it’s GFN Thursday, which means we’ve got those games. Ten total titles join the GeForce NOW library of over 1,300 games, starting with the release of Roller Champions – a speedy, free-to-play roller skating title launching with competitive season 0. Rollin’ Into the Weekend Roll with Read article > The post Ready, Set, Game: GFN Thursday Brings 10 New Titles to GeForce NOW appeared first on NVIDIA Blog.  ( 2 min )
    Deciphering the Future: HPE Switches on AI Supercomputer in France
    Recalling the French linguist who deciphered the Rosetta Stone 200 years ago, Hewlett Packard Enterprise today switched on a tool to unravel its customers’ knottiest problems. The Champollion AI supercomputer takes its name from Jean-François Champollion (1790-1832), who decoded hieroglyphics that opened a door to study of ancient Egypt’s culture. Like Champollion, the mega-system resides Read article > The post Deciphering the Future: HPE Switches on AI Supercomputer in France appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Interactive Packaging: How to Make Packaging Smarter with AI and IoT
    Today, packaging is not just about covering a product to sell it better. Interactive packaging is a new trend in the packaging industry that focuses mainly on customer satisfaction and engagement. BLE codes, AI, and IoT are key technologies in interactive packaging, helping users with product attributes and usage instructions. Also, Incorporation… Read More »Interactive Packaging: How to Make Packaging Smarter with AI and IoT The post Interactive Packaging: How to Make Packaging Smarter with AI and IoT appeared first on Data Science Central.  ( 2 min )
    How to Choose the Best Salon CRM Software
    As you may know, everyone wants to look good, or better than everyone else. To look prettier, cooler, smarter, and bolder, people go to salons, because a salon takes care of your outer beauty and makes you look better than before. They use different products on your body and which… Read More »How to Choose the Best Salon CRM Software The post How to Choose the Best Salon CRM Software appeared first on Data Science Central.  ( 5 min )
  • Open

    Deep interpretable ensembles. (arXiv:2205.12729v1 [stat.ML])
    Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.  ( 2 min )
    MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning. (arXiv:2205.12449v1 [cs.LG])
    Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable RL has shown promise in extracting more interpretable decision tree-based policies, but only in the single-agent setting. To fill this gap, we propose the first set of interpretable MARL algorithms that extract decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER can learn high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.  ( 2 min )
    ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data. (arXiv:2205.12600v1 [cs.CL])
    Large pretrained language models have been performing increasingly well in a variety of downstream tasks via prompting. However, it remains unclear from where the model learns the task-specific knowledge, especially in a zero-shot setup. In this work, we want to find evidence of the model's task-specific competence from pretraining and are specifically interested in locating a very small subset of pretraining data that directly supports the model in the task. We call such a subset supporting data evidence and propose a novel method ORCA to effectively identify it, by iteratively using gradient information related to the downstream task. This supporting data evidence offers interesting insights about the prompted language models: in the tasks of sentiment analysis and textual entailment, BERT shows a substantial reliance on BookCorpus, the smaller corpus of BERT's two pretraining corpora, as well as on pretraining examples that mask out synonyms to the task verbalizers.  ( 2 min )
    Online Metro Origin-Destination Prediction via Heterogeneous Information Aggregation. (arXiv:2107.00946v5 [cs.LG] UPDATED)
    Metro origin-destination prediction is a crucial yet challenging time-series analysis task in intelligent transportation systems, which aims to accurately forecast two specific types of cross-station ridership, i.e., Origin-Destination (OD) ridership and Destination-Origin (DO) ridership. However, complete OD matrices of previous time intervals cannot be obtained immediately in online metro systems, and conventional methods only use limited information to forecast the future OD and DO ridership separately. In this work, we propose a novel neural network module termed Heterogeneous Information Aggregation Machine (HIAM), which fully exploits heterogeneous information of historical data (e.g., incomplete OD matrices, unfinished order vectors, and DO matrices) to jointly learn the evolutionary patterns of OD and DO ridership. Specifically, an OD modeling branch estimates the potential destinations of unfinished orders explicitly to complement the information of incomplete OD matrices, while a DO modeling branch takes DO matrices as input to capture the spatial-temporal distribution of DO ridership. Moreover, a Dual Information Transformer is introduced to propagate the mutual information among OD features and DO features for modeling the OD-DO causality and correlation. Based on the proposed HIAM, we develop a unified Seq2Seq network to forecast the future OD and DO ridership simultaneously. Extensive experiments conducted on two large-scale benchmarks demonstrate the effectiveness of our method for online metro origin-destination prediction. Our code is released at https://github.com/HCPLab-SYSU/HIAM.  ( 2 min )
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v1 [stat.ML])
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is one of the most effective approaches to defend against such examples. We show that for linear regression problems, adversarial training can be formulated as a convex problem. This fact is then used to show that $\ell_\infty$-adversarial training produces sparse solutions and has many similarities to the lasso method. Similarly, $\ell_2$-adversarial training has similarities with ridge regression. We use a robust regression framework to analyze and understand these similarities and also point to some differences. Finally, we show how adversarial training behaves differently from other regularization methods when estimating overparameterized models (i.e., models with more parameters than datapoints). It minimizes a sum of three terms which regularizes the solution, but unlike lasso and ridge regression, it can sharply transition into an interpolation mode. We show that for sufficiently many features or sufficiently small regularization parameters, the learned model perfectly interpolates the training data while still exhibiting good out-of-sample performance.  ( 2 min )
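    The link between $\ell_\infty$-adversarial training and the lasso mentioned above can be made concrete with a standard identity for linear models: the worst-case squared error under a perturbation of size at most eps in the $\ell_\infty$ norm equals (|y - x·beta| + eps·||beta||_1)^2, so an $\ell_1$ penalty appears inside the loss. The sketch below checks that identity numerically; the function names and numbers are illustrative assumptions, not the paper's code.

```python
from itertools import product


def adversarial_loss_closed_form(x, y, beta, eps):
    """(|y - x.beta| + eps * ||beta||_1)^2, the worst-case squared error."""
    residual = y - sum(xi * bi for xi, bi in zip(x, beta))
    return (abs(residual) + eps * sum(abs(bi) for bi in beta)) ** 2


def adversarial_loss_brute_force(x, y, beta, eps):
    """Search the corners of the l_inf ball, where a convex loss is maximized."""
    worst = 0.0
    for signs in product((-1.0, 1.0), repeat=len(x)):
        perturbed = [xi + eps * s for xi, s in zip(x, signs)]
        residual = y - sum(pi * bi for pi, bi in zip(perturbed, beta))
        worst = max(worst, residual ** 2)
    return worst


x, y, beta, eps = [1.0, -2.0, 0.5], 0.7, [0.3, 0.0, -1.2], 0.1
closed = adversarial_loss_closed_form(x, y, beta, eps)
brute = adversarial_loss_brute_force(x, y, beta, eps)
```

    Both values agree (about 1.3225 for these numbers). Replacing the $\ell_1$ norm with the $\ell_2$ norm gives the analogous expression for $\ell_2$-bounded perturbations, which is where the resemblance to ridge regression comes from.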
    Differentially Private AUC Computation in Vertical Federated Learning. (arXiv:2205.12412v1 [cs.LG])
    Federated learning has gained great attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple parties. As a sub-category, vertical federated learning (vFL) focuses on the scenario where features and labels are split into different parties. The prior work on vFL has mostly studied how to protect label privacy during model training. However, model evaluation in vFL might also lead to potential leakage of private label information. One mitigation strategy is to apply label differential privacy (DP) but it gives bad estimations of the true (non-private) metrics. In this work, we propose two evaluation algorithms that can more accurately compute the widely used AUC (area under curve) metric when using label DP in vFL. Through extensive experiments, we show our algorithms can achieve more accurate AUCs compared to the baselines.  ( 2 min )
    Wavelet Feature Maps Compression for Image-to-Image CNNs. (arXiv:2205.12268v1 [cs.CV])
    Convolutional Neural Networks (CNNs) are known for requiring extensive computational resources, and quantization is among the best and most common methods for compressing them. While aggressive quantization (i.e., less than 4-bits) performs well for classification, it may cause severe performance degradation in image-to-image tasks such as semantic segmentation and depth estimation. In this paper, we propose Wavelet Compressed Convolution (WCC) -- a novel approach for high-resolution activation maps compression integrated with point-wise convolutions, which are the main computational cost of modern architectures. To this end, we use an efficient and hardware-friendly Haar-wavelet transform, known for its effectiveness in image compression, and define the convolution on the compressed activation map. We experiment on various tasks, that benefit from high-resolution input, and by combining WCC with light quantization, we achieve compression rates equivalent to 1-4bit activation quantization with relatively small and much more graceful degradation in performance.  ( 2 min )
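    To illustrate the Haar-wavelet step that WCC builds on, here is a one-level, one-dimensional Haar transform with a crude compression rule (keep only the largest detail coefficients). This is a simplified sketch under stated assumptions: the paper works on 2D activation maps and defines point-wise convolution on the compressed representation, none of which is reproduced here, and all names are hypothetical.

```python
import math


def haar_forward(signal):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def compress(signal, keep):
    """Zero all but the `keep` largest-magnitude detail coefficients."""
    approx, detail = haar_forward(signal)
    order = sorted(range(len(detail)), key=lambda i: abs(detail[i]), reverse=True)
    kept = set(order[:keep])
    detail = [d if i in kept else 0.0 for i, d in enumerate(detail)]
    return haar_inverse(approx, detail)


activations = [4.0, 4.0, 1.0, 3.0, 5.0, 5.0, 2.0, 8.0]
lossless = haar_inverse(*haar_forward(activations))
lossy = compress(activations, keep=1)
```

    Smooth regions produce zero detail coefficients and survive compression exactly, while the pair (1, 3) is replaced by its average; this is the kind of graceful degradation the abstract contrasts with aggressive quantization.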
    Giga-scale Kernel Matrix Vector Multiplication on GPU. (arXiv:2202.01085v2 [math.NA] UPDATED)
    Kernel matrix-vector multiplication (KMVM) is a foundational operation in machine learning and scientific computing. However, as KMVM tends to scale quadratically in both memory and time, applications are often limited by these computational constraints. In this paper, we propose a novel approximation procedure coined Faster-Fast and Free Memory Method ($\text{F}^3$M) to address these scaling issues of KMVM for tall ($10^8 \sim 10^9$) and skinny ($D \leq 7$) data. Extensive experiments demonstrate that $\text{F}^3$M has empirical linear time and memory complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points in under a minute on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, improving speed 1.5-5.5 times at the cost of $<1\%$ drop in accuracy. We further demonstrate competitive results on Gaussian Process regression coupled with significant speedups on a variety of real-world datasets.
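    To make the quadratic cost mentioned above concrete, here is a minimal sketch of the exact, naive KMVM baseline. It is not the proposed approximation; the Gaussian kernel, the `lengthscale` parameter and the function names are assumptions chosen for illustration. Kernel entries are computed on the fly, so memory stays linear, but the running time is still quadratic in the number of points.

```python
import math


def gaussian_kernel(x, y, lengthscale=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def kmvm(points, vector, lengthscale=1.0):
    """Exact product K @ v with K[i][j] = k(x_i, x_j), without storing K."""
    if len(points) != len(vector):
        raise ValueError("points and vector must have the same length")
    result = []
    for xi in points:  # n rows ...
        row_sum = 0.0
        for xj, vj in zip(points, vector):  # ... times n columns: O(n^2) kernel evaluations
            row_sum += gaussian_kernel(xi, xj, lengthscale) * vj
        result.append(row_sum)
    return result


if __name__ == "__main__":
    pts = [(0.0,), (1.0,)]  # toy 1D points, made up for illustration
    print(kmvm(pts, [1.0, 2.0]))
```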
    Sketch guided and progressive growing GAN for realistic and editable ultrasound image synthesis. (arXiv:2204.06929v3 [eess.IV] UPDATED)
    Ultrasound (US) imaging is widely used for anatomical structure inspection in clinical diagnosis. The training of new sonographers and deep learning based algorithms for US image analysis usually requires a large amount of data. However, obtaining and labeling large-scale US imaging data are not easy tasks, especially for diseases with low incidence. Realistic US image synthesis can alleviate this problem to a great extent. In this paper, we propose a generative adversarial network (GAN) based image synthesis framework. Our main contributions include: 1) we present the first work that can synthesize realistic B-mode US images with high resolution and customized texture editing features; 2) to enhance structural details of generated images, we propose to introduce auxiliary sketch guidance into a conditional GAN. We superpose the edge sketch onto the object mask and use the composite mask as the network input; 3) to generate high-resolution US images, we adopt a progressive training strategy to gradually generate high-resolution images from low-resolution images. In addition, a feature loss is proposed to minimize the difference of high-level features between the generated and real images, which further improves the quality of generated images; 4) the proposed US image synthesis method is quite universal and can also be generalized to US images of anatomical structures other than the three tested in our study (lung, hip joint, and ovary); 5) extensive experiments on three large US image datasets are conducted to validate our method. Ablation studies, customized texture editing, user studies, and segmentation tests demonstrate promising results of our method in synthesizing realistic US images.
    TrustGNN: Graph Neural Network based Trust Evaluation via Learnable Propagative and Composable Nature. (arXiv:2205.12784v1 [cs.LG])
    Trust evaluation is critical for many applications such as cyber security, social communication and recommender systems. Users and the trust relationships among them can be seen as a graph. Graph neural networks (GNNs) have shown a powerful ability to analyze graph-structured data. Very recently, existing work attempted to introduce the attributes and asymmetry of edges into GNNs for trust evaluation, but failed to capture some essential properties (e.g., the propagative and composable nature) of trust graphs. In this work, we propose a new GNN-based trust evaluation method named TrustGNN, which integrates the propagative and composable nature of trust graphs into a GNN framework for better trust evaluation. Specifically, TrustGNN designs specific propagative patterns for different propagative processes of trust, and distinguishes the contribution of different propagative processes to create new trust. Thus, TrustGNN can learn comprehensive node embeddings and predict trust relationships based on these embeddings. Experiments on some widely-used real-world datasets indicate that TrustGNN significantly outperforms the state-of-the-art methods. We further perform analytical experiments to demonstrate the effectiveness of the key designs in TrustGNN.
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v3 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population into distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
    Cross Domain Few-Shot Learning via Meta Adversarial Training. (arXiv:2202.05713v3 [cs.LG] UPDATED)
    Few-shot relation classification (RC) is one of the critical problems in machine learning. Current research focuses only on set-ups in which both training and testing data come from the same domain. However, in practice, this assumption does not always hold. In this study, we present a novel model that takes this cross-domain situation into consideration. Unlike previous models, we only use the source domain data to train the prototypical networks and test the model on target domain data. A meta-based adversarial training framework (MBATF) is proposed to fine-tune the trained networks for adapting to data from the target domain. Empirical studies confirm the effectiveness of the proposed model.
    HEBO Pushing The Limits of Sample-Efficient Hyperparameter Optimisation. (arXiv:2012.03826v6 [cs.LG] UPDATED)
    In this work we rigorously analyse assumptions inherent to black-box optimisation hyper-parameter tuning tasks. Our results on the Bayesmark benchmark indicate that heteroscedasticity and non-stationarity pose significant challenges for black-box optimisers. Based on these findings, we propose a Heteroscedastic and Evolutionary Bayesian Optimisation solver (HEBO). HEBO performs non-linear input and output warping, admits exact marginal log-likelihood optimisation and is robust to the values of learned parameters. We demonstrate HEBO's empirical efficacy on the NeurIPS 2020 Black-Box Optimisation challenge, where HEBO placed first. Upon further analysis, we observe that HEBO significantly outperforms existing black-box optimisers on 108 machine learning hyperparameter tuning tasks comprising the Bayesmark benchmark. Our findings indicate that the majority of hyper-parameter tuning tasks exhibit heteroscedasticity and non-stationarity, multi-objective acquisition ensembles with Pareto front solutions improve queried configurations, and robust acquisition maximisers afford empirical advantages relative to their non-robust counterparts. We hope these findings may serve as guiding principles for practitioners of Bayesian optimisation. All code is made available at https://github.com/huawei-noah/HEBO.  ( 2 min )
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v1 [stat.ML])
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on public data, then using private data only for "transfer learning." In particular, we minimize the maximum mean discrepancy (MMD) between private target data and the generated distribution, using a kernel based on perceptual features from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images faithfully with $\varepsilon \approx 2$, far surpassing the current state of the art, which only models MNIST and FashionMNIST at $\varepsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.  ( 2 min )
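    The quantity minimized above can be illustrated with a minimal sketch of the biased (V-statistic) estimate of the squared MMD between two samples. This is not the paper's method: it uses a plain Gaussian kernel on raw inputs rather than a kernel built on perceptual features, and it adds no privacy noise. The function names and the `bandwidth` parameter are assumptions for this example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


if __name__ == "__main__":
    real = [(0.0,), (0.2,)]       # toy "target" sample, made up for illustration
    generated = [(1.0,), (1.1,)]  # toy "generated" sample
    print(mmd_squared(real, generated))
```

    In this expression only the terms involving the target sample depend on the target data, which is why a one-time privatization of the data-dependent term, as the abstract describes, is possible in principle.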
    Deep Reinforcement Learning Guided Graph Neural Networks for Brain Network Analysis. (arXiv:2203.10093v2 [cs.LG] UPDATED)
    Modern neuroimaging techniques, such as diffusion tensor imaging (DTI) and functional magnetic resonance imaging (fMRI), enable us to model the human brain as a brain network or connectome. Capturing brain networks' structural information and hierarchical patterns is essential for understanding brain functions and disease states. Recently, the promising network representation learning capability of graph neural networks (GNNs) has prompted many GNN-based methods for brain network analysis to be proposed. Specifically, these methods apply feature aggregation and global pooling to convert brain network instances into meaningful low-dimensional representations used for downstream brain network analysis tasks. However, existing GNN-based methods often neglect that brain networks of different subjects may require various aggregation iterations and use GNN with a fixed number of layers to learn all brain networks. Therefore, how to fully release the potential of GNNs to promote brain network analysis is still non-trivial. To solve this problem, we propose a novel brain network representation framework, namely BN-GNN, which searches for the optimal GNN architecture for each brain network. Concretely, BN-GNN employs deep reinforcement learning (DRL) to train a meta-policy to automatically determine the optimal number of feature aggregations (reflected in the number of GNN layers) required for a given brain network. Extensive experiments on eight real-world brain network datasets demonstrate that our proposed BN-GNN improves the performance of traditional GNNs on different brain network analysis tasks.  ( 2 min )
    Robust Reinforcement Learning on Graphs for Logistics optimization. (arXiv:2205.12888v1 [cs.LG])
    Logistics optimization is becoming one of the hottest areas in the AI community. In the past year, significant advancements in the domain were achieved by representing the problem in the form of a graph. Another promising area of research has been to apply reinforcement learning algorithms to this task. In our work, we take advantage of both approaches and apply reinforcement learning on a graph. To do that, we analyzed the most recent results in both fields and selected SOTA algorithms from both graph neural networks and reinforcement learning. Then, we applied the selected models to the problem of AMOD systems optimization for the transportation network of New York City. We compared three algorithms - GAT, Pro-CNN and PTDNet - for highlighting the important nodes in a graph representation. Finally, we achieved SOTA results on the AMOD systems optimization problem by employing PTDNet with a GNN and training them in a reinforcement learning fashion. Keywords: Graph Neural Network (GNN), Logistics optimization, Reinforcement Learning
    Residual-Concatenate Neural Network with Deep Regularization Layers for Binary Classification. (arXiv:2205.12775v1 [cs.LG])
    Many complex Deep Learning models are used with different variations for various prognostication tasks. A larger number of learnable parameters does not necessarily ensure high accuracy. This can be addressed by modifying very deep models with many regularization-based techniques. In this paper we train a deep neural network that uses many regularization layers with residual and concatenation processes to best fit Polycystic Ovary Syndrome diagnosis prognostication. The network was built with improvements from every step of failure to meet the needs of the data and achieves an accuracy of 99.3%.
    Gradient-based explanations for Gaussian Process regression and classification models. (arXiv:2205.12797v1 [cs.LG])
    Gaussian Processes (GPs) have proven themselves as a reliable and effective method in probabilistic Machine Learning. Thanks to recent and current advances, modeling complex data with GPs is becoming more and more feasible. Thus, these types of models are, nowadays, an interesting alternative to Neural and Deep Learning methods, which are arguably the current state-of-the-art in Machine Learning. For the latter, we see an increasing interest in so-called explainable approaches - in essence, methods that aim to make a Machine Learning model's decision process transparent to humans. Such methods are particularly needed when illogical or biased reasoning can lead to actual disadvantageous consequences for humans. Ideally, explainable Machine Learning should help detect such flaws in a model and aid a subsequent debugging process. One active line of research in Machine Learning explainability is gradient-based methods, which have been successfully applied to complex neural networks. Given that GPs are closed under differentiation, gradient-based explainability for GPs appears to be a promising field of research. This paper is primarily focused on explaining GP classifiers via gradients where, contrary to GP regression, derivative GPs are not straightforward to obtain.
    Augmentation-induced Consistency Regularization for Classification. (arXiv:2205.12461v1 [cs.LG])
    Deep neural networks have become popular in many supervised learning tasks, but they may suffer from overfitting when the training dataset is limited. To mitigate this, many researchers use data augmentation, which is a widely used and effective method for increasing the variety of datasets. However, the randomness introduced by data augmentation causes inevitable inconsistency between training and inference, which limits the improvement. In this paper, we propose a consistency regularization framework based on data augmentation, called CR-Aug, which forces the output distributions of different sub-models generated by data augmentation to be consistent with each other. Specifically, CR-Aug evaluates the discrepancy between the output distributions of two augmented versions of each sample, and it utilizes a stop-gradient operation to minimize the consistency loss. We apply CR-Aug to image and audio classification tasks and conduct extensive experiments to verify its effectiveness in improving the generalization ability of classifiers. Our CR-Aug framework is ready to use and can be easily adapted to many state-of-the-art network architectures. Our empirical results show that CR-Aug outperforms baseline methods by a significant margin.
    Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit. (arXiv:1905.13654v11 [stat.ML] UPDATED)
    Recent work by Jacot et al. (2018) has shown that training a neural network using gradient descent in parameter space is related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model when the network width is large. Indeed, under regularity conditions, the NTK converges to a time-independent kernel in the infinite-width limit. This regime is often called the NTK regime. In parallel, recent works on signal propagation (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) studied the impact of the initialization and the activation function on signal propagation in deep neural networks. In this paper, we connect these two theories by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we provide a comprehensive analysis of the convergence rates of the NTK regime to the infinite depth regime.  ( 2 min )
    FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning. (arXiv:2205.07246v2 [cs.LG] UPDATED)
    Pseudo labeling and consistency regularization approaches based on confidence thresholding have made great progress in semi-supervised learning (SSL). However, we argue that existing methods might fail to adopt suitable thresholds since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to gain intuition about the relationship between the desirable threshold and the model's learning status. Based on this analysis, we propose FreeMatch to define and adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty that encourages the model to produce diverse predictions during the early stages of training. Extensive experimental results indicate the superiority of FreeMatch, especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively.
    A Neural Tangent Kernel Formula for Ensembles of Soft Trees with Arbitrary Architectures. (arXiv:2205.12904v1 [cs.LG])
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although it can have various tree architectures, the theoretical properties of their impact are not well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v1 [cs.LG])
    Learning causal structure poses a combinatorial search problem that typically involves evaluating structures using a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize the process of causal structure learning. Rather than searching over causal structures directly, we train a variational inference model to predict the causal structure from observational/interventional data. Our inference model acquires domain-specific inductive bias for causal discovery solely from data generated by a simulator. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Moreover, the architecture of our inference model is permutation invariant w.r.t. the data points and permutation equivariant w.r.t. the variables, facilitating generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities under substantial distribution shift and significantly outperform existing algorithms, especially in the challenging genomics domain.
    Skill Machines: Temporal Logic Composition in Reinforcement Learning. (arXiv:2205.12532v1 [cs.LG])
    A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines -- finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines that encode the solution to such tasks. We propose a framework where an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, such as linear temporal logics. This provides the agent with the ability to map from complex logical task specifications to near-optimal behaviours zero-shot. We demonstrate our approach in both a tabular and high-dimensional video game environment, where an agent is faced with several of these complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with regular offline reinforcement learning algorithms when optimal behaviours are desired.
    Sharpness-Aware Minimization with Dynamic Reweighting. (arXiv:2112.08772v3 [cs.LG] UPDATED)
    Deep neural networks are often overparameterized and may not easily achieve model generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm conducts adversarial weight perturbation, encouraging the model to converge to a flat minimum. SAM finds a common adversarial weight perturbation per batch. Although per-instance adversarial weight perturbations are stronger adversaries and can potentially lead to better generalization performance, their computational cost is very high, which makes it impractical to use per-instance perturbations efficiently in SAM. In this paper, we tackle this efficiency bottleneck and propose sharpness-aware minimization with dynamic reweighting (δ-SAM). Our theoretical analysis suggests that it is possible to approach the stronger, per-instance adversarial weight perturbations using reweighted per-batch weight perturbations. δ-SAM dynamically reweights perturbation within each batch according to the theoretically principled weighting factors, serving as a good approximation to per-instance perturbation. Experiments on various natural language understanding tasks demonstrate the effectiveness of δ-SAM.
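    For context, the per-batch SAM update that this abstract builds on can be sketched as follows. This is a minimal illustration of standard SAM only, without the dynamic reweighting the paper proposes; the function name `sam_step`, the hyperparameters `lr` and `rho`, and the toy quadratic loss are assumptions for this example.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step.

    1. Compute the gradient g at the current weights.
    2. Move to the adversarial point w + rho * g / ||g||.
    3. Descend from the ORIGINAL weights using the gradient taken at that point.
    """
    g = grad_fn(weights)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction exists, so leave the weights unchanged.
        return list(weights)
    perturbed = [w + rho * gi / norm for w, gi in zip(weights, g)]
    g_adv = grad_fn(perturbed)
    return [w - lr * gi for w, gi in zip(weights, g_adv)]


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2, which is simply w."""
    return list(w)


if __name__ == "__main__":
    print(sam_step([3.0, 4.0], quadratic_grad))
```

    Each step needs two gradient evaluations per batch; computing a separate perturbation for every instance would multiply that cost by the batch size, which is the bottleneck the abstract refers to.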
    Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. (arXiv:2205.12685v1 [cs.CL])
    Despite the recent explosion of research interest, in-context learning and the precise impact of the quality of demonstrations remain elusive. While, based on current literature, it is expected that in-context learning shares a similar mechanism with supervised learning, Min et al. (2022) recently reported that, surprisingly, input-label correspondence is less important than other aspects of prompt demonstrations. Inspired by this counter-intuitive observation, we re-examine the importance of ground-truth labels in in-context learning from diverse and statistical points of view. With the aid of newly introduced metrics, i.e., Ground-truth Label Effect Ratio (GLER), demo-gain, and label sensitivity, we find that the impact of correct input-label matching can vary according to different configurations. Expanding upon the previous key finding on the role of demonstrations, the complementary and contrastive results suggest that one might need to take more care when estimating the impact of each component in in-context learning demonstrations.
    NeuralPDE: Modelling Dynamical Systems from Data. (arXiv:2111.07671v2 [cs.LG] UPDATED)
    Many physical processes such as weather phenomena or fluid mechanics are governed by partial differential equations (PDEs). Modelling such dynamical systems using Neural Networks is an active research field. However, current methods are still very limited, as they do not exploit the knowledge about the dynamical nature of the system, require extensive prior knowledge about the governing equations or are limited to linear or first-order equations. In this work we make the observation that the Method of Lines used to solve PDEs can be represented using convolutions which makes convolutional neural networks (CNNs) the natural choice to parametrize arbitrary PDE dynamics. We combine this parametrization with differentiable ODE solvers to form the NeuralPDE Model, which explicitly takes into account the fact that the data is governed by differential equations. We show in several experiments on toy and real-world data that our model consistently outperforms state-of-the-art models used to learn dynamical systems.
    Autoformalization with Large Language Models. (arXiv:2205.12615v1 [cs.LG])
    Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs. A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence. While the long-term goal of autoformalization seemed elusive for a long time, we show large language models provide new prospects towards this goal. We make the surprising observation that LLMs can correctly translate a significant portion ($25.3\%$) of mathematical competition problems perfectly to formal specifications in Isabelle/HOL. We demonstrate the usefulness of this process by improving a previously introduced neural theorem prover via training on these autoformalized theorems. Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6\%$ to $35.2\%$.
    Eliciting Transferability in Multi-task Learning with Task-level Mixture-of-Experts. (arXiv:2205.12701v1 [cs.CL])
    Recent work suggests that transformer models are capable of multi-task learning on diverse NLP tasks. However, the potential of these models may be limited as they use the same set of parameters for all tasks. In contrast, humans tackle tasks in a more flexible way, by making proper presumptions on what skills and knowledge are relevant and executing only the necessary computations. Inspired by this, we propose to use task-level mixture-of-experts models, which have a collection of transformer layers (i.e., experts) and a router component to choose among these experts dynamically and flexibly. We show that the learned routing decisions and experts partially rediscover human categorization of NLP tasks -- certain experts are strongly associated with extractive tasks, some with classification tasks, and some with tasks requiring world knowledge.
    muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems. (arXiv:2205.10937v2 [cs.LG] UPDATED)
    Most uses of machine learning today involve training a model from scratch for a particular task, or sometimes starting with a model pretrained on a related task and then fine-tuning on a downstream task. Both approaches offer limited knowledge transfer between different tasks, require time-consuming human-driven customization to individual tasks, and incur high computational costs, especially when starting from randomly initialized models. We propose a method that uses the layers of a pretrained deep neural network as building blocks to construct an ML system that can jointly solve an arbitrary number of tasks. The resulting system can leverage cross-task knowledge transfer, while being immune to common drawbacks of multitask approaches such as catastrophic forgetting, gradient interference and negative transfer. We define an evolutionary approach designed to jointly select the prior knowledge relevant for each task, choose the subset of the model parameters to train and dynamically auto-tune its hyperparameters. Furthermore, a novel scale control method is employed to achieve quality/size trade-offs that outperform common fine-tuning techniques. Compared with standard fine-tuning on a benchmark of 10 diverse image classification tasks, the proposed model improves the average accuracy by 2.39% while using 47% fewer parameters per task.
    Tell me why! Explanations support learning relational and causal structure. (arXiv:2112.03753v3 [cs.LG] UPDATED)
    Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
    Removing the fat from your posterior samples with margarine. (arXiv:2205.12841v1 [astro-ph.IM])
    Bayesian workflows often require the introduction of nuisance parameters, yet for core science modelling one needs access to a marginal posterior density. In this work we use masked autoregressive flows and kernel density estimators to encapsulate the marginal posterior, allowing us to compute marginal Kullback-Leibler divergences and marginal Bayesian model dimensionalities in addition to generating samples and computing marginal log probabilities. We demonstrate this in application to topical cosmological examples of the Dark Energy Survey, and global 21cm signal experiments. In addition to the computation of marginal Bayesian statistics, this work is important for further applications in Bayesian experimental design, complex prior modelling and likelihood emulation. This technique is made publicly available in the pip-installable code margarine.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v1 [cs.LG])
    We propose a recipe on how to build a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being $\textit{local}$, $\textit{global}$ or $\textit{relative}$. Further, GTs remain constrained to small graphs with few hundred nodes, and we propose the first architecture with a complexity linear to the number of nodes and edges $O(N+E)$ by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect the expressivity, with our architecture being a universal function approximator for graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We build and open-source a modular framework $\textit{GraphGPS}$ that supports multiple types of encodings and that provides efficiency and scalability both in small and large graphs. We test our architecture on 11 benchmarks and show very competitive results on all of them, show-casing the empirical benefits gained by the modularity and the combination of different strategies.
    Training Heterogeneous Features in Sequence to Sequence Tasks: Latent Enhanced Multi-filter Seq2Seq Model. (arXiv:2105.08840v3 [cs.CL] UPDATED)
    In language processing, training data with extremely large variance may lead to difficulty in the language model's convergence. It is difficult for the network parameters to adapt to sentences with largely varied semantics or grammatical structures. To resolve this problem, we introduce a model that concentrates on each of the heterogeneous features in the input sentences. Building upon the encoder-decoder architecture, we design a latent-enhanced multi-filter seq2seq model (LEMS) that analyzes the input representations by introducing a latent space transformation and clustering. The representations are extracted from the final hidden state of the encoder and lie in the latent space. A latent space transformation is applied to enhance the quality of the representations. Thus the clustering algorithm can easily separate samples based on the features of these representations. Multiple filters are trained by the features from their corresponding clusters, and the heterogeneity of the training data can be resolved accordingly. We conduct two sets of comparative experiments on semantic parsing and machine translation, using the Geo-query dataset and the Multi30k English-French dataset, respectively, to demonstrate the improvement our model achieves.
    RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning. (arXiv:2205.12598v1 [cs.CL])
    Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in English natural language. While the progress is promising, it is currently unclear if these models indeed perform logical reasoning by understanding the underlying logical semantics in the language. To this end, we propose RobustLR, a suite of evaluation datasets that evaluate the robustness of these models to minimal logical edits in rulebases and some standard logical equivalence conditions. In our experiments with RoBERTa and T5, we find that the models trained in prior works do not perform consistently on the different perturbations in RobustLR, thus showing that the models are not robust to the proposed logical perturbations. Further, we find that the models find it especially hard to learn logical negation and disjunction operators. Overall, using our evaluation sets, we demonstrate some shortcomings of the deductive reasoning-based language models, which can eventually help towards designing better models for logical reasoning over natural language.
    Boosting Tail Neural Network for Realtime Custom Keyword Spotting. (arXiv:2205.12933v1 [eess.AS])
    In this paper, we propose a Boosting Tail Neural Network (BTNN) for improving the performance of Realtime Custom Keyword Spotting (RCKS), which remains an industrial challenge because it demands powerful classification ability with limited computation resources. The design is inspired by the observation from brain science that a brain is only partly activated for a nerve stimulation, and by the numerous machine learning algorithms that use a batch of weak classifiers to resolve arduous problems and have often proved effective. We show that this method is helpful for the RCKS problem. The proposed approach achieves better performance in terms of wakeup rate and false alarm. In our experiments, it obtains an 18% relative improvement compared with traditional algorithms that use only one strong classifier. We also point out that this approach may be promising in future ASR exploration.
    Learning Mean Field Games: A Survey. (arXiv:2205.12944v1 [cs.LG])
    Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malhamé, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with a full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising to solve complex problems. By combining MFGs and RL, we hope to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.
    Optimizing Warfarin Dosing using Deep Reinforcement Learning. (arXiv:2202.03486v2 [cs.LG] UPDATED)
    Warfarin is a widely used anticoagulant, and has a narrow therapeutic range. Dosing of warfarin should be individualized, since slight overdosing or underdosing can have catastrophic or even fatal consequences. Despite much research on warfarin dosing, current dosing protocols do not live up to expectations, especially for patients sensitive to warfarin. We propose a deep reinforcement learning-based dosing model for warfarin. To overcome the issue of relatively small sample sizes in dosing trials, we use a Pharmacokinetic/Pharmacodynamic (PK/PD) model of warfarin to simulate dose-responses of virtual patients. Applying the proposed algorithm on virtual test patients shows that this model outperforms a set of clinically accepted dosing protocols by a wide margin. We tested the robustness of our dosing protocol on a second PK/PD model and showed that its performance is comparable to the set of baseline protocols.
    Understanding Programmatic Weak Supervision via Source-aware Influence Function. (arXiv:2205.12879v1 [cs.LG])
    Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. To achieve this, we build on Influence Function (IF) and propose source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as source vote, supervision source, and training data. On datasets of diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles that reveals insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).
    Learning Distributions by Generative Adversarial Networks: Approximation and Generalization. (arXiv:2205.12601v1 [cs.LG])
    We study how well generative adversarial networks (GANs) learn probability distributions from finite samples by analyzing the convergence rates of these models. Our analysis is based on a new oracle inequality that decomposes the estimation error of GANs into the discriminator and generator approximation errors, generalization error and optimization error. To estimate the discriminator approximation error, we establish error bounds on approximating Hölder functions by ReLU neural networks, with explicit upper bounds on the Lipschitz constant of the network or norm constraint on the weights. For the generator approximation error, we show that neural networks can approximately transform a low-dimensional source distribution to a high-dimensional target distribution, and we bound this approximation error by the width and depth of the neural network. Combining the approximation results with generalization bounds of neural networks from statistical learning theory, we establish the convergence rates of GANs in various settings, when the error is measured by a collection of integral probability metrics defined through Hölder classes, including the Wasserstein distance as a special case. In particular, for distributions concentrated around a low-dimensional set, we show that the convergence rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension.
    Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation. (arXiv:2109.06604v2 [cs.CL] UPDATED)
    Recently, $k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, it heavily relies on high-quality in-domain parallel corpora, limiting its capability on unsupervised domain adaptation, where in-domain parallel corpora are scarce or nonexistent. In this paper, we propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval. To this end, we first introduce an autoencoder task based on the target language, and then insert lightweight adapters into the original NMT model to map the token-level representation of this task to the ideal representation of translation task. Experiments on multi-domain datasets demonstrate that our proposed approach significantly improves the translation accuracy with target-side monolingual data, while achieving comparable performance with back-translation.
    From Noisy Prediction to True Label: Noisy Prediction Calibration via Generative Model. (arXiv:2205.00690v2 [cs.LG] UPDATED)
    Noisy labels are inevitable yet problematic in the machine learning community. They ruin the generalization power of a classifier by causing it to overfit to wrong labels. Existing methods for noisy labels have focused on modifying the classifier training procedure, which results in two possible problems. First, these methods are not applicable to a pre-trained classifier without further access to training. Second, it is not easy to train a classifier and remove all of the negative effects of noisy labels simultaneously. To address these problems, we suggest a new branch of approach, Noisy Prediction Calibration (NPC), for learning with noisy labels. Through the introduction and estimation of a new type of transition matrix via a generative model, NPC corrects the noisy prediction from the pre-trained classifier to the true label as a post-processing scheme. We prove that NPC theoretically aligns with the transition matrix based methods. Yet, NPC provides a more accurate pathway to estimate the true label, even without involvement in classifier learning. Also, NPC is applicable to any classifier trained with noisy label methods, if training instances and their predictions are available. Our method, NPC, boosts the classification performance of all baseline models on both synthetic and real-world datasets.
    Certified Robustness Against Natural Language Attacks by Causal Intervention. (arXiv:2205.12331v1 [cs.LG])
    Deep learning models have achieved great success in many fields, yet they are vulnerable to adversarial examples. This paper follows a causal perspective to look into the adversarial vulnerability and proposes Causal Intervention by Semantic Smoothing (CISS), a novel framework towards robustness against natural language attacks. Instead of merely fitting observational data, CISS learns causal effects p(y|do(x)) by smoothing in the latent semantic space to make robust predictions, which scales to deep architectures and avoids tedious construction of noise customized for specific attacks. CISS is provably robust against word substitution attacks, as well as empirically robust even when perturbations are strengthened by unknown attack algorithms. For example, on YELP, CISS surpasses the runner-up by 6.7% in terms of certified robustness against word substitutions, and achieves 79.4% empirical robustness when syntactic attacks are integrated.
    Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently. (arXiv:2205.12808v1 [cs.LG])
    Driven by the empirical success and wide use of deep neural networks, understanding the generalization performance of overparameterized models has become an increasingly popular question. To this end, there has been substantial effort to characterize the implicit bias of the optimization algorithms used, such as gradient descent (GD), and the structural properties of their preferred solutions. This paper answers an open question in this literature: For the classification setting, what solution does mirror descent (MD) converge to? Specifically, motivated by its efficient implementation, we consider the family of mirror descent algorithms with potential function chosen as the $p$-th power of the $\ell_p$-norm, which is an important generalization of GD. We call this algorithm $p$-$\textsf{GD}$. For this family, we characterize the solutions it obtains and show that it converges in direction to a generalized maximum-margin solution with respect to the $\ell_p$-norm for linearly separable classification. While the MD update rule is in general expensive to compute and perhaps not suitable for deep learning, $p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can be used to train deep neural networks with virtually no additional computational overhead. Using comprehensive experiments with both linear and deep neural network models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect the structure and the generalization performance of the learned models.
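    The update described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the potential is scaled as psi(w) = (1/p) * ||w||_p^p with p > 1, and the function names, toy data, step size, and logistic loss are all hypothetical choices made for the example. With this potential, the mirror map and its inverse act coordinate-wise, which is why each step costs about as much as a gradient step.

```python
import numpy as np


def p_gd_step(w, grad, lr, p):
    """One mirror descent step with potential psi(w) = (1/p) * ||w||_p^p, p > 1.

    The 1/p scaling is an assumption of this sketch.
    """
    # Mirror map: grad psi(w) = sign(w) * |w|^(p-1), applied coordinate-wise.
    z = np.sign(w) * np.abs(w) ** (p - 1)
    # Ordinary gradient step, taken in the dual (mirror) space.
    z = z - lr * grad
    # Inverse mirror map brings the iterate back to the primal space.
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))


def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * <w, x>))."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))
    return (X * coeff[:, None]).mean(axis=0)


def train_p_gd(X, y, p, lr=0.1, steps=200):
    """Run p-GD from the origin on the logistic loss (toy setup)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = p_gd_step(w, logistic_grad(w, X, y), lr, p)
    return w


# Hypothetical linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

if __name__ == "__main__":
    for p in (1.5, 2.0, 3.0):
        w = train_p_gd(X, y, p)
        # Print the normalized direction, since the abstract's claim is about
        # convergence in direction rather than in norm.
        print(p, w / np.linalg.norm(w))
```

    For p = 2 the mirror map is the identity, so `p_gd_step` reduces to a plain gradient descent step, which gives a quick sanity check on the sketch.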
    Graph Neural Networks Designed for Different Graph Types: A Survey. (arXiv:2204.03080v2 [cs.LG] UPDATED)
    Graphs are ubiquitous in nature and can therefore serve as models for many practical but also theoretical problems. Based on this, the young research field of Graph Neural Networks (GNNs) has emerged. Despite the youth of the field and the speed at which new models are developed, many good surveys have been published in recent years. Nevertheless, an overview of which graph types can be modeled by GNNs is missing. In this survey, we give a detailed overview of already existing GNNs and, unlike previous surveys, categorize them according to their ability to handle different graph types and properties. We consider GNNs operating on static as well as on dynamic graphs of different structural constitutions, with or without node or edge attributes. Moreover, in the dynamic case, we separate the models into discrete-time and continuous-time dynamic graphs based on their architecture. While ordering the existing GNN models, we find that there are still graph types that are not or only rarely covered by existing GNN models. We point out where models are missing and give potential reasons for their absence.
    RADNet: Ensemble Model for Robust Glaucoma Classification in Color Fundus Images. (arXiv:2205.12902v1 [eess.IV])
    Glaucoma is one of the most severe eye diseases, characterized by rapid progression and leading to irreversible blindness. It is often the case that pathology diagnostics are carried out only when one's sight has already significantly degraded, due to the lack of noticeable symptoms at the early stage of the disease. Regular glaucoma screenings of the population would improve early-stage detection; however, the desirable frequency of ophthalmological checkups is often not feasible due to the excessive load imposed by manual diagnostics on a limited number of specialists. Considering that the basic methodology to detect glaucoma is to analyze fundus images for the \textit{optic-disc-to-optic-cup ratio}, the Machine Learning domain can offer sophisticated tooling for image processing and classification. In our work, we propose an advanced image pre-processing technique combined with an ensemble of deep classification networks. Our \textit{Retinal Auto Detection (RADNet)} model has been successfully tested on the Rotterdam EyePACS AIROGS train dataset with an AUC of 0.92, and then additionally finetuned and tested on a fraction of the RIM-ONE DL dataset with an AUC of 0.91.
    Service Discovery in Social Internet of Things using Graph Neural Networks. (arXiv:2205.12711v1 [cs.LG])
    Internet-of-Things (IoT) networks intelligently connect thousands of physical entities to provide various services for the community. It is witnessing an exponential expansion, which is complicating the process of discovering IoT devices existing in the network and requesting corresponding services from them. As the highly dynamic nature of the IoT environment hinders the use of traditional solutions of service discovery, we aim, in this paper, to address this issue by proposing a scalable resource allocation neural model adequate for heterogeneous large-scale IoT networks. We devise a Graph Neural Network (GNN) approach that utilizes the social relationships formed between the devices in the IoT network to reduce the search space of any entity lookup and acquire a service from another device in the network. This proposed resource allocation approach surpasses standardization issues and embeds the structure and characteristics of the social IoT graph, by the means of GNNs, for eventual clustering analysis process. Simulation results applied on a real-world dataset illustrate the performance of this solution and its significant efficiency to operate on large-scale IoT networks.
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v4 [cs.LG] UPDATED)
    Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v1 [cs.LG])
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume certain amount of resource from each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Conditional Gradients for the Approximately Vanishing Ideal. (arXiv:2202.03349v8 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI) for the construction of the set of generators of the approximately vanishing ideal. The constructed set of generators captures polynomial structures in data and gives rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In CGAVI, we construct the set of generators by solving specific instances of (constrained) convex optimization problems with the Pairwise Frank-Wolfe algorithm (PFW). Among other things, the constructed generators inherit the LASSO generalization bound and not only vanish on the training but also on out-sample data. Moreover, CGAVI admits a compact representation of the approximately vanishing ideal by constructing few generators with sparse coefficient vectors.
    Women, artificial intelligence, and key positions in collaboration networks: Towards a more equal scientific ecosystem. (arXiv:2205.12339v1 [cs.SI])
    Scientific collaboration in almost every discipline is mainly driven by the need to share knowledge, expertise, and pooled resources. Science is becoming more complex, which has encouraged scientists to become more involved in collaborative research projects in order to better address the challenges. As a highly interdisciplinary field with a rapidly evolving scientific landscape, artificial intelligence calls for researchers with special profiles covering a diverse set of skills and expertise. Understanding gender aspects of scientific collaboration is of paramount importance, especially in a field such as artificial intelligence that has been attracting large investments. Using social network analysis, natural language processing, and machine learning and focusing on artificial intelligence publications for the period from 2000 to 2019, in this work, we comprehensively investigated the effects of several driving factors on acquiring key positions in scientific collaboration networks through a gender lens. It was found that, regardless of gender, scientific performance in terms of quantity and impact plays a crucial role in possessing the "social researcher" role in the network. However, subtle differences were observed between female and male researchers in acquiring the "local influencer" role.
    A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning. (arXiv:2201.01221v2 [cs.LG] UPDATED)
    Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.
    Fast Stochastic Composite Minimization and an Accelerated Frank-Wolfe Algorithm under Parallelization. (arXiv:2205.12751v1 [math.OC])
    We consider the problem of minimizing the sum of two convex functions. One of those functions has Lipschitz-continuous gradients, and can be accessed via stochastic oracles, whereas the other is "simple". We provide a Bregman-type algorithm with accelerated convergence in function values to a ball containing the minimum. The radius of this ball depends on problem-dependent constants, including the variance of the stochastic oracle. We further show that this algorithmic setup naturally leads to a variant of Frank-Wolfe achieving acceleration under parallelization. More precisely, when minimizing a smooth convex function on a bounded domain, we show that one can achieve an $\epsilon$ primal-dual gap (in expectation) in $\tilde{O}(1/ \sqrt{\epsilon})$ iterations, by only accessing gradients of the original function and a linear maximization oracle with $O(1/\sqrt{\epsilon})$ computing units in parallel. We illustrate this fast convergence on synthetic numerical experiments.
    Is a Question Decomposition Unit All We Need?. (arXiv:2205.12538v1 [cs.CL])
    Large Language Models (LMs) have achieved state-of-the-art performance on many Natural Language Processing (NLP) benchmarks. With the growing number of new benchmarks, we build bigger and more complex LMs. However, building new LMs may not be an ideal option owing to the cost, time and environmental impact associated with it. We explore an alternative route: can we modify data by expressing it in terms of the model's strengths, so that a question becomes easier for models to answer? We investigate if humans can decompose a hard question into a set of simpler questions that are relatively easier for models to solve. We analyze a range of datasets involving various forms of reasoning and find that it is indeed possible to significantly improve model performance (24% for GPT3 and 29% for RoBERTa-SQuAD along with a symbolic calculator) via decomposition. Our approach provides a viable option to involve people in NLP research in a meaningful way. Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v2 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v1 [stat.ML])
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that the existence of an exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via a \emph{transport-entropy inequality} (T-E). These results have significant impact in \emph{reinforcement learning} (RL) and \emph{controls}, leading to exponential concentration inequalities even for unbounded observables, while assuming neither reversibility nor exact knowledge of the random dynamical system (assumptions at the heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity. (arXiv:2104.13790v3 [cs.LG] UPDATED)
    AdaBelief, one of the current best optimizers, demonstrates superior generalization ability compared to the popular Adam algorithm by viewing the exponential moving average of observed gradients. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains however an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that aims to exploit strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size so that it better accounts for strong convexity and prevents fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependent $O(\log T)$ regret bound, which is substantially lower than that of AdaBelief. On the empirical side, we validate our theoretical analysis with extensive experiments in both scenarios of strong and non-strong convexity on three popular baseline models. Experimental results are very encouraging: FastAdaBelief converges the quickest in comparison to all mainstream algorithms while maintaining an excellent generalization ability, in cases of both strong and non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.
    Detecting Multi-Sensor Fusion Errors in Advanced Driver-Assistance Systems. (arXiv:2109.06404v3 [cs.RO] UPDATED)
    Advanced Driver-Assistance Systems (ADAS) have been thriving and widely deployed in recent years. In general, these systems receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor outputs, they usually leverage multi-sensor fusion (MSF) to fuse the sensor outputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data and how to optimally integrate the data provided by the sensors. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular MSF methods in an industry-grade ADAS can mislead the car control and result in serious safety hazards. We define the failures (e.g., car crashes) caused by the faulty MSF as fusion errors and develop a novel evolutionary-based domain-specific search framework, FusED, for the efficient detection of fusion errors. We further apply causality analysis to show that the found fusion errors are indeed caused by the MSF method. We evaluate our framework on two widely used MSF methods in two driving environments. Experimental results show that FusED identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods we study.
    Label Leakage and Protection from Forward Embedding in Vertical Federated Learning. (arXiv:2203.01451v3 [cs.LG] UPDATED)
    Vertical federated learning (vFL) has gained much attention and been deployed to solve machine learning problems with data privacy concerns in recent years. However, some recent work demonstrated that vFL is vulnerable to privacy leakage even though only the forward intermediate embedding (rather than raw features) and backpropagated gradients (rather than raw labels) are communicated between the involved participants. As the raw labels often contain highly sensitive information, some recent methods have been proposed to effectively prevent label leakage from the backpropagated gradients in vFL. However, these works only identified and defended against the threat of label leakage from the backpropagated gradients. None of them has paid attention to the problem of label leakage from the intermediate embedding. In this paper, we propose a practical label inference method which can steal private labels effectively from the shared intermediate embedding even though some existing protection methods such as label differential privacy and gradient perturbation are applied. The effectiveness of the label attack is inseparable from the correlation between the intermediate embedding and corresponding private labels. To mitigate the issue of label leakage from the forward embedding, we add an additional optimization goal at the label party to limit the label stealing ability of the adversary by minimizing the distance correlation between the intermediate embedding and corresponding private labels. We conducted extensive experiments to demonstrate the effectiveness of our proposed protection methods.
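    The quantity that the label party minimizes above, the distance correlation between embeddings and labels, can be illustrated with a small sketch. This is the standard sample estimator written with NumPy, not the authors' implementation; the function name and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between row-aligned arrays x (n, p) and y (n, q)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def centered_distances(z):
        # Pairwise Euclidean distances, then double centering (rows, columns, grand mean).
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    a, b = centered_distances(x), centered_distances(y)
    dcov2 = (a * b).mean()                      # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:                            # one input is constant
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

# Toy check (hypothetical data): an affine copy is fully dependent, a constant is not.
x = np.linspace(0.0, 1.0, 50)
dcor_dependent = distance_correlation(x, 2.0 * x + 1.0)
dcor_constant = distance_correlation(x, np.ones(50))
```

    A defense along the lines described would add a term proportional to this value, computed between a batch of embeddings and their labels, to the training loss; how that term is weighted is not specified here.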
    Interpretable Feature Engineering for Time Series Predictors using Attention Networks. (arXiv:2205.12723v1 [cs.LG])
    Regression problems with time-series predictors are common in banking and many other areas of application. In this paper, we use multi-head attention networks to develop interpretable features and use them to achieve good predictive performance. The customized attention layer explicitly uses multiplicative interactions and builds feature-engineering heads that capture temporal dynamics in a parsimonious manner. Convolutional layers are used to combine multivariate time series. We also discuss methods for handling static covariates in the modeling process. Visualization and explanation tools are used to interpret the results and explain the relationship between the inputs and the extracted features. Both simulated and real datasets are used to illustrate the usefulness of the methodology. Keywords: Attention heads, Deep neural networks, Interpretable feature engineering
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v2 [stat.ML] UPDATED)
    We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Structured Uncertainty in the Observation Space of Variational Autoencoders. (arXiv:2205.12533v1 [cs.LG])
    Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically meaningful variations from the mean allowing the prediction of multiple plausible outputs with a single forward pass.
    Global geomagnetic perturbation forecasting using Deep Learning. (arXiv:2205.12734v1 [physics.space-ph])
    Geomagnetically Induced Currents (GICs) arise from spatio-temporal changes to Earth's magnetic field, which in turn arise from the interaction of the solar wind with Earth's magnetosphere, and can cause catastrophic damage to our technologically dependent society. Hence, computational models to forecast GICs globally with a large forecast horizon, high spatial resolution and high temporal cadence are of increasing importance for prompt mitigation. Since GIC data is proprietary, the time variability of the horizontal component of the magnetic field perturbation (dB/dt) is used as a proxy for GICs. In this work, we develop a fast, global dB/dt forecasting model, which forecasts 30 minutes into the future using only solar wind measurements as input. The model summarizes 2 hours of solar wind measurements using a Gated Recurrent Unit, and generates forecasts of coefficients which are folded with a spherical harmonic basis to enable global forecasts. When deployed, our model produces results in under a second, and generates global forecasts for horizontal magnetic perturbation components at 1-minute cadence. We evaluate our model against models in the literature for two specific storms of 5 August 2011 and 17 March 2015, while having a self-consistent benchmark model set. Our model outperforms, or has consistent performance with, state-of-the-practice high time cadence local and low time cadence global models, while also outperforming or having comparable performance with the benchmark models. Such quick inferences at high temporal cadence and arbitrary spatial resolutions may ultimately enable accurate forewarning of dB/dt for any place on Earth, allowing precautionary measures to be taken in an informed manner.
    Label Leakage and Protection in Two-party Split Learning. (arXiv:2102.08504v3 [cs.LG] UPDATED)
    Two-party split learning is a popular technique for learning a model across feature-partitioned data. In this work, we explore whether it is possible for one party to steal the private label information from the other party during split training, and whether there are methods that can protect against such attacks. Specifically, we first formulate a realistic threat model and propose a privacy loss metric to quantify label leakage in split learning. We then show that there exist two simple yet effective methods within the threat model that can allow one party to accurately recover private ground-truth labels owned by the other party. To combat these attacks, we propose several random perturbation techniques, including $\texttt{Marvell}$, an approach that strategically finds the structure of the noise perturbation by minimizing the amount of label leakage (measured through our quantification metric) of a worst-case adversary. We empirically demonstrate the effectiveness of our protection techniques against the identified attacks, and show that $\texttt{Marvell}$ in particular has improved privacy-utility tradeoffs relative to baseline approaches.
    Domain Adaptation via Maximizing Surrogate Mutual Information. (arXiv:2110.12184v2 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims to predict unlabeled data from the target domain with access to labeled data from the source domain. In this work, we propose a novel framework called SIDA (Surrogate Mutual Information Maximization Domain Adaptation) with strong theoretical guarantees. To be specific, SIDA implements adaptation by maximizing mutual information (MI) between features. In the framework, a surrogate joint distribution models the underlying joint distribution of the unlabeled target domain. Our theoretical analysis validates SIDA by bounding the expected risk on the target domain with MI and surrogate distribution bias. Experiments show that our approach is comparable with state-of-the-art unsupervised adaptation methods on standard UDA tasks.
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v3 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework targets the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where exact inference of latent variables is typically intractable. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space often lacks physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for enhancing and generalizing the predictive capabilities of deep learning-based models.
    Reward Uncertainty for Exploration in Preference-based Reinforcement Learning. (arXiv:2205.12401v1 [cs.LG])
    Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model based on human preferences by actively incorporating human feedback, i.e. a teacher's preferences between two clips of behavior. However, poor feedback-efficiency remains a problem in current preference-based RL algorithms, as tailored human feedback is very expensive. To handle this issue, previous methods have mainly focused on improving query selection and policy initialization. At the same time, recent exploration methods have proven to be a recipe for improving sample-efficiency in RL. We present an exploration method specifically for preference-based RL algorithms. Our main idea is to design an intrinsic reward by measuring novelty based on the learned reward. Specifically, we utilize disagreement across an ensemble of learned reward models. Our intuition is that disagreement in the learned reward models reflects uncertainty in tailored human feedback and could be useful for exploration. Our experiments show that an exploration bonus from uncertainty in the learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms on complex robot manipulation tasks from MetaWorld benchmarks, compared with other existing exploration methods that measure the novelty of state visitation.
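    The ensemble-disagreement idea above can be sketched in a few lines. The abstract does not say how disagreement is measured, so using the standard deviation of the ensemble's predictions is an assumption here, as are the function name and the toy linear reward models.

```python
from statistics import pstdev

def intrinsic_reward(reward_models, state, action):
    """Exploration bonus: spread of the ensemble's reward predictions.

    Assumption: disagreement is measured as the population standard deviation.
    """
    predictions = [model(state, action) for model in reward_models]
    return pstdev(predictions)

# Hypothetical toy ensemble: three linear reward models with different slopes.
ensemble = [lambda s, a, w=w: w * s + a for w in (0.9, 1.0, 1.1)]

bonus_near = intrinsic_reward(ensemble, state=0.0, action=1.0)   # members agree
bonus_far = intrinsic_reward(ensemble, state=10.0, action=1.0)   # members disagree
```

    In this toy setup the bonus is zero where the models agree and grows where their predictions diverge, which is the behavior an exploration bonus of this kind relies on.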
    Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT. (arXiv:2205.12399v1 [cs.LG])
    We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. The Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms (<0.2%) BERT on SuperGLUE, but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and model hyperparameters. The Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v1 [cs.LG])
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance of exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs. gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if we choose Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithm in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
    GuardNN: Secure Accelerator Architecture for Privacy-Preserving Deep Learning. (arXiv:2008.11632v2 [cs.CR] UPDATED)
    This paper proposes GuardNN, a secure DNN accelerator that provides hardware-based protection for user data and model parameters even in an untrusted environment. GuardNN shows that the architecture and protection can be customized for a specific application to provide strong confidentiality and integrity guarantees with negligible overhead. The design of the GuardNN instruction set reduces the TCB to just the accelerator and allows confidentiality protection even when the instructions from a host cannot be trusted. GuardNN minimizes the overhead of memory encryption and integrity verification by customizing the off-chip memory protection for the known memory access patterns of a DNN accelerator. GuardNN is prototyped on an FPGA, demonstrating effective confidentiality protection with ~3% performance overhead for inference.
    On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity. (arXiv:2205.12642v1 [stat.ML])
    Most complex machine learning and modelling techniques are prone to over-fitting and may subsequently generalise poorly to future data. Artificial neural networks are no different in this regard and, despite having a level of implicit regularisation when trained with gradient descent, often require the aid of explicit regularisers. We introduce a new framework, Model Gradient Similarity (MGS), that (1) serves as a metric of regularisation, which can be used to monitor neural network training, (2) adds insight into how explicit regularisers, while derived from widely different principles, operate via the same mechanism underneath by increasing MGS, and (3) provides the basis for a new regularisation scheme which exhibits excellent performance, especially in challenging settings such as high levels of label noise or limited sample sizes.
    Convolutional Neural Processes for Inpainting Satellite Images. (arXiv:2205.12407v1 [cs.CV])
    The widespread availability of satellite images has allowed researchers to model complex systems such as disease dynamics. However, many satellite images have missing values due to measurement defects, which render them unusable without data imputation. For example, the scanline corrector for the LANDSAT 7 satellite broke down in 2003, resulting in a loss of around 20% of its data. Inpainting involves predicting what is missing based on the known pixels and is an old problem in image processing, classically based on PDEs or interpolation methods, but recent deep learning approaches have shown promise. However, many of these methods do not explicitly take into account the inherent spatiotemporal structure of satellite images. In this work, we cast satellite image inpainting as a natural meta-learning problem, and propose using convolutional neural processes (ConvNPs) where we frame each satellite image as its own task or 2D regression problem. We show ConvNPs can outperform classical methods and state-of-the-art deep learning inpainting models on a scanline inpainting problem for LANDSAT 7 satellite images, assessed on a variety of in- and out-of-distribution images.
    Learning from time-dependent streaming data with online stochastic algorithms. (arXiv:2205.12549v1 [cs.LG])
    We study stochastic algorithms in a streaming framework, trained on samples coming from a dependent data source. In this streaming framework, we analyze the convergence of Stochastic Gradient (SG) methods in a non-asymptotic manner; this includes various SG methods such as the well-known stochastic gradient descent (i.e., Robbins-Monro algorithm), mini-batch SG methods, together with their averaged estimates (i.e., Polyak-Ruppert averaged). Our results form a heuristic by linking the level of dependency and convexity to the rest of the model parameters. This heuristic provides new insights into choosing the optimal learning rate, which can help increase the stability of SG-based methods; these investigations suggest large streaming batches with slowly decaying learning rates for highly dependent data sources.
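    As a rough illustration of the setting above, the following sketch runs mini-batch stochastic gradient descent with Polyak-Ruppert averaging on a stream of dependent samples. The AR(1) data source, the squared-loss mean estimation task, and all parameter values (`lr0`, `decay`, `rho`, batch size) are assumptions chosen for the example, not settings from the paper.

```python
import random

def streaming_sgd(batches, lr0=0.5, decay=0.66):
    """Mini-batch SGD for a scalar mean under squared loss, with Polyak-Ruppert averaging."""
    theta, theta_avg = 0.0, 0.0
    for t, batch in enumerate(batches, start=1):
        # Gradient of 0.5 * (theta - x)^2 averaged over the batch.
        grad = sum(theta - x for x in batch) / len(batch)
        theta -= lr0 * t ** (-decay) * grad          # slowly decaying learning rate
        theta_avg += (theta - theta_avg) / t         # running average of the iterates
    return theta, theta_avg

rng = random.Random(0)

def ar1_stream(n_batches, batch_size, mean=3.0, rho=0.8):
    """Dependent data source: AR(1) noise around `mean` (hypothetical example)."""
    noise = 0.0
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            noise = rho * noise + rng.gauss(0.0, 1.0)
            batch.append(mean + noise)
        yield batch

theta_last, theta_avg = streaming_sgd(ar1_stream(n_batches=2000, batch_size=16))
```

    Increasing `batch_size` or lowering `decay` in this toy setup is one way to experiment with the large-batch, slowly-decaying-rate regime the abstract recommends for highly dependent data.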
    Stochastic Second-Order Methods Provably Beat SGD For Gradient-Dominated Functions. (arXiv:2205.12856v1 [cs.LG])
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying the gradient dominance property, which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that SCRN improves the best-known sample complexity of stochastic gradient descent in achieving an $\epsilon$-global optimum by a factor of $\mathcal{O}(\epsilon^{-1/2})$. Even under a weak version of the gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the sample complexity of SCRN can be improved by a factor of ${\mathcal{O}}(\epsilon^{-1/2})$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
    Learning the Travelling Salesperson Problem Requires Rethinking Generalization. (arXiv:2006.07054v6 [cs.LG] UPDATED)
    End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform close to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
    Hardness of Maximum Likelihood Learning of DPPs. (arXiv:2205.12377v1 [cs.CC])
    Determinantal Point Processes (DPPs) are a widely used probabilistic model for negatively correlated sets. DPPs have been successfully employed in Machine Learning applications to select a diverse, yet representative subset of data. In seminal work on DPPs in Machine Learning, Kulesza conjectured in his PhD Thesis (2011) that the problem of computing a maximum-likelihood DPP is NP-complete. The lack of a formal proof prompted Brunel, Moitra, Rigollet and Urschel (COLT 2017) to conjecture that, in opposition to Kulesza's conjecture, there exists a polynomial-time algorithm for computing a maximum-likelihood DPP. They also presented some preliminary evidence supporting their conjecture. In this work we prove Kulesza's conjecture. In fact, we prove the following stronger hardness of approximation result: even computing a $\left(1-O(\frac{1}{\log^9{N}})\right)$-approximation to the maximum log-likelihood of a DPP on a ground set of $N$ elements is NP-complete. At the same time, we also obtain the first polynomial-time algorithm that achieves a nontrivial worst-case approximation to the optimal log-likelihood: the approximation factor is $\frac{1}{(1+o(1))\log{m}}$ unconditionally (for data sets that consist of $m$ subsets), and can be improved to $1-\frac{1+o(1)}{\log N}$ if all $N$ elements appear in an $O(1/N)$-fraction of the subsets. In terms of techniques, we reduce approximating the maximum log-likelihood of DPPs on a data set to solving a gap instance of a "vector coloring" problem on a hypergraph. Such a hypergraph is built on a bounded-degree graph construction of Bogdanov, Obata and Trevisan (FOCS 2002), and is further enhanced by the strong expanders of Alon and Capalbo (FOCS 2007) to serve our purposes.
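    For readers unfamiliar with the objective whose maximization is shown to be hard, here is a minimal sketch of the log-likelihood of a data set under a DPP. It assumes the common L-ensemble parameterization, $P(S) = \det(L_S)/\det(L+I)$, and averages over the data set; the paper's exact normalization may differ, and the function name and the small kernel are illustrative only.

```python
import itertools
import numpy as np

def dpp_log_likelihood(L, subsets):
    """Average log-probability of `subsets` under the L-ensemble DPP with kernel L.

    P(S) = det(L_S) / det(L + I), where L_S is the submatrix of L indexed by S.
    """
    n = L.shape[0]
    _, log_norm = np.linalg.slogdet(L + np.eye(n))
    total = 0.0
    for subset in subsets:
        idx = list(subset)
        if idx:
            sign, log_det = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign <= 0:
                return float("-inf")   # subset has zero probability under this kernel
        else:
            log_det = 0.0              # determinant of the empty matrix is 1
        total += log_det - log_norm
    return total / len(subsets)

# Sanity check on a small positive definite kernel: probabilities of all subsets sum to 1.
B = np.array([[1.0, 0.2, 0.0], [0.3, 1.0, 0.1], [0.0, 0.4, 1.0]])
L = B @ B.T + 0.1 * np.eye(3)
all_subsets = [s for r in range(4) for s in itertools.combinations(range(3), r)]
total_probability = sum(np.exp(dpp_log_likelihood(L, [s])) for s in all_subsets)
```

    Evaluating this objective for a given kernel is cheap; the hardness result concerns finding the kernel $L$ that maximizes it for a given data set.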
    Action Recognition for American Sign Language. (arXiv:2205.12261v1 [cs.CV])
    In this research, we present our findings on recognizing American Sign Language from series of hand gestures. While most research in the literature focuses only on static handshapes, our work targets dynamic hand gestures. Since datasets of dynamic signs are scarce, we collect an initial dataset of 150 videos for 10 signs and an extension of 225 videos for 15 signs. We apply transfer learning models in combination with deep neural networks and background subtraction for videos in different temporal settings. Our preliminary results show that we can get an accuracy of $0.86$ and $0.71$, respectively, using DenseNet201 and LSTM with video sequences of 12 frames.
    Learning JPEG Compression Artifacts for Image Manipulation Detection and Localization. (arXiv:2108.12947v2 [eess.IV] UPDATED)
    Detecting and localizing image manipulation are necessary to counter malicious use of image editing techniques. Accordingly, it is essential to distinguish between authentic and tampered regions by analyzing intrinsic statistics in an image. We focus on JPEG compression artifacts left during image acquisition and editing. We propose a convolutional neural network (CNN) that uses discrete cosine transform (DCT) coefficients, where compression artifacts remain, to localize image manipulation. Standard CNNs cannot learn the distribution of DCT coefficients because the convolution throws away the spatial coordinates, which are essential for DCT coefficients. We illustrate how to design and train a neural network that can learn the distribution of DCT coefficients. Furthermore, we introduce Compression Artifact Tracing Network (CAT-Net) that jointly uses image acquisition artifacts and compression artifacts. It significantly outperforms traditional and deep neural network-based methods in detecting and localizing tampered regions.
    Learning to Model Editing Processes. (arXiv:2205.12374v1 [cs.CL])
    Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content: iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, that is, modeling the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multi-step edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits.
    AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models. (arXiv:2205.12410v1 [cs.CL])
    Fine-tuning large-scale pre-trained language models to downstream tasks requires updating hundreds of millions of parameters. This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation. Parameter-efficient techniques have been developed that tune small trainable components (e.g., adapters) injected in the large model while keeping most of the model weights frozen. The prevalent mechanism to increase adapter capacity is to increase the bottleneck dimension, which increases the adapter parameters. In this work, we introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques. (i) We introduce multiple shared adapter components in each layer of the Transformer architecture. We leverage sparse learning via random routing to update the adapter parameters (the encoder is kept frozen), resulting in the same amount of computational cost (FLOPs) as that of training a single adapter. (ii) We propose a simple merging mechanism to average the weights of multiple adapter components to collapse to a single adapter in each Transformer layer, thereby keeping the overall parameters the same but with significant performance improvement. We demonstrate that these techniques work well across multiple task settings including fully supervised and few-shot Natural Language Understanding tasks. By only tuning 0.23% of a pre-trained language model's parameters, our model outperforms the full model fine-tuning performance and several competing methods.
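    The merging step (ii) above, averaging the weights of several adapter components into one, can be sketched as follows. The bottleneck adapter layout (down-projection, ReLU, up-projection, residual connection), the dictionary keys, and the dimensions are assumptions for illustration and are not taken from the AdaMix code.

```python
import numpy as np

def merge_adapters(adapters):
    """Average a list of adapters (dicts of same-shaped arrays) into a single adapter."""
    keys = adapters[0].keys()
    return {k: np.mean([a[k] for a in adapters], axis=0) for k in keys}

def adapter_forward(x, adapter):
    """Assumed bottleneck adapter with residual connection: x + relu(x W_down) W_up."""
    hidden = np.maximum(x @ adapter["down"], 0.0)
    return x + hidden @ adapter["up"]

rng = np.random.default_rng(0)
d_model, d_bottleneck, n_adapters = 8, 2, 4   # hypothetical sizes

# Stand-ins for adapter components that would have been trained with random routing.
adapters = [
    {"down": rng.normal(size=(d_model, d_bottleneck)),
     "up": rng.normal(size=(d_bottleneck, d_model))}
    for _ in range(n_adapters)
]

merged = merge_adapters(adapters)
out = adapter_forward(rng.normal(size=(3, d_model)), merged)
```

    After merging, inference uses one adapter per layer, so the parameter count and compute match those of a single adapter regardless of how many components were trained.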
    Over-the-Air Design of GAN Training for mmWave MIMO Channel Estimation. (arXiv:2205.12445v1 [eess.SP])
    Future wireless systems are trending towards higher carrier frequencies that offer larger communication bandwidth but necessitate the use of large antenna arrays. Existing signal processing techniques for channel estimation do not scale well to this "high-dimensional" regime in terms of performance and pilot overhead. Meanwhile, training deep learning based approaches for channel estimation requires large labeled datasets mapping pilot measurements to clean channel realizations, which can only be generated offline using simulated channels. In this paper, we develop a novel unsupervised over-the-air (OTA) algorithm that utilizes noisy received pilot measurements to train a deep generative model to output beamspace MIMO channel realizations. Our approach leverages Generative Adversarial Networks (GAN), while using a conditional input to distinguish between Line-of-Sight (LOS) and Non-Line-of-Sight (NLOS) channel realizations. We also present a federated implementation of the OTA algorithm that distributes the GAN training over multiple users and greatly reduces the user side computation. We then formulate channel estimation from a limited number of pilot measurements as an inverse problem and reconstruct the channel by optimizing the input vector of the trained generative model. Our proposed approach significantly outperforms Orthogonal Matching Pursuit on both LOS and NLOS channel models, and EM-GM-AMP -- an Approximate Message Passing algorithm -- on LOS channel models, while achieving comparable performance on NLOS channel models in terms of the normalized channel reconstruction error. More importantly, our proposed framework has the potential to be trained online using real noisy pilot measurements, is not restricted to a specific channel model and can even be utilized for a federated OTA design of a dataset generator from noisy data.
    VeriFi: Towards Verifiable Federated Unlearning. (arXiv:2205.12709v1 [cs.CR])
    Federated learning (FL) is a collaborative learning paradigm where participants jointly train a powerful model without sharing their private data. One desirable property for FL is the implementation of the right to be forgotten (RTBF), i.e., a leaving participant has the right to request deletion of its private data from the global model. However, unlearning itself may not be enough to implement RTBF unless the unlearning effect can be independently verified, an important aspect that has been overlooked in the current literature. In this paper, we promote the concept of verifiable federated unlearning, and propose VeriFi, a unified framework integrating federated unlearning and verification that allows systematic analysis of the unlearning and quantification of its effect, with different combinations of multiple unlearning and verification methods. In VeriFi, the leaving participant is granted the right to verify (RTV), that is, the participant notifies the server before leaving, then actively verifies the unlearning effect in the next few communication rounds. The unlearning is done at the server side immediately after receiving the leaving notification, while the verification is done locally by the leaving participant via two steps: marking (injecting carefully-designed markers to fingerprint the leaver) and checking (examining the change of the global model's performance on the markers). Based on VeriFi, we conduct the first systematic and large-scale study for verifiable federated unlearning, considering 7 unlearning methods and 5 verification methods. In particular, we propose a more efficient and FL-friendly unlearning method, and two more effective and robust non-invasive verification methods. We extensively evaluate VeriFi on 7 datasets and 4 types of deep learning models. Our analysis establishes important empirical understandings for more trustworthy federated unlearning.
    A Kernel Stein Test for Comparing Latent Variable Models. (arXiv:1907.00586v4 [stat.ML] UPDATED)
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.
    MAPLE: Microprocessor A Priori for Latency Estimation. (arXiv:2111.15106v2 [cs.LG] UPDATED)
    Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption. As such, neural architecture search (NAS) algorithms take these two constraints into account when generating a new architecture. However, efficiency metrics such as latency are typically hardware dependent, requiring the NAS algorithm to either measure or predict the architecture latency. Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process. Here we propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation but instead generalizes to new hardware by incorporating prior hardware characteristics during training. MAPLE takes advantage of a novel quantitative strategy to characterize the underlying microprocessor by measuring relevant hardware performance metrics, yielding a fine-grained and expressive hardware descriptor. Moreover, MAPLE benefits from the tightly coupled I/O between the CPU and GPU and their dependency to predict DNN latency on GPUs while measuring microprocessor performance hardware counters from the CPU feeding the GPU hardware. Using this quantitative strategy as the hardware descriptor, MAPLE can generalize to new hardware via a few-shot adaptation strategy: with as few as 3 samples it exhibits a 6% improvement over state-of-the-art methods requiring as many as 10 samples. Experimental results show that increasing the few-shot adaptation samples to 10 improves the accuracy significantly over the state-of-the-art methods, by 12%. Furthermore, MAPLE exhibits 8-10% better accuracy, on average, compared to relevant baselines at any number of adaptation samples.
    When Is Partially Observable Reinforcement Learning Not Scary?. (arXiv:2204.08967v2 [cs.LG] UPDATED)
    Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
    ColdGuess: A General and Effective Relational Graph Convolutional Network to Tackle Cold Start Cases. (arXiv:2205.12318v1 [cs.LG])
    Low-quality listings and bad actor behavior on online retail websites threaten e-commerce businesses, as these result in a sub-optimal buying experience and erode customer trust. When a new listing is created, how can we tell whether it is of good quality? Is the method effective, fast, and scalable? Previous approaches often have three limitations/challenges: (1) they are unable to handle cold start problems where new sellers/listings lack sufficient selling histories; (2) they cannot score hundreds of millions of listings at scale, or they compromise performance for scalability; (3) they face space challenges from the large-scale graphs that come with giant e-commerce businesses. To overcome these limitations/challenges, we propose ColdGuess, an inductive graph-based risk predictor built upon a heterogeneous seller-product graph, which effectively identifies risky sellers/products/listings at scale. ColdGuess tackles the large-scale graph by consolidated nodes, and addresses the cold start problems using homogeneous influence. The evaluation on real data demonstrates that ColdGuess has stable performance as the number of unknown features increases. It outperforms LightGBM by up to 34 pcp ROC-AUC in a cold start case when a new seller sells a new product. The resulting system, ColdGuess, is effective, adaptable to changing risky seller behavior, and is already in production.
    Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks. (arXiv:2105.05553v7 [cs.LG] UPDATED)
    Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model's parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stopping and its connection to PCA, and why deep networks converge more slowly with random labels.
    Fast & Furious: Modelling Malware Detection as Evolving Data Streams. (arXiv:2205.12311v1 [cs.CR])
    Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drifts) that directly affect ML model detection rates. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (~130K apps) and AndroZoo (~350K apps). Android is a ubiquitous operating system for smartphones, which stimulates attackers to regularly create and update malware for the platform. We conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches. As a result, we observed that updating every component of the pipeline in response to concept drifts allows the classification model to achieve increasing detection rates as the data representation (extracted features) is updated. Furthermore, we discuss the impact of the changes on the classification models by comparing the variations in the extracted features.
    TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models. (arXiv:2205.12372v1 [cs.LG])
    We introduce torchNTK, a Python library to calculate the empirical neural tangent kernel (NTK) of neural network models in the PyTorch framework. We provide an efficient method to calculate the NTK of multilayer perceptrons. We compare the explicit differentiation implementation against autodifferentiation implementations, which have the benefit of extending the utility of the library to any architecture supported by PyTorch, such as convolutional networks. A feature of the library is that we expose the user to layerwise NTK components, and show that in some regimes a layerwise calculation is more memory efficient. We conduct preliminary experiments to demonstrate use cases for the software and probe the NTK.
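    For readers unfamiliar with the quantity, the empirical NTK between two inputs is the inner product of the model's parameter gradients at those inputs. The sketch below illustrates this with explicit (hand-derived) differentiation for a tiny one-hidden-layer tanh network. It does not use torchNTK and makes no claim about that library's API; the model, the parameter values, and the function names `grad_params` and `empirical_ntk` are all hypothetical.

```python
import math

def grad_params(x, w, v):
    """Gradient of f(x) = sum_j v[j] * tanh(w[j] * x) with respect to
    all parameters, ordered as (w_0..w_m, v_0..v_m)."""
    h = [math.tanh(wj * x) for wj in w]
    d_w = [vj * (1.0 - hj * hj) * x for vj, hj in zip(v, h)]  # chain rule through tanh
    d_v = h                                                   # f is linear in v
    return d_w + d_v

def empirical_ntk(xs, w, v):
    """Kernel matrix K[i][j] = <grad f(x_i), grad f(x_j)>."""
    grads = [grad_params(x, w, v) for x in xs]
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]

# Hypothetical parameter values for a 3-unit hidden layer.
w = [0.5, -1.0, 2.0]
v = [1.0, 0.3, -0.7]
K = empirical_ntk([-1.0, 0.0, 1.0], w, v)
for row in K:
    print(row)
```

    Grouping the gradient entries by layer (here, the `w` block and the `v` block) and summing the per-block inner products gives the same kernel, which is the sense in which an NTK decomposes into layerwise components.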
    Multi-Agent Low-Dimensional Linear Bandits. (arXiv:2007.01442v4 [cs.LG] UPDATED)
    We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. By distributing the search for the optimal subspace across users and learning of the unknown vector by each agent in the corresponding low-dimensional subspace, we show that the per-agent finite-time regret is much smaller than the case when agents do not communicate. We finally complement these results through simulations.
    EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures. (arXiv:2205.10390v2 [cs.LG] UPDATED)
    Protein complexes are macromolecules essential to the functioning and well-being of all living organisms. As the structure of a protein complex, in particular its region of interaction between multiple protein subunits (i.e., chains), has a notable influence on the biological function of the complex, computational methods that can quickly and effectively be used to refine and assess the quality of a protein complex's 3D structure can directly be used within a drug discovery pipeline to accelerate the development of new therapeutics and improve the efficacy of future vaccines. In this work, we introduce the Equivariant Graph Refiner (EGR), a novel E(3)-equivariant graph neural network (GNN) for multi-task structure refinement and assessment of protein complexes. Our experiments on new, diverse protein complex datasets, all of which we make publicly available in this work, demonstrate the state-of-the-art effectiveness of EGR for atomistic refinement and assessment of protein complexes and outline directions for future work in the field. In doing so, we establish a baseline for future studies in macromolecular refinement and structure analysis.
    K-12BERT: BERT for K-12 education. (arXiv:2205.12335v1 [cs.CL])
    Online education platforms are powered by various NLP pipelines, which utilize models like BERT to aid in content curation. Since the inception of the pre-trained language models like BERT, there have also been many efforts toward adapting these pre-trained models to specific domains. However, there has not been a model specifically adapted for the education domain (particularly K-12) across subjects to the best of our knowledge. In this work, we propose to train a language model on a corpus of data curated by us across multiple subjects from various sources for K-12 education. We also evaluate our model, K12-BERT, on downstream tasks like hierarchical taxonomy tagging.
    A comparative study of non-deep learning, deep learning, and ensemble learning methods for sunspot number prediction. (arXiv:2203.05757v2 [astro-ph.SR] UPDATED)
    Solar activity has significant impacts on human activities and health. One of the most commonly used measures of solar activity is the sunspot number. This paper compares three important non-deep learning models, four popular deep learning models, and their five ensemble models in forecasting sunspot numbers. In particular, we propose an ensemble model called XGBoost-DL, which uses XGBoost as a two-level nonlinear ensemble method to combine the deep learning models. Our XGBoost-DL achieves the best forecasting performance (RMSE = 25.70 and MAE = 19.82) in the comparison, outperforming the best non-deep learning model SARIMA (RMSE = 54.11 and MAE = 45.51), the best deep learning model Informer (RMSE = 29.90 and MAE = 22.35) and NASA's forecast (RMSE = 48.38 and MAE = 38.45). Our XGBoost-DL forecasts a peak sunspot number of 133.47 in May 2025 for Solar Cycle 25 and 164.62 in November 2035 for Solar Cycle 26, similar to but later than NASA's at 137.7 in October 2024 and 161.2 in December 2034. An open-source Python package of our XGBoost-DL for the sunspot number prediction is available at https://github.com/yd1008/ts_ensemble_sunspot.
    Should You Mask 15% in Masked Language Modeling?. (arXiv:2202.08005v2 [cs.CL] UPDATED)
    Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and even masking 80% can preserve most of the performance, as measured by finetuning on downstream tasks. Increasing the masking rates has two distinct effects, which we investigate through careful ablations: (1) A larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models with more capacity to tackle harder tasks in particular favor higher masking rates. We also find that even more sophisticated masking schemes such as span masking or PMI masking can benefit from higher masking rates, albeit to a smaller extent. Our results contribute to a better understanding of masked language modeling and shed light on more efficient language pre-training.
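    The two effects of a higher masking rate (less visible context, more prediction targets) can be seen in a minimal sketch of uniform random masking. This is a simplification and not the paper's code: it replaces every selected token with a mask symbol, whereas common practice also mixes in random and unchanged tokens. The `MASK` string and the `mask_tokens` helper are hypothetical names.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def mask_tokens(tokens, rate, seed=0):
    """Mask a fixed fraction of positions chosen uniformly at random.

    Returns the masked sequence and a dict {position: original token}
    holding the prediction targets."""
    rng = random.Random(seed)
    n_mask = round(rate * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets

tokens = [f"w{i}" for i in range(20)]  # stand-in for a 20-token input
for rate in (0.15, 0.40, 0.80):
    masked, targets = mask_tokens(tokens, rate)
    visible = len(masked) - masked.count(MASK)
    print(f"rate={rate:.2f}: {len(targets)} targets, {visible} visible context tokens")
```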
    Federated Self-supervised Learning for Heterogeneous Clients. (arXiv:2205.12493v1 [cs.LG])
    Federated Learning has become an important learning paradigm due to its privacy and computational benefits. As the field advances, two key challenges that still remain to be addressed are: (1) system heterogeneity - variability in the compute and/or data resources present on each client, and (2) lack of labeled data in certain federated settings. Several recent developments have tried to overcome these challenges independently. In this work, we propose a unified and systematic framework, Heterogeneous Self-supervised Federated Learning (Hetero-SSFL) for enabling self-supervised learning with federation on heterogeneous clients. The proposed framework allows collaborative representation learning across all the clients without imposing architectural constraints or requiring presence of labeled data. The key idea in Hetero-SSFL is to let each client train its unique self-supervised model and enable the joint learning across clients by aligning the lower dimensional representations on a common dataset. The entire training procedure could be viewed as self and peer-supervised as both the local training and the alignment procedures do not require presence of any labeled data. As in conventional self-supervised learning, the obtained client models are task independent and can be used for varied end-tasks. We provide a convergence guarantee of the proposed framework for non-convex objectives in heterogeneous settings and also empirically demonstrate that our proposed approach outperforms the state of the art methods by a significant margin.
    Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models. (arXiv:2205.12694v1 [cs.CL])
    Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models that is compatible with a wide variety of contemporary off-the-shelf hardware (unlike quantization), and that requires little additional training (unlike distillation). Pruning approaches typically take a large, accurate model as input, then attempt to discover a smaller subnetwork of that model capable of achieving end-task accuracy comparable to the full model. Inspired by previous work suggesting a connection between simpler, more generalizable models and those that lie within flat basins in the loss landscape, we propose to directly optimize for flat minima while performing task-specific pruning, which we hypothesize should lead to simpler parameterizations and thus more compressible models. In experiments combining sharpness-aware minimization with both iterative magnitude pruning and structured pruning approaches, we show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization when fine-tuning BERT models, leading to higher rates of compression with little to no loss in accuracy on the GLUE classification benchmark.
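    As background for the pruning step mentioned above, here is a minimal sketch of one round of unstructured magnitude pruning on a flat list of weights. It covers only the pruning criterion: the sharpness-aware optimizer, the retraining between rounds, and structured pruning are out of scope. The `magnitude_prune` helper and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(sparsity * len(weights))
    # Indices sorted by ascending absolute value; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4, -0.02, 0.9]  # made-up values
print(magnitude_prune(weights, 0.5))
```

    In iterative magnitude pruning this step is typically repeated with a gradually increasing sparsity target, with additional training between rounds.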
    Misleading Deep-Fake Detection with GAN Fingerprints. (arXiv:2205.12543v1 [cs.CV])
    Generative adversarial networks (GANs) have made remarkable progress in synthesizing realistic-looking images that effectively outsmart even humans. Although several detection methods can recognize these deep fakes by checking for image artifacts from the generation process, multiple counterattacks have demonstrated their limitations. These attacks, however, still require certain conditions to hold, such as interacting with the detection method or adjusting the GAN directly. In this paper, we introduce a novel class of simple counterattacks that overcomes these limitations. In particular, we show that an adversary can remove indicative artifacts, the GAN fingerprint, directly from the frequency spectrum of a generated image. We explore different realizations of this removal, ranging from filtering high frequencies to more nuanced frequency-peak cleansing. We evaluate the performance of our attack with different detection methods, GAN architectures, and datasets. Our results show that an adversary can often remove GAN fingerprints and thus evade the detection of generated images.
    Inception Transformer. (arXiv:2205.12956v1 [cs.CV])
    Recent studies show that the Transformer has a strong capability for building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism to adopt parallel convolution/max-pooling path and self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play a larger role in capturing high-frequency details while top layers play a larger role in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e. gradually decreasing the dimensions fed to the high-frequency mixer and increasing those to the low-frequency mixer, which can effectively trade off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S hits the top-1 accuracy of 83.4% on ImageNet-1K, much higher than DeiT-S by 3.6%, and even slightly better than the much bigger model Swin-B (83.3%) with only 1/4 parameters and 1/3 FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.
    Physics Guided Machine Learning for Variational Multiscale Reduced Order Modeling. (arXiv:2205.12419v1 [physics.flu-dyn])
    We propose a new physics guided machine learning (PGML) paradigm that leverages the variational multiscale (VMS) framework and available data to dramatically increase the accuracy of reduced order models (ROMs) at a modest computational cost. The hierarchical structure of the ROM basis and the VMS framework enable a natural separation of the resolved and unresolved ROM spatial scales. Modern PGML algorithms are used to construct novel models for the interaction among the resolved and unresolved ROM scales. Specifically, the new framework builds ROM operators that are closest to the true interaction terms in the VMS framework. Finally, machine learning is used to reduce the projection error and further increase the ROM accuracy. Our numerical experiments for a two-dimensional vorticity transport problem show that the novel PGML-VMS-ROM paradigm maintains the low computational cost of current ROMs, while significantly increasing the ROM accuracy.
    Deep Aesthetic Assessment and Retrieval of Breast Cancer Treatment Outcomes. (arXiv:2205.12611v1 [cs.CV])
    Treatments for breast cancer have continued to evolve and improve in recent years, resulting in a substantial increase in survival rates, with approximately 80% of patients having a 10-year survival period. Given the serious impact that breast cancer treatments can have on a patient's body image, consequently affecting her self-confidence and sexual and intimate relationships, it is paramount to ensure that women receive the treatment that optimizes both survival and aesthetic outcomes. Currently, there is no gold standard for evaluating the aesthetic outcome of breast cancer treatment. In addition, there is no standard way to show patients the potential outcome of surgery. The presentation of similar cases from the past would be extremely important to manage women's expectations of the possible outcome. In this work, we propose a deep neural network to perform the aesthetic evaluation. As a proof-of-concept, we focus on a binary aesthetic evaluation. Besides its use for classification, this deep neural network can also be used to find the most similar past cases by searching for nearest neighbours in the highly semantic space before classification. We performed the experiments on a dataset consisting of 143 photos of women after conservative treatment for breast cancer. The results for accuracy and balanced accuracy showed the superior performance of our proposed model compared to the state of the art in aesthetic evaluation of breast cancer treatments. In addition, the model showed a good ability to retrieve similar previous cases, with the retrieved cases having the same or adjacent class (in the 4-class setting) and having similar types of asymmetry. Finally, a qualitative interpretability assessment was also performed to analyse the robustness and trustworthiness of the model.
    Low-rank Optimal Transport: Approximation, Statistics and Debiasing. (arXiv:2205.12365v1 [stat.ML])
    The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in Scetbon et al. (2021) holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low-nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical bounds, relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.
    Ultra-compact Binary Neural Networks for Human Activity Recognition on RISC-V Processors. (arXiv:2205.12781v1 [cs.LG])
    Human Activity Recognition (HAR) is a relevant inference task in many mobile applications. State-of-the-art HAR at the edge is typically achieved with lightweight machine learning models such as decision trees and Random Forests (RFs), whereas deep learning is less common due to its high computational complexity. In this work, we propose a novel implementation of HAR based on deep neural networks, and specifically on Binary Neural Networks (BNNs), targeting low-power general purpose processors with a RISC-V instruction set. BNNs yield very small memory footprints and low inference complexity, thanks to the replacement of arithmetic operations with bit-wise ones. However, existing BNN implementations on general purpose processors impose constraints tailored to complex computer vision tasks, which result in over-parametrized models for simpler problems like HAR. Therefore, we also introduce a new BNN inference library, which targets ultra-compact models explicitly. With experiments on a single-core RISC-V processor, we show that BNNs trained on two HAR datasets obtain higher classification accuracy compared to a state-of-the-art baseline based on RFs. Furthermore, our BNN reaches the same accuracy as an RF with either less memory (up to 91%) or more energy-efficiency (up to 70%), depending on the complexity of the features extracted by the RF.
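    The replacement of arithmetic with bit-wise operations can be illustrated with the standard XOR/popcount identity for binarized dot products: for two vectors with entries in {-1, +1}, the dot product equals the length minus twice the number of positions where they differ. The sketch below is a generic illustration and not the inference library introduced in the paper; `pack_bits` and `binary_dot` are hypothetical names.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer: bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors given their packed bits."""
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * mismatches

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))             # multiply-accumulate version
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))    # bit-wise version
print(reference, fast)
```

    On real hardware the packed words would be fixed-width machine registers and the popcount a single instruction or a short bit-manipulation routine; Python integers are used here only for clarity.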
    VAEL: Bridging Variational Autoencoders and Probabilistic Logic Programming. (arXiv:2202.04178v2 [cs.PL] UPDATED)
    We present VAEL, a neuro-symbolic generative model integrating variational autoencoders (VAE) with the reasoning capabilities of probabilistic logic (L) programming. Besides standard latent subsymbolic variables, our model exploits a probabilistic logic program to define a further structured representation, which is used for logical reasoning. The entire process is end-to-end differentiable. Once trained, VAEL can solve new unseen generation tasks by (i) leveraging the previously acquired knowledge encoded in the neural component and (ii) exploiting new logical programs on the structured latent space. Our experiments provide support for the benefits of this neuro-symbolic integration both in terms of task generalization and data efficiency. To the best of our knowledge, this work is the first to propose a general-purpose end-to-end framework integrating probabilistic logic programming into a deep generative model.
    Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors. (arXiv:2205.12569v1 [cs.CR])
    As in other cybersecurity areas, machine learning (ML) techniques have emerged as a promising solution to detect Android malware. In this sense, many proposals employing a variety of algorithms and feature sets have been presented to date, often reporting impressive detection performances. However, the lack of reproducibility and the absence of a standard evaluation framework make these proposals difficult to compare. In this paper, we perform an analysis of 10 influential research works on Android malware detection using a common evaluation framework. We have identified five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models and their performances. In particular, we analyze the effect of (1) the presence of duplicated samples, (2) label (goodware/greyware/malware) attribution, (3) class imbalance, (4) the presence of apps that use evasion techniques and, (5) the evolution of apps. Based on this extensive experimentation, we conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results. Our findings also highlight that it is imperative to generate realistic datasets, taking into account the factors mentioned above, to enable the design and evaluation of better solutions for Android malware detection.
    Black box tests for algorithmic stability. (arXiv:2111.15546v2 [cs.LG] UPDATED)
    Algorithmic stability is a concept from learning theory that expresses the degree to which changes to the input data (e.g., removal of a single data point) may affect the outputs of a regression algorithm. Knowing an algorithm's stability properties is often useful for many downstream applications -- for example, stability is known to lead to desirable generalization properties and predictive inference guarantees. However, many modern algorithms currently used in practice are too complex for a theoretical analysis of their stability properties, and thus we can only attempt to establish these properties through an empirical exploration of the algorithm's behavior on various data sets. In this work, we lay out a formal statistical framework for this kind of "black box testing" without any assumptions on the algorithm or the data distribution, and establish fundamental bounds on the ability of any black box test to identify algorithmic stability.
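    The kind of empirical exploration described above can be sketched as a leave-one-out probe that treats the learning algorithm as a black box: retrain with each point removed and record how much a test prediction moves. This is only an illustration on a single dataset, not the formal test or the bounds from the paper; the least-squares learner, the data, and the names `fit_least_squares` and `loo_instability` are hypothetical.

```python
def fit_least_squares(xs, ys):
    """Fit y = a + b * x by ordinary least squares and return a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def loo_instability(algorithm, xs, ys, x_test):
    """Largest change in the prediction at x_test when one training point is removed.

    `algorithm` is used purely as a black box: (xs, ys) -> predictor."""
    full = algorithm(xs, ys)(x_test)
    changes = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        changes.append(abs(algorithm(xs_i, ys_i)(x_test) - full))
    return max(changes)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]  # made-up, roughly linear data
print(loo_instability(fit_least_squares, xs, ys, x_test=2.5))
```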
    Deletion and Insertion Tests in Regression Models. (arXiv:2205.12423v1 [cs.LG])
    A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of Petsiuk et al. (2018) are used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS). Exact computation of KS grows exponentially with dimension, while that of IG grows linearly with dimension. In two data sets including binary variables we find that KS is superior to IG in insertion and deletion tests, but only by a very small amount. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels. We show that IG will match KS when $f$ is an additive function plus a multilinear function of the variables. This includes a multilinear interpolation over the binary variables that would cause IG to have exponential cost in a naive implementation.
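    A minimal sketch of the basic deletion test may help: features are replaced by baseline values one at a time in the order given by an importance ranking, and the area under the resulting curve of $f$ is computed. A ranking that puts the most influential features first should make the curve drop quickly and give a smaller area. This shows only the plain AUC, not the paper's proposed regression criterion; the toy function `f`, the baseline, and the helper names are hypothetical.

```python
def deletion_curve(f, x, baseline, order):
    """Values of f as features of x are replaced by baseline values in the given order."""
    current = list(x)
    values = [f(current)]
    for j in order:
        current[j] = baseline[j]
        values.append(f(current))
    return values

def auc(values):
    """Trapezoidal area under the curve on an evenly spaced grid over [0, 1]."""
    n = len(values) - 1
    return sum((values[i] + values[i + 1]) / 2 for i in range(n)) / n

def f(z):
    # Toy additive function: feature 0 matters most, feature 2 least.
    return 3.0 * z[0] + 1.0 * z[1] + 0.5 * z[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
good_order = [0, 1, 2]  # most important feature deleted first
bad_order = [2, 1, 0]   # least important feature deleted first

print(auc(deletion_curve(f, x, baseline, good_order)))
print(auc(deletion_curve(f, x, baseline, bad_order)))
```

    The insertion test is the mirror image: start from the baseline and restore features in ranked order, where a good ranking yields a larger area.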
    VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection. (arXiv:2205.12424v1 [cs.CR])
    This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
    Generating Natural Language Proofs with Verifier-Guided Search. (arXiv:2205.12443v1 [cs.CL])
    Deductive reasoning (drawing conclusions from assumptions) is a challenging problem in NLP. In this work, we focus on proof generation: given a hypothesis and a set of supporting facts in natural language, the model generates a proof tree indicating how to deduce the hypothesis from supporting facts. Instead of generating the entire proof in one shot, prior work has demonstrated the promise of stepwise generation but achieved limited success on real-world data. Existing stepwise methods struggle to generate proof steps that are both valid and relevant. In this paper, we present a novel stepwise method NLProofS (Natural Language Proof Search), which learns to generate relevant steps conditioning on the hypothesis. At the core of our approach, we train an independent verifier to check the validity of proof steps. Instead of generating steps greedily, we search for proofs maximizing a global proof score judged by the verifier. NLProofS achieves state-of-the-art performance on EntailmentBank and RuleTaker. For example, it improves the percentage of correctly predicted proofs from 20.9% to 33.3% in the distractor setting of EntailmentBank. This is the first time stepwise methods have led to better generation of challenging human-authored proofs.
    Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech. (arXiv:2011.01174v5 [eess.AS] UPDATED)
    Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation and efficiently without increasing the inference time or model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.
    Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks. (arXiv:2204.02892v2 [cs.CL] UPDATED)
    The field of Natural Language Processing has experienced a dramatic leap in capabilities with the recent introduction of huge Language Models. Despite this success, natural language problems that involve several compounded steps are still practically unlearnable, even by the largest LMs. This complies with experimental failures for end-to-end learning of composite problems that were demonstrated in a variety of domains. An effective mitigation is to introduce intermediate supervision for solving sub-tasks of the compounded problem. Recently, several works have demonstrated high gains by taking a straightforward approach for incorporating intermediate supervision in compounded natural language problems: the sequence-to-sequence LM is fed with an augmented input, in which the decomposed tasks' labels are simply concatenated to the original input. In this paper, we prove a positive learning result that motivates these recent efforts. We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which on the one hand, are unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results. Beyond motivating contemporary empirical efforts for incorporating intermediate supervision in sequence-to-sequence language models, our positive theoretical result is the first of its kind in the landscape of results on the benefits of intermediate supervision for neural-network learning: Until now, all theoretical results on the subject are negative, i.e., show cases where learning is impossible without intermediate supervision, while our result is positive, showing that learning is facilitated in the presence of intermediate supervision.
    Beyond Impossibility: Balancing Sufficiency, Separation and Accuracy. (arXiv:2205.12327v1 [cs.LG])
    Among the various aspects of algorithmic fairness studied in recent years, the tension between satisfying both sufficiency and separation -- e.g., the ratios of positive or negative predictive values, and false positive or false negative rates across groups -- has received much attention. Following a debate sparked by COMPAS, a criminal justice predictive system, the academic community has responded by laying out important theoretical understanding, showing that one cannot achieve both with an imperfect predictor when there is no equal distribution of labels across the groups. In this paper, we shed more light on what might be still possible beyond the impossibility -- the existence of a trade-off means we should aim to find a good balance within it. After refining the existing theoretical result, we propose an objective that aims to balance sufficiency and separation measures, while maintaining similar accuracy levels. We show the use of such an objective in two empirical case studies, one involving a multi-objective framework, and the other fine-tuning of a model pre-trained for accuracy. We show promising results, where better trade-offs are achieved compared to existing alternatives.
    sat2pc: Estimating Point Cloud of Building Roofs from 2D Satellite Images. (arXiv:2205.12464v1 [cs.CV])
    Three-dimensional (3D) urban models have gained interest because of their applications in many use-cases such as urban planning and virtual reality. However, generating these 3D representations requires LiDAR data, which are not always readily available. Thus, the applicability of automated 3D model generation algorithms is limited to a few locations. In this paper, we propose sat2pc, a deep learning architecture that predicts the point cloud of a building roof from a single 2D satellite image. Our architecture combines Chamfer distance and EMD loss, resulting in better 2D to 3D performance. We extensively evaluate our model and perform ablation studies on a building roof dataset. Our results show that sat2pc was able to outperform existing baselines by at least 18.6%. Further, we show that the predicted point cloud captures more detail and geometric characteristics than other baselines.
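    For readers unfamiliar with the Chamfer distance mentioned above, here is a small sketch of one common variant: the mean squared nearest-neighbor distance, summed over both directions. Other conventions (unsquared distances, sums instead of means) exist, and the abstract does not say which one sat2pc uses, so this is an assumption.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty point clouds."""
    a_to_b = sum(min(squared_distance(p, q) for q in cloud_b) for p in cloud_a)
    b_to_a = sum(min(squared_distance(q, p) for p in cloud_a) for q in cloud_b)
    return a_to_b / len(cloud_a) + b_to_a / len(cloud_b)


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(predicted, reference)
```

    This brute-force version costs O(|A| * |B|) distance evaluations; practical implementations typically use vectorized or spatially indexed nearest-neighbor search.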
    NECA: Network-Embedded Deep Representation Learning for Categorical Data. (arXiv:2205.12752v1 [cs.LG])
    We propose NECA, a deep representation learning method for categorical data. Built upon the foundations of network embedding and deep unsupervised representation learning, NECA deeply embeds the intrinsic relationship among attribute values and explicitly expresses data objects with numeric vector representations. Designed specifically for categorical data, NECA can support important downstream data mining tasks, such as clustering. Extensive experimental analysis demonstrated the effectiveness of NECA.
    Toward Discovering Options that Achieve Faster Planning. (arXiv:2205.12515v1 [cs.LG])
    We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. For a given set of episodic tasks and a given number of options, the objective prefers options that can be used to achieve a high return by composing few options. By composing few options, fast planning can be achieved. When faced with new tasks similar to the given ones, the discovered options are also expected to accelerate planning. Our objective extends the objective proposed by Harb et al. (2018) for the single-task setting to the multi-task setting. A closer look at Harb et al.'s objective shows that the best options discovered given one task are not likely to be useful for future unseen tasks and that the multi-task setting is indeed necessary for this purpose. In the same paper, Harb et al. also proposed an algorithm to optimize their objective, and the algorithm can be naturally extended to the multi-task setting. We empirically show that in the four-room domain the extension does not achieve a high objective value and propose a new algorithm that better optimizes the proposed objective. In the same four-room domain, we show that 1) a higher objective value is typically associated with options with which fewer planning iterations are needed to achieve near-optimal performance, 2) our new algorithm achieves a high objective value, which is close to the value achieved by a set of human-designed options, 3) the best number of planning iterations given the discovered options is much smaller and matches that obtained given human-designed options, and 4) the options produced by our algorithm also make intuitive sense because they move to and terminate at cells near hallways connecting two neighboring rooms.
    Mathematical Models of Human Drivers Using Artificial Risk Fields. (arXiv:2205.12722v1 [cs.LG])
    In this paper, we use the concept of artificial risk fields to predict how human operators control a vehicle in response to upcoming road situations. A risk field assigns a non-negative risk measure to the state of the system in order to model how close that state is to violating a safety property, such as hitting an obstacle or exiting the road. Using risk fields, we construct a stochastic model of the operator that maps from states to likely actions. We demonstrate our approach on a driving task wherein human subjects are asked to drive a car inside a realistic driving simulator while avoiding obstacles placed on the road. We show that the most likely risk field given the driving data is obtained by solving a convex optimization problem. Next, we apply the inferred risk fields to generate distinct driving behaviors while comparing predicted trajectories against ground truth measurements. We observe that the risk fields are excellent at predicting future trajectory distributions with high prediction accuracy for up to twenty seconds prediction horizons. At the same time, we observe some challenges such as the inability to account for how drivers choose to accelerate/decelerate based on the road conditions.
    Image Colorization using U-Net with Skip Connections and Fusion Layer on Landscape Images. (arXiv:2205.12867v1 [cs.CV])
    We present a novel technique to automatically colorize grayscale images that combines the U-Net model and Fusion Layer features. This approach allows the model to learn the colorization of images from a pre-trained U-Net. Moreover, the Fusion Layer is applied to merge local information, which depends on small image patches, with global priors of an entire image on each class, forming visually more compelling colorization results. Finally, we validate our approach with a user study evaluation and compare it against the state of the art, resulting in improvements.
    Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization. (arXiv:2205.12648v1 [cs.LG])
    We tackle real-world problems with complex structures beyond pixel-based games or simulators. We formulate this as a few-shot reinforcement learning problem where a task is characterized by a subtask graph that defines a set of subtasks and their dependencies that are unknown to the agent. Unlike previous meta-RL methods that try to directly infer an unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks, and uses it as a prior to improve task inference at test time. Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents.
    Mitigating multiple descents: A model-agnostic framework for risk monotonization. (arXiv:2205.12937v1 [math.ST])
    Recent empirical and theoretical analyses of several commonly used prediction procedures reveal a peculiar risk behavior in high dimensions, referred to as double/multiple descent, in which the asymptotic risk is a non-monotonic function of the limiting aspect ratio of the number of features or parameters to the sample size. To mitigate this undesirable behavior, we develop a general framework for risk monotonization based on cross-validation that takes as input a generic prediction procedure and returns a modified procedure whose out-of-sample prediction risk is, asymptotically, monotonic in the limiting aspect ratio. As part of our framework, we propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting, respectively, and show that, under very mild assumptions, they provably achieve monotonic asymptotic risk behavior. Our results are applicable to a broad variety of prediction procedures and loss functions, and do not require a well-specified (parametric) model. We exemplify our framework with concrete analyses of the minimum $\ell_2$, $\ell_1$-norm least squares prediction procedures. As one of the ingredients in our analysis, we also derive novel additive and multiplicative forms of oracle risk inequalities for split cross-validation that are of independent interest.
    Lyapunov function approach for approximation algorithm design and analysis: with applications in submodular maximization. (arXiv:2205.12442v1 [math.OC])
    We propose a two-phase systematic framework for approximation algorithm design and analysis via Lyapunov functions. The first phase consists of using a Lyapunov function as a guideline to design a continuous-time algorithm with a provable approximation ratio. The second phase then converts the continuous-time algorithm to a discrete-time algorithm with the same approximation ratio and a provable time complexity. Some immediate benefits of the Lyapunov function approach include: (i) unifying many existing algorithms; (ii) providing a guideline to design and analyze new algorithms; and (iii) offering new perspectives to potentially improve existing algorithms. We use various submodular maximization problems as running examples to illustrate our framework.
    Conformal Prediction Intervals with Temporal Dependence. (arXiv:2205.12940v1 [stat.ML])
    Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time-series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure which is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time-series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
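    The sketch below shows standard split conformal prediction with absolute-residual scores, which is the kind of calibration-set procedure that gives cross-sectional validity under exchangeability. It is background only and is not the CPTD procedure itself; the function names and the toy calibration data are assumptions for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Returns infinity when the calibration set is too small for the target level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(prediction, calib_preds, calib_targets, alpha=0.1):
    """Symmetric interval around a point prediction from calibration residuals."""
    scores = [abs(y - y_hat) for y, y_hat in zip(calib_targets, calib_preds)]
    half_width = conformal_quantile(scores, alpha)
    return prediction - half_width, prediction + half_width


# Toy calibration set: the model always predicted 0, true targets were 1..9.
calib_preds = [0.0] * 9
calib_targets = [float(k) for k in range(1, 10)]
interval = prediction_interval(10.0, calib_preds, calib_targets, alpha=0.2)
```

    The interval width here is the same for every time step and every test series; the abstract indicates that CPTD improves on this by accounting for the temporal dimension, which this sketch does not attempt.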
    Imposing Gaussian Pre-Activations in a Neural Network. (arXiv:2205.12379v1 [cs.LG])
    The goal of the present work is to propose a way to modify both the initialization distribution of the weights of a neural network and its activation function, such that all pre-activations are Gaussian. We propose a family of pairs initialization/activation, where the activation functions span a continuum from bounded functions (such as Heaviside or tanh) to the identity function. This work is motivated by the contradiction between existing works dealing with Gaussian pre-activations: on one side, the works in the line of the Neural Tangent Kernels and the Edge of Chaos are assuming it, while on the other side, theoretical and experimental results challenge this hypothesis. The family of pairs initialization/activation we are proposing will help us to answer this hot question: is it desirable to have Gaussian pre-activations in a neural network?
    xFraud: Explainable Fraud Transaction Detection. (arXiv:2011.12193v3 [cs.LG] UPDATED)
    At online retail platforms, it is crucial to actively detect the risks of transactions to improve customer experience and minimize financial loss. In this work, we propose xFraud, an explainable fraud transaction prediction framework which is mainly composed of a detector and an explainer. The xFraud detector can effectively and efficiently predict the legitimacy of incoming transactions. Specifically, it utilizes a heterogeneous graph neural network to learn expressive representations from the informative heterogeneously typed entities in the transaction logs. The explainer in xFraud can generate meaningful and human-understandable explanations from graphs to facilitate further processes in the business unit. In our experiments with xFraud on real transaction networks with up to 1.1 billion nodes and 3.7 billion edges, xFraud is able to outperform various baseline models in many evaluation metrics while remaining scalable in distributed settings. In addition, we show that xFraud explainer can generate reasonable explanations to significantly assist the business analysis via both quantitative and qualitative evaluations.
    Bayesian Physics-Informed Neural Networks for real-world nonlinear dynamical systems. (arXiv:2205.08304v2 [cs.LG] UPDATED)
    Understanding real-world dynamical phenomena remains a challenging task. Across various scientific disciplines, machine learning has advanced as the go-to technology to analyze nonlinear dynamical systems, identify patterns in big data, and make decisions around them. Neural networks are now consistently used as universal function approximators for data with underlying mechanisms that are incompletely understood or exceedingly complex. However, neural networks alone ignore the fundamental laws of physics and often fail to make plausible predictions. Here we integrate data, physics, and uncertainties by combining neural networks, physics-informed modeling, and Bayesian inference to improve the predictive potential of traditional neural network models. We embed the physical model of a damped harmonic oscillator into a fully-connected feed-forward neural network to explore a simple and illustrative model system, the outbreak dynamics of COVID-19. Our Physics-Informed Neural Networks can seamlessly integrate data and physics, robustly solve forward and inverse problems, and perform well for both interpolation and extrapolation, even for a small amount of noisy and incomplete data. At only minor additional cost, they can self-adaptively learn the weighting between data and physics. Combined with Bayesian Neural Networks, they can serve as priors in Bayesian inference, and provide credible intervals for uncertainty quantification. Our study reveals the inherent advantages and disadvantages of Neural Networks, Bayesian Inference, and a combination of both and provides valuable guidelines for model selection. While we have only demonstrated these approaches for the simple model problem of a seasonal endemic infectious disease, we anticipate that the underlying concepts and trends generalize to more complex disease conditions and, more broadly, to a wide variety of nonlinear dynamical systems.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v2 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other.
    Learning dynamics from partial observations with structured neural ODEs. (arXiv:2205.12550v1 [eess.SY])
    Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. We propose a flexible framework to incorporate a broad spectrum of physical insight into neural ODE-based system identification, giving physical interpretability to the resulting latent space. This insight is either enforced through hard constraints in the optimization problem or added in its cost function. In order to link the partial and possibly noisy observations to the latent state, we rely on tools from nonlinear observer theory to build a recognition model. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.
    Heterogeneous Reservoir Computing Models for Persian Speech Recognition. (arXiv:2205.12594v1 [cs.SD])
    Over the last decade, deep-learning methods have been gradually incorporated into conventional automatic speech recognition (ASR) frameworks to create acoustic, pronunciation, and language models. Although this has led to significant improvements in ASR recognition accuracy, the hard constraints of these methods related to hardware requirements (e.g., computing power and memory usage) make it unclear whether such approaches are the most computationally- and energy-efficient options for embedded ASR applications. Reservoir computing (RC) models (e.g., echo state networks (ESNs) and liquid state machines (LSMs)), on the other hand, have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies. However, their performance in speech processing tasks is relatively inferior to that of the deep-learning-based models. To enhance the accuracy of RC in ASR applications, we propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales. To test our models, we performed a speech recognition task on the Farsdat Persian dataset. Since, to the best of our knowledge, standard RC has not yet been employed to conduct any Persian ASR tasks, we also trained conventional single-layer and deep ESNs to provide baselines for comparison. In addition, we compared the RC performance with a standard long short-term memory (LSTM) model. Heterogeneous RC models (1) show improved performance over the standard RC models; (2) perform on par with the LSTM in terms of recognition accuracy; and (3) reduce the training time considerably.
    FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. (arXiv:2205.12446v1 [cs.CL])
    We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
    MEKER: Memory Efficient Knowledge Embedding Representation for Link Prediction and Question Answering. (arXiv:2204.10629v2 [cs.CL] UPDATED)
    Knowledge Graphs (KGs) are symbolically structured storages of facts. The KG embedding contains concise data used in NLP tasks requiring implicit information about the real world. Furthermore, the size of KGs that may be useful in actual NLP assignments is enormous, and creating embeddings over them has memory cost issues. We represent a KG as a 3rd-order binary tensor and move beyond the standard CP decomposition by using a data-specific generalized version of it. The generalization of the standard CP-ALS algorithm allows obtaining optimization gradients without a backpropagation mechanism. It reduces the memory needed in training while providing computational benefits. We propose MEKER, a memory-efficient KG embedding model, which yields SOTA-comparable performance on link prediction tasks and KG-based Question Answering.
    Learn2Agree: Fitting with Multiple Annotators without Objective Ground Truth. (arXiv:2109.03596v2 [cs.LG] UPDATED)
    The annotation of domain experts is important for some medical applications where the objective ground truth is ambiguous to define, e.g., the rehabilitation for some chronic diseases, and the prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper uses of the annotations may hinder developing reliable models. On one hand, forcing the use of a single ground truth generated from multiple annotations is less informative for the modeling. On the other hand, feeding the model with all the annotations without proper regularization is noisy given existing disagreements. For such issues, we propose a novel Learning to Agreement (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without objective ground truth. The framework has two streams, with one stream fitting with the multiple annotators and the other stream learning agreement information between annotators. In particular, the agreement learning stream produces regularization information to the classifier stream, tuning its decision to be better in line with the agreement between annotators. The proposed method can be easily added to existing backbones, and experiments on two medical datasets show better agreement levels with annotators.
    Uncertainty Quantification for Transport in Porous media using Parameterized Physics Informed neural Networks. (arXiv:2205.12730v1 [cs.CE])
    We present a Parametrization of the Physics Informed Neural Network (P-PINN) approach to tackle the problem of uncertainty quantification in reservoir engineering problems. We demonstrate the approach with the immiscible two-phase flow displacement (Buckley-Leverett problem) in a heterogeneous porous medium. The reservoir properties (porosity, permeability) are treated as random variables. The distribution of these properties can affect dynamic properties such as the fluid saturation, front propagation speed or breakthrough time. We explore and use to our advantage the ability of networks to interpolate complex high-dimensional functions. We observe that the additional dimensions resulting from a stochastic treatment of the partial differential equations tend to produce smoother solutions on quantities of interest (distribution parameters), which is shown to improve the performance of PINNs. We show that, provided a proper parameterization of the uncertainty space, PINNs can produce solutions that closely match both the ensemble realizations and the stochastic moments. We demonstrate applications for both homogeneous and heterogeneous fields of properties. We are able to solve problems that can be challenging for classical methods. This approach gives rise to trained models that are more robust to variations in the input space and can compete in performance with traditional stochastic sampling methods.
    An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation. (arXiv:2205.12770v1 [cs.LG])
    Gradient descent or its variants are popular in training neural networks. However, in deep Q-learning with neural network approximation, a type of reinforcement learning, gradient descent (also known as Residual Gradient (RG)) is rarely used to solve the Bellman residual minimization problem. Instead, Temporal Difference (TD), an incomplete gradient descent method, prevails. In this work, we perform extensive experiments to show that TD outperforms RG, that is, when the training leads to a small Bellman residual error, the solution found by TD has a better policy and is more robust against the perturbation of neural network parameters. We further use experiments to reveal a key difference between reinforcement learning and supervised learning, that is, a small Bellman residual error can correspond to a bad policy in reinforcement learning while the test loss function in supervised learning is a standard index to indicate the performance. We also verify empirically that the missing term in TD is a key reason why RG performs badly. Our work shows that the performance of a deep Q-learning solution is closely related to the training dynamics, and that how an incomplete gradient descent method can find a good policy is an interesting question for future study.
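    To make the "missing term" concrete, the sketch below contrasts one TD (semi-gradient) step with one RG (full gradient) step on the squared Bellman residual. It assumes a linear state-value function V(s) = w . phi(s) rather than the deep Q-networks studied in the paper, and all names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bellman_residual(w, phi_s, phi_next, reward, gamma):
    """delta = r + gamma * V(s') - V(s) for a linear value function."""
    return reward + gamma * dot(w, phi_next) - dot(w, phi_s)


def td_update(w, phi_s, phi_next, reward, gamma, lr):
    """Semi-gradient step: the target r + gamma * V(s') is treated as a constant."""
    delta = bellman_residual(w, phi_s, phi_next, reward, gamma)
    return [wi + lr * delta * fs for wi, fs in zip(w, phi_s)]


def rg_update(w, phi_s, phi_next, reward, gamma, lr):
    """Full gradient step on 0.5 * delta**2, including the gamma * phi(s') term."""
    delta = bellman_residual(w, phi_s, phi_next, reward, gamma)
    return [
        wi - lr * delta * (gamma * fn - fs)
        for wi, fs, fn in zip(w, phi_s, phi_next)
    ]


w0 = [0.0, 0.0]
phi_s, phi_next = [1.0, 0.0], [0.0, 1.0]
w_td = td_update(w0, phi_s, phi_next, reward=1.0, gamma=0.9, lr=0.5)
w_rg = rg_update(w0, phi_s, phi_next, reward=1.0, gamma=0.9, lr=0.5)
```

    Both updates move the weight of the current state's feature identically; RG additionally pushes the next state's value down through the gamma * phi(s') term, which TD omits.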
    FBNETGEN: Task-aware GNN-based fMRI Analysis via Functional Brain Network Generation. (arXiv:2205.12465v1 [cs.LG])
    Functional magnetic resonance imaging (fMRI) is one of the most common imaging modalities to investigate brain functions. Recent studies in neuroscience stress the great potential of functional brain networks constructed from fMRI data for clinical predictions. Traditional functional brain networks, however, are noisy, unaware of downstream prediction tasks, and incompatible with deep graph neural network (GNN) models. In order to fully unleash the power of GNNs in network-based fMRI analysis, we develop FBNETGEN, a task-aware and interpretable fMRI analysis framework via deep brain network generation. In particular, we formulate (1) prominent region of interest (ROI) feature extraction, (2) brain network generation, and (3) clinical predictions with GNNs, in an end-to-end trainable model under the guidance of particular prediction tasks. Within this process, the key novel component is the graph generator, which learns to transform raw time-series features into task-oriented brain networks. Our learnable graphs also provide unique interpretations by highlighting prediction-related brain regions. Comprehensive experiments on two datasets, i.e., the recently released and currently largest publicly available fMRI dataset Adolescent Brain Cognitive Development (ABCD), and the widely-used fMRI dataset PNC, prove the superior effectiveness and interpretability of FBNETGEN. The implementation is available at https://github.com/Wayfear/FBNETGEN.
    Multi-Head Online Learning for Delayed Feedback Modeling. (arXiv:2205.12406v1 [cs.LG])
    In online advertising, it is highly important to predict the probability and the value of a conversion (e.g., a purchase). It not only impacts user experience by showing relevant ads, but also affects the ROI of advertisers and the revenue of marketplaces. Unlike clicks, which often occur within minutes after impressions, conversions are expected to happen over a long period of time (e.g., 30 days for online shopping). This creates a challenge, as the true labels are only available after long delays. Either inaccurate labels (partial conversions) are used, or models are trained on stale data (e.g., from 30 days ago). The problem is more pronounced in online learning, which focuses on live performance on the latest data. In this paper, a novel solution is presented to address this challenge using multi-head modeling. Unlike traditional methods, it directly quantizes conversions into multiple windows, such as day 1, day 2, day 3-7, and day 8-30. A sub-model is trained specifically on conversions within each window. Label freshness is maximally preserved in early models (e.g., day 1 and day 2), while late conversions are accurately utilized in models with longer delays (e.g., day 8-30). It is shown to greatly exceed the performance of known methods in online learning experiments for both conversion rate (CVR) and value per click (VPC) predictions. Lastly, as a general method for delayed feedback modeling, it can be combined with any advanced ML technique to further improve performance.
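    The window quantization described above can be sketched in a few lines. This is only an illustration: the window boundaries come from the example in the abstract, while the convention that delays are counted in whole days starting at 1, the per-head 0/1 label layout, and the names `WINDOWS`, `window_labels`, and `observable_heads` are assumptions made here, not details taken from the paper.

```python
# Inclusive (start_day, end_day) windows, following the example in the abstract.
# Assumption: day 1 means the first day after the click or impression.
WINDOWS = [(1, 1), (2, 2), (3, 7), (8, 30)]


def window_labels(delay_days):
    """Return one 0/1 label per head for a single ad event.

    delay_days is the conversion delay in days, or None if no conversion
    was observed. A conversion outside every window yields all zeros.
    """
    if delay_days is None:
        return [0] * len(WINDOWS)
    return [1 if start <= delay_days <= end else 0 for start, end in WINDOWS]


def observable_heads(age_days):
    """Flag the heads whose window has fully elapsed for an event of this age.

    Only these heads have final labels; the early heads become trainable
    after one or two days, which is what keeps their labels fresh.
    """
    return [age_days >= end for _, end in WINDOWS]


if __name__ == "__main__":
    print(window_labels(5))     # conversion on day 5 falls in the day 3-7 window
    print(observable_heads(7))  # after 7 days, the first three heads are final
```

    Under these assumptions, an event that is two days old can already update the day 1 and day 2 heads, while the day 8-30 head waits until its window has closed.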
    DPSNN: A Differentially Private Spiking Neural Network. (arXiv:2205.12718v1 [cs.NE])
    Privacy preservation is a key problem for machine learning algorithms. Spiking neural networks (SNNs) play an important role in many domains, such as image classification, object detection, and speech recognition, but research on the privacy protection of SNNs is urgently needed. This study combines the differential privacy (DP) algorithm with SNNs and proposes the differentially private spiking neural network (DPSNN). DP injects noise into the gradient, and the SNN transmits information in discrete spike trains, so that our differentially private SNN can maintain strong privacy protection while still ensuring high accuracy. We conducted experiments on MNIST, Fashion-MNIST, and the face recognition dataset Extended YaleB. When the privacy protection is strengthened, the accuracy of the artificial neural network (ANN) drops significantly, but our algorithm shows little change in performance. We also analyzed different factors that affect the privacy protection of SNNs. First, the less precise the surrogate gradient is, the better the privacy protection of the SNN. Second, Integrate-And-Fire (IF) neurons perform better than leaky Integrate-And-Fire (LIF) neurons. Third, a large time window contributes more to privacy protection and performance.
    Analytics of Business Time Series Using Machine Learning and Bayesian Inference. (arXiv:2205.12905v1 [cs.LG])
    In this survey, we consider case studies on sales time series forecasting, a deep learning approach to forecasting non-stationary time series using time trend correction, dynamic price and supply optimization using Q-learning, Bitcoin price modeling, the impact of the spread of COVID-19 on the stock market, and the use of social network signals in analytics. The use of machine learning and Bayesian inference in predictive analytics is analyzed.
    A Convergence Theory for Over-parameterized Variational Quantum Eigensolvers. (arXiv:2205.12481v1 [quant-ph])
    The Variational Quantum Eigensolver (VQE) is a promising candidate for quantum applications on near-term Noisy Intermediate-Scale Quantum (NISQ) computers. Despite a lot of empirical studies and recent progress in theoretical understanding of VQE's optimization landscape, the convergence for optimizing VQE is far less understood. We provide the first rigorous analysis of the convergence of VQEs in the over-parameterization regime. By connecting the training dynamics with the Riemannian Gradient Flow on the unit-sphere, we establish a threshold on the sufficient number of parameters for efficient convergence, which depends polynomially on the system dimension and the spectral ratio, a property of the problem Hamiltonian, and could be resilient to gradient noise to some extent. We further illustrate that this overparameterization threshold could be vastly reduced for specific VQE instances by establishing an ansatz-dependent threshold paralleling our main result. We showcase that our ansatz-dependent threshold could serve as a proxy of the trainability of different VQE ansatzes without performing empirical experiments, which hence leads to a principled way of evaluating ansatz design. Finally, we conclude with a comprehensive empirical study that supports our theoretical findings.
    Recipe2Vec: Multi-modal Recipe Representation Learning with Graph Neural Networks. (arXiv:2205.12396v1 [cs.LG])
    Learning effective recipe representations is essential in food studies. Unlike what has been developed for image-based recipe retrieval or learning structural text embeddings, the combined effect of multi-modal information (i.e., recipe images, text, and relation data) receives less attention. In this paper, we formalize the problem of multi-modal recipe representation learning to integrate the visual, textual, and relational information into recipe embeddings. In particular, we first present Large-RG, a new recipe graph dataset with over half a million nodes, making it the largest recipe graph to date. We then propose Recipe2Vec, a novel graph neural network based recipe embedding model to capture multi-modal information. Additionally, we introduce an adversarial attack strategy to ensure stable learning and improve performance. Finally, we design a joint objective function of node classification and adversarial learning to optimize the model. Extensive experiments demonstrate that Recipe2Vec outperforms state-of-the-art baselines on two classic food study tasks, i.e., cuisine category classification and region prediction. The dataset and code are available at https://github.com/meettyj/Recipe2Vec.
    Impartial Games: A Challenge for Reinforcement Learning. (arXiv:2205.12787v1 [cs.LG])
    The AlphaZero algorithm and its successor MuZero have revolutionised several competitive strategy games, including chess, Go, and shogi, as well as video games like Atari, by learning to play these games better than any human and any specialised computer program. Aside from knowing the rules, AlphaZero had no prior knowledge of each game. This dramatically advanced progress on a long-standing AI challenge to create programs that can learn for themselves from first principles. Theoretically, there are well-known limits to the power of deep learning for strategy games like chess, Go, and shogi, as they are known to be NEXPTIME hard. Some papers have argued that the AlphaZero methodology has limitations and is unsuitable for general AI. However, none of these works has suggested any specific limits for any particular game. In this paper, we provide more powerful bottlenecks than previously suggested. We present the first concrete example of a game, namely the (children's) game of nim, together with other impartial games, that seems to be a stumbling block for AlphaZero and similar reinforcement learning algorithms. We show experimentally that the bottlenecks apply to both the policy and value networks. Since solving nim can be done in linear time using logarithmic space, i.e., it has very low complexity, our experimental results supersede known theoretical limits based on many games' PSPACE (and NEXPTIME) completeness. We show that nim can be learned on small boards, but when the board size increases, AlphaZero-style algorithms rapidly fail to improve. We quantify the difficulties for various setups, parameter settings, and computational resources. Our results might help expand the AlphaZero self-play paradigm by allowing it to use meta-actions during training and/or actual game play, such as applying abstract transformations or reading from and writing to an external memory.
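    For context on why nim counts as a very low-complexity game, the classical solution fits in a few lines: a position is a win for the player to move exactly when the bitwise XOR of the heap sizes (the nim-sum) is non-zero. The sketch below assumes the normal-play convention (the player who takes the last object wins) and uses hypothetical function names; it illustrates the textbook strategy and is not code from the paper.

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """Bitwise XOR of all heap sizes, computed in one linear pass."""
    return reduce(xor, heaps, 0)


def winning_move(heaps):
    """Return (heap_index, new_size) for a winning move, or None.

    Assumes normal play: taking the last object wins. A winning move
    exists exactly when the nim-sum is non-zero, and it leaves the
    opponent in a position with nim-sum zero.
    """
    total = nim_sum(heaps)
    if total == 0:
        return None  # every move hands the opponent a winning position
    for index, size in enumerate(heaps):
        target = size ^ total
        if target < size:  # a legal move must shrink the heap
            return index, target
    raise AssertionError("unreachable: a non-zero nim-sum always admits a move")


if __name__ == "__main__":
    heaps = [3, 4, 5]
    move = winning_move(heaps)
    print("nim-sum:", nim_sum(heaps), "winning move:", move)
```

    The whole computation is a parity check over the binary digits of the heap sizes, which is the kind of exact, global function that the abstract reports to be hard for AlphaZero-style networks to learn as boards grow.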
    The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training. (arXiv:2205.12502v1 [cs.CV])
    Visual dialog (VisDial) is the task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude larger than that of VisDial (1.2M to 12.9M QA data). For robust training on the generated dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe strong performance gains in the low-data regime (up to 9.35 absolute points on NDCG).
    Investigating Information Inconsistency in Multilingual Open-Domain Question Answering. (arXiv:2205.12456v1 [cs.CL])
    Retrieval-based open-domain QA systems use retrieved documents and answer-span selection over retrieved documents to find best-answer candidates. We hypothesize that multilingual Question Answering (QA) systems are prone to information inconsistency when it comes to documents written in different languages, because these documents tend to provide a model with varying information about the same topic. To understand the effects of the biased availability of information and cultural influence, we analyze the behavior of multilingual open-domain question answering models with a focus on retrieval bias. We analyze whether different retriever models present different passages given the same question in different languages on TyDi QA and XOR-TyDi QA, two multilingual QA datasets. We speculate that the content differences in documents across languages might reflect cultural divergences and/or social biases.
    Graph-Based Similarity of Neural Network Representations. (arXiv:2111.11165v2 [cs.LG] UPDATED)
    Understanding the black-box representations in Deep Neural Networks (DNNs) is an essential problem in deep learning. In this work, we propose Graph-Based Similarity (GBS) to measure the similarity of layer features. Contrary to previous works that compute the similarity directly on the feature maps, GBS measures the correlation based on the graph constructed with hidden layer outputs. By treating each input sample as a node and the corresponding layer output similarity as edges, we construct the graph of DNN representations for each layer. The similarity between graphs of layers identifies the correspondences between representations of models trained on different datasets and with different initializations. We demonstrate and prove the invariance properties of GBS, including invariance to orthogonal transformation and invariance to isotropic scaling, and compare GBS with CKA. GBS shows state-of-the-art performance in reflecting the similarity and provides insights into the behavior of adversarial samples in the hidden layer space.
    The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning. (arXiv:2203.06498v7 [cs.LG] UPDATED)
    Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for the integration of statistical approaches to causal inference and predictive modeling. A deeper understanding of what reproducibility concerns in supervised ML research have in common with the replication crisis in experimental science puts the new concerns in perspective, and helps researchers avoid "the worst of both worlds," where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML. We identify themes that re-occur in reform discussions, like overreliance on asymptotic theory and non-credible beliefs about real-world data generating processes. We argue that in both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often impossible to refute due to undisclosed sources of variance in the learning pipeline. In particular, errors being acknowledged in ML expose cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role of human inductive biases in learning and reform.
    Trust-based Consensus in Multi-Agent Reinforcement Learning Systems. (arXiv:2205.12880v1 [cs.MA])
    An often neglected issue in multi-agent reinforcement learning (MARL) is the potential presence of unreliable agents in the environment whose deviations from expected behavior can prevent a system from accomplishing its intended tasks. In particular, consensus is a fundamental underpinning problem of cooperative distributed multi-agent systems. Consensus requires different agents, situated in a decentralized communication network, to reach an agreement out of a set of initial proposals that they put forward. Learning-based agents should adopt a protocol that allows them to reach consensus despite having one or more unreliable agents in the system. This paper investigates the problem of unreliable agents in MARL, considering consensus as a case study. Echoing established results in the distributed systems literature, our experiments show that even a moderate fraction of such agents can greatly impact the ability to reach consensus in a networked environment. We propose Reinforcement Learning-based Trusted Consensus (RLTC), a decentralized trust mechanism, in which agents can independently decide which neighbors to communicate with. We empirically demonstrate that our trust mechanism is able to deal with unreliable agents effectively, as evidenced by higher consensus success rates.
    MGX: Near-Zero Overhead Memory Protection for Data-Intensive Accelerators. (arXiv:2004.09679v2 [cs.CR] UPDATED)
    This paper introduces MGX, a near-zero overhead memory protection scheme for hardware accelerators. MGX minimizes the performance overhead of off-chip memory encryption and integrity verification by exploiting the application-specific properties of the accelerator execution. In particular, accelerators tend to explicitly manage data movement between on-chip and off-chip memories. Therefore, the general memory access pattern of an accelerator can largely be determined for a given application. Exploiting these characteristics, MGX generates version numbers used in memory encryption and integrity verification using on-chip accelerator state rather than storing them in the off-chip memory; it also customizes the granularity of the memory protection to match the granularity used by the accelerator. To demonstrate the efficacy of MGX, we present an in-depth study of MGX for DNN and graph algorithms. Experimental results show that on average, MGX lowers the performance overhead of memory protection from 28% and 33% to 4% and 5% for DNN and graph processing accelerators in a wide range of benchmarks, respectively.
    RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning. (arXiv:2205.12548v1 [cs.CL])
    Prompting has shown impressive success in enabling large pretrained language models (LMs) to perform diverse NLP tasks, especially when only a small amount of downstream data is available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompts (e.g., embeddings), which fall short in interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompts, on the other hand, are difficult to optimize, and are often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals from the large LM environment, we incorporate effective reward stabilization that substantially enhances the training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferable between different LMs and retain significant performance, indicating that LM prompting may not follow human language patterns.
    Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models. (arXiv:2205.12428v1 [cs.LG])
    Knowledge Distillation (KD) is a prominent neural model compression technique which heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network has been put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue has not been investigated in NLP. Therefore, this work studies different label regularization techniques and whether we actually need the teacher labels to fine-tune smaller PLM student networks on downstream tasks. To this end, we ran a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to the surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.
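    As a reminder of what a teacher-free label regularizer looks like, the sketch below implements standard label smoothing: the one-hot target is mixed with a uniform distribution over the K classes, and the usual cross-entropy is taken against the smoothed target. It is a generic illustration, not the authors' experimental setup; the smoothing weight `epsilon=0.1` and the function names are assumptions chosen for the example.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution.

    target[k] = (1 - epsilon) * one_hot[k] + epsilon / num_classes
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) * (1.0 if k == true_class else 0.0) + uniform
        for k in range(num_classes)
    ]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target, logits))


if __name__ == "__main__":
    target = smooth_labels(true_class=1, num_classes=4)
    print("smoothed target:", target)
    print("loss:", cross_entropy(target, [0.2, 2.0, -1.0, 0.5]))
```

    In KD the soft target would instead come from a teacher's predicted distribution; label smoothing replaces that teacher with a fixed uniform prior, which is why no second network has to be run during training.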
    Multimodal active speaker detection and virtual cinematography for video conferencing. (arXiv:2002.03977v3 [eess.AS] UPDATED)
    Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer based on subjective ratings on a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned using extensive crowdsourcing techniques and evaluated on a dataset with N=100 meetings, each 2-5 minutes in length.
    A Tree-Structured Multi-Task Model Recommender. (arXiv:2203.05092v2 [cs.LG] UPDATED)
    Tree-structured multi-task architectures have been employed to jointly tackle multiple vision tasks in the context of multi-task learning (MTL). The major challenge is to determine where to branch out for each task given a backbone model to optimize for both task accuracy and computation efficiency. To address the challenge, this paper proposes a recommender that, given a set of tasks and a convolutional neural network-based backbone model, automatically suggests tree-structured multi-task architectures that could achieve a high task performance while meeting a user-specified computation budget without performing model training. Extensive evaluations on popular MTL benchmarks show that the recommended architectures could achieve competitive task accuracy and computation efficiency compared with state-of-the-art MTL methods. Our tree-structured multi-task model recommender is open-sourced and available at https://github.com/zhanglijun95/TreeMTL.
    Exact Phase Transitions in Deep Learning. (arXiv:2205.12510v1 [cs.LG])
    This work reports deep-learning-unique first-order and second-order phase transitions, whose phenomenology closely follows that in statistical physics. In particular, we prove that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer. The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.
    Meta-Learning-Based Robust Adaptive Flight Control Under Uncertain Wind Conditions. (arXiv:2103.01932v3 [cs.RO] UPDATED)
    Real-time model learning proves challenging for complex dynamical systems, such as drones flying in variable wind conditions. Machine learning techniques such as deep neural networks have high representation power but are often too slow to update onboard. On the other hand, adaptive control relies on simple linear parameter models that can update as fast as the feedback control loop. We propose an online composite adaptation method that treats outputs from a deep neural network as a set of basis functions capable of representing different wind conditions. To help with training, meta-learning techniques are used to optimize the network output so that it is useful for adaptation. We validate our approach by flying a drone in an open-air wind tunnel under varying wind conditions and along challenging trajectories. We compare the results with other adaptive controllers that use different basis function sets and show improvements in tracking and prediction errors.
    Linear Algorithms for Nonparametric Multiclass Probability Estimation. (arXiv:2205.12460v1 [stat.ME])
    Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently, a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wang, Shen and Liu, 2008; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
    Memorization in NLP Fine-tuning Methods. (arXiv:2205.12506v1 [cs.CL])
    Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, the model head, and adapter) compare in terms of memorization risk. This presents increasing concern as the "pre-train and fine-tune" paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks.
    The Web Is Your Oyster -- Knowledge-Intensive NLP against a Very Large Web Corpus. (arXiv:2112.09924v2 [cs.CL] UPDATED)
    In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge - either factual or common sense, and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state of the art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v2 [cs.LG] UPDATED)
    Cellular sheaves equip graphs with a "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain state-of-the-art results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
    Training Language Models with Memory Augmentation. (arXiv:2205.12674v1 [cs.CL])
    Recent work has improved language models remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time, or represent them using a separately trained encoder -- resulting in sub-optimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training language models with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories -- local, long-term, and external memory -- at testing time. We evaluate our approach on multiple language modeling and machine translation benchmarks. We find that simply replacing the vanilla language modeling objective by ours greatly reduces the perplexity, without modifying the model architecture or incorporating extra context (e.g., 18.70 $\to$ 17.76 on WikiText-103). We further augment language models with long-range contexts and external knowledge and demonstrate significant gains over previous memory-augmented approaches.
    lpSpikeCon: Enabling Low-Precision Spiking Neural Network Processing for Efficient Unsupervised Continual Learning on Autonomous Agents. (arXiv:2205.12295v1 [cs.NE])
    Recent advances have shown that SNN-based systems can efficiently perform unsupervised continual learning due to their bio-plausible learning rule, e.g., Spike-Timing-Dependent Plasticity (STDP). Such learning capabilities are especially beneficial for use cases like autonomous agents (e.g., robots and UAVs) that need to continuously adapt to dynamically changing scenarios/environments, where new data gathered directly from the environment may have novel features that should be learned online. Current state-of-the-art works employ high-precision weights (i.e., 32 bit) for both training and inference phases, which pose high memory and energy costs thereby hindering efficient embedded implementations of such systems for battery-driven mobile autonomous systems. On the other hand, precision reduction may jeopardize the quality of unsupervised continual learning due to information loss. Towards this, we propose lpSpikeCon, a novel methodology to enable low-precision SNN processing for efficient unsupervised continual learning on resource-constrained autonomous agents/systems. Our lpSpikeCon methodology employs the following key steps: (1) analyzing the impacts of training the SNN model under unsupervised continual learning settings with reduced weight precision on the inference accuracy; (2) leveraging this study to identify SNN parameters that have a significant impact on the inference accuracy; and (3) developing an algorithm for searching the respective SNN parameter values that improve the quality of unsupervised continual learning. The experimental results show that our lpSpikeCon can reduce weight memory of the SNN model by 8x (i.e., by judiciously employing 4-bit weights) for performing online training with unsupervised continual learning and achieve no accuracy loss in the inference phase, as compared to the baseline model with 32-bit weights across different network sizes.
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v1 [cs.LG])
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
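    As a rough illustration of what a "barrier" on a linear path means, the sketch below evaluates a loss along the straight line between two parameter vectors. The barrier definition used here (largest excess of the loss over the linear interpolation of the endpoint losses) is one common convention and is an assumption, as are the toy loss functions `double_well` and `bowl`; none of this is taken from the paper's code.

```python
import numpy as np

def linear_path_barrier(theta_a, theta_b, loss_fn, num_points=11):
    """Largest excess of the loss over the line joining the endpoint losses."""
    theta_a = np.asarray(theta_a, dtype=float)
    theta_b = np.asarray(theta_b, dtype=float)
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    excess = []
    for alpha in np.linspace(0.0, 1.0, num_points):
        theta = (1.0 - alpha) * theta_a + alpha * theta_b   # point on the path
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b  # interpolated loss
        excess.append(loss_fn(theta) - baseline)
    return max(excess)

# Hypothetical toy losses: two separate basins vs. a single convex basin.
def double_well(theta):
    return float(np.sum((theta ** 2 - 1.0) ** 2))

def bowl(theta):
    return float(np.sum(theta ** 2))

print(linear_path_barrier([-1.0], [1.0], double_well))  # positive: basins are separated
print(linear_path_barrier([-1.0], [1.0], bowl))         # zero: no barrier on the path
```

    In this toy setting, a positive value corresponds to two models sitting in different basins, while a value of zero corresponds to a linearly connected pair.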
    FreDo: Frequency Domain-based Long-Term Time Series Forecasting. (arXiv:2205.12301v1 [cs.LG])
    The ability to forecast far into the future is highly beneficial to many applications, including but not limited to climatology, energy consumption, and logistics. However, due to noise or measurement error, it is questionable how far into the future one can reasonably predict. In this paper, we first mathematically show that due to error accumulation, sophisticated models might not outperform baseline models for long-term forecasting. To demonstrate, we show that a non-parametric baseline model based on periodicity can actually achieve comparable performance to a state-of-the-art Transformer-based model on various datasets. We further propose FreDo, a frequency domain-based neural network model that is built on top of the baseline model to enhance its performance and which greatly outperforms the state-of-the-art model. Finally, we validate that the frequency domain is indeed better by comparing univariate models trained in the frequency vs. time domain.
    Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models. (arXiv:2205.12746v1 [q-fin.CP])
    This article proposes and applies a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of their components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from Brazilian and American markets, in addition to algorithmic error metrics. The results were analyzed and compared to those of the Naïve forecast and the returns obtained by the buy & hold technique in the same period of time. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors, where in certain datasets the returns exceeded those of the buy & hold control model by two times and the Sharpe ratio by up to four times.
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v1 [cs.LG])
    We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
    CGX: Adaptive System Support for Communication-Efficient Deep Learning. (arXiv:2111.08617v4 [cs.DC] UPDATED)
    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order of magnitude price difference between "cloud-grade" servers with such support, relative to their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training and larger-scale multi-node training. CGX is based on two technical advances: At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users do not have to modify training recipes or make significant changes to training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
    Machine learning methods for Schlieren imaging of a plasma channel in tenuous atomic vapor. (arXiv:2205.12731v1 [physics.plasm-ph])
    We investigate the usage of a Schlieren imaging setup to measure the geometrical dimensions of a plasma channel in atomic vapor. Near resonant probe light is used to image the plasma channel in a tenuous vapor and machine learning techniques are tested for extracting quantitative information from the images. By building a database of simulated signals with a range of plasma parameters for training Deep Neural Networks, we demonstrate that they can extract from the Schlieren images reliably and with high accuracy the location, the radius and the maximum ionization fraction of the plasma channel as well as the width of the transition region between the core of the plasma channel and the unionized vapor. We test several different neural network architectures with supervised learning and show that the parameter estimations supplied by the networks are resilient with respect to slight changes of the experimental parameters that may occur in the course of a measurement.
    Scalable Online Change Detection for High-dimensional Data Streams. (arXiv:2205.12706v1 [cs.LG])
    Detecting changes in data streams is a core objective in their analysis and has applications in, say, predictive maintenance, fraud detection, and medicine. A principled approach to detect changes is to compare distributions observed within the stream to each other. However, data streams often are high-dimensional, and changes can be complex, e.g., only manifest themselves in higher moments. The streaming setting also imposes heavy memory and computation restrictions. We propose an algorithm, Maximum Mean Discrepancy Adaptive Windowing (MMDAW), which leverages the well-known Maximum Mean Discrepancy (MMD) two-sample test, and facilitates its efficient online computation on windows whose size it flexibly adapts. As MMD is sensitive to any change in the underlying distribution, our algorithm is a general-purpose non-parametric change detector that fulfills the requirements imposed by the streaming setting. Our experiments show that MMDAW achieves better detection quality than state-of-the-art competitors.
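    For readers unfamiliar with the MMD two-sample statistic that the abstract builds on, here is a minimal sketch of the biased (V-statistic) estimate of squared MMD between two windows. The Gaussian RBF kernel, the fixed `bandwidth`, and the synthetic windows are assumptions made for illustration; the adaptive windowing and efficient online updates of MMDAW are not reproduced here.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

# Synthetic 5-dimensional windows: one drawn from the same distribution, one shifted.
rng = np.random.default_rng(0)
window_old = rng.normal(0.0, 1.0, size=(200, 5))
window_same = rng.normal(0.0, 1.0, size=(200, 5))
window_shifted = rng.normal(1.0, 1.0, size=(200, 5))

print(mmd2_biased(window_old, window_same))     # small: no change
print(mmd2_biased(window_old, window_shifted))  # noticeably larger: change
```

    A change detector would compare such a statistic against a threshold; how that threshold and the window sizes are chosen is specific to the paper and is left out of this sketch.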
    First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization. (arXiv:2205.12381v1 [cs.LG])
    How can we train an assistive human-machine interface (e.g., an electromyography-based limb prosthesis) to translate a user's raw command signals into the actions of a robot or computer when there is no prior mapping, we cannot ask the user for supervision in the form of action labels or reward feedback, and we do not have prior knowledge of the tasks the user is trying to accomplish? The key idea in this paper is that, regardless of the task, when an interface is more intuitive, the user's commands are less noisy. We formalize this idea as a completely unsupervised objective for optimizing interfaces: the mutual information between the user's command signals and the induced state transitions in the environment. To evaluate whether this mutual information score can distinguish between effective and ineffective interfaces, we conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games. The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains, with an average Spearman's rank correlation of 0.43. In addition to offline evaluation of existing interfaces, we use our unsupervised objective to learn an interface from scratch: we randomly initialize the interface, have the user attempt to perform their desired tasks using the interface, measure the mutual information score, and update the interface to maximize mutual information through reinforcement learning. We evaluate our method through a user study with 12 participants who perform a 2D cursor control task using a perturbed mouse, and an experiment with one user playing the Lunar Lander game using hand gestures. The results show that we can learn an interface from scratch, without any user supervision or prior knowledge of tasks, in under 30 minutes.
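    The objective described above is a mutual information between command signals and induced state transitions. The abstract does not say how it is estimated, so the following is only a plug-in estimator for paired discrete samples, with hypothetical variable names, meant to show what quantity is being scored.

```python
import numpy as np

def mutual_information(xs, ys):
    """Plug-in mutual information estimate (in nats) from paired discrete samples."""
    _, x_idx = np.unique(np.asarray(xs), return_inverse=True)
    _, y_idx = np.unique(np.asarray(ys), return_inverse=True)
    x_idx, y_idx = x_idx.ravel(), y_idx.ravel()
    # Empirical joint distribution over (command, transition) pairs.
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Hypothetical data: commands that fully determine the transition vs. unrelated ones.
commands = [0, 1, 0, 1]
consistent_transitions = [0, 1, 0, 1]
unrelated_transitions = [0, 0, 1, 1]

print(mutual_information(commands, consistent_transitions))
print(mutual_information(commands, unrelated_transitions))
```

    In this toy example the consistent interface scores log 2 nats and the unrelated one scores zero, matching the intuition that a more intuitive interface yields less noisy commands.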
    An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation. (arXiv:2205.12753v1 [cs.CV])
    The performance of machine learning models under distribution shift has been the focus of the community in recent years. Most current methods have been proposed to improve the robustness to distribution shift from the algorithmic perspective, i.e., designing better training algorithms to help the generalization in shifted test distributions. This paper studies the distribution shift problem from the perspective of pre-training and data augmentation, two important factors in the practice of deep learning that have not been systematically investigated by existing work. By evaluating seven pre-trained models, including ResNets and ViTs with self-supervision and supervision mode, on five important distribution-shift datasets, from WILDS and DomainBed benchmarks, with five different learning algorithms, we provide the first comprehensive empirical study focusing on pre-training and data augmentation. With our empirical results obtained from 1,330 models, we provide the following main observations: 1) ERM combined with data augmentation can achieve state-of-the-art performance if we choose a proper pre-trained model respecting the data property; 2) specialized algorithms further improve the robustness on top of ERM when handling a specific type of distribution shift, e.g., GroupDRO for spurious correlation and CORAL for large-scale out-of-distribution data; 3) comparing different pre-training modes, architectures and data sizes, we provide novel observations about pre-training on distribution shift, which shed light on designing or selecting pre-training strategies for different kinds of distribution shifts. In summary, our empirical study provides a comprehensive baseline for a wide range of pre-training models fine-tuned with data augmentation, which potentially inspires research exploiting the power of pre-training and data augmentation in the future of distribution shift study.
    An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems. (arXiv:2205.12755v1 [cs.LG])
    Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, which adds the temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component to build the next generation of artificial intelligence. We propose an evolutionary method that can generate a large scale multitask model, and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as cifar10: 99.43%.
    MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge. (arXiv:2205.12660v1 [cs.LG])
    Deep neural network (DNN) latency characterization is a time-consuming process and adds significant cost to Neural Architecture Search (NAS) processes when searching for efficient convolutional neural networks for embedded vision applications. DNN latency is a hardware-dependent metric and requires direct measurement or inference on target hardware. A recently introduced latency estimation technique known as MAPLE predicts DNN execution time on previously unseen hardware devices by using hardware performance counters. Leveraging these hardware counters in the form of an implicit prior, MAPLE achieves state-of-the-art performance in latency prediction. Here, we propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency to better account for model stability and robustness. First, by identifying DNN architectures that exhibit a similar latency to each other, we can generate multiple virtual examples to significantly improve the accuracy over MAPLE. Secondly, the hardware specifications are used to determine the similarity between training and test hardware to emphasize training samples captured from comparable devices (domains) and encourage improved domain alignment. Experimental results using a convolutional neural network NAS benchmark across different types of devices, including an Intel processor that is now used for embedded vision applications, demonstrate a 5% improvement over MAPLE and 9% over HELP. Furthermore, we include ablation studies to independently assess the benefits of virtual examples and hardware-based sample importance.
    Additive Logistic Mechanism for Privacy-Preserving Self-Supervised Learning. (arXiv:2205.12430v1 [cs.LG])
    We study the privacy risks that are associated with training a neural network's weights with self-supervised learning algorithms. Through empirical evidence, we show that the fine-tuning stage, in which the network weights are updated with an informative and often private dataset, is vulnerable to privacy attacks. To address the vulnerabilities, we design a post-training privacy-protection algorithm that adds noise to the fine-tuned weights and propose a novel differential privacy mechanism that samples noise from the logistic distribution. Compared to the two conventional additive noise mechanisms, namely the Laplace and the Gaussian mechanisms, the proposed mechanism uses a bell-shaped distribution that resembles the distribution of the Gaussian mechanism, and it satisfies pure $\epsilon$-differential privacy similar to the Laplace mechanism. We apply membership inference attacks on both unprotected and protected models to quantify the trade-off between the models' privacy and performance. We show that the proposed protection algorithm can effectively reduce the attack accuracy to roughly 50% (equivalent to random guessing) while maintaining a performance loss below 5%.
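    The post-training step described above, adding logistic noise to fine-tuned weights, can be sketched as follows. This is only the sampling step: the `scale` argument is a free parameter here, and how it must be calibrated to the sensitivity and to $\epsilon$ to obtain a differential privacy guarantee is specific to the paper and is not reproduced. The weight array is a hypothetical stand-in.

```python
import numpy as np

def add_logistic_noise(weights, scale, rng):
    """Return a copy of `weights` perturbed by zero-mean logistic noise.

    NOTE: `scale` is uncalibrated in this sketch; no privacy guarantee is implied.
    """
    weights = np.asarray(weights, dtype=float)
    noise = rng.logistic(loc=0.0, scale=scale, size=weights.shape)
    return weights + noise

rng = np.random.default_rng(0)
fine_tuned = np.zeros((1000, 100))          # hypothetical fine-tuned weight matrix
protected = add_logistic_noise(fine_tuned, scale=0.1, rng=rng)

# The logistic distribution with scale s has variance s^2 * pi^2 / 3.
print(protected.var(), 0.1 ** 2 * np.pi ** 2 / 3)
```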
    PLAtE: A Large-scale Dataset for List Page Web Extraction. (arXiv:2205.12386v1 [cs.CL])
    Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items. PLAtE encompasses both the tasks of: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 53,905 items from 6,810 pages, making it the first large-scale list page web extraction dataset. We construct PLAtE by collecting list pages from Common Crawl, then annotating them on Mechanical Turk. Quantitative and qualitative analyses are performed to demonstrate PLAtE has high-quality annotations. We establish strong baseline performance on PLAtE with a SOTA model achieving an F1-score of 0.750 for attribute classification and 0.915 for segmentation, indicating opportunities for future research innovations in web extraction.
    A Universal Error Measure for Input Predictions Applied to Online Graph Problems. (arXiv:2205.12850v1 [cs.DS])
    We introduce a novel measure for quantifying the error in input predictions. The error is based on a minimum-cost hyperedge cover in a suitably defined hypergraph and provides a general template which we apply to online graph problems. The measure captures errors due to absent predicted requests as well as unpredicted actual requests; hence, predicted and actual inputs can be of arbitrary size. We achieve refined performance guarantees for previously studied network design problems in the online-list model, such as Steiner tree and facility location. Further, we initiate the study of learning-augmented algorithms for online routing problems, such as the traveling salesperson problem and dial-a-ride problem, where (transportation) requests arrive over time (online-time model). We provide a general algorithmic framework and we give error-dependent performance bounds that improve upon known worst-case barriers, when given accurate predictions, at the cost of slightly increased worst-case bounds when given predictions of arbitrary quality.
  • Open

    Algorithms for the Communication of Samples. (arXiv:2110.12805v3 [cs.IT] UPDATED)
    The efficient communication of noisy data has applications in several areas of machine learning, such as neural compression or differential privacy, and is also known as reverse channel coding or the channel simulation problem. Here we propose two new coding schemes with practical advantages over existing approaches. First, we introduce ordered random coding (ORC) which uses a simple trick to reduce the coding cost of previous approaches. This scheme further illuminates a connection between schemes based on importance sampling and the so-called Poisson functional representation. Second, we describe a hybrid coding scheme which uses dithered quantization to more efficiently communicate samples from distributions with bounded support.
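    The dithered quantization mentioned above can be illustrated with plain subtractive dithering, where sender and receiver share a random dither. This is a textbook sketch, not the paper's hybrid coding scheme; sharing randomness through a common seed (`shared_seed`) and the function names are assumptions made for the example.

```python
import numpy as np

def shared_dither(shared_seed, shape):
    """Dither in [-0.5, 0.5) that both sides can regenerate from the same seed."""
    return np.random.default_rng(shared_seed).uniform(-0.5, 0.5, size=shape)

def encode(x, shared_seed, step=1.0):
    """Sender: add the dither, then round to the integer grid."""
    x = np.asarray(x, dtype=float)
    u = shared_dither(shared_seed, x.shape)
    return np.round(x / step + u).astype(np.int64)

def decode(k, shared_seed, step=1.0):
    """Receiver: subtract the same dither from the received integers."""
    k = np.asarray(k)
    u = shared_dither(shared_seed, k.shape)
    return (k - u) * step

x = np.random.default_rng(1).normal(size=1000)
y = decode(encode(x, shared_seed=42, step=0.5), shared_seed=42, step=0.5)

# The reconstruction error never exceeds half a quantization step.
print(np.max(np.abs(y - x)))
```

    Only the integers produced by `encode` need to be transmitted; the receiver ends up with the input plus bounded noise that behaves like a uniform sample, which is the property that makes dithering useful for communicating samples.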
    A Neural Tangent Kernel Formula for Ensembles of Soft Trees with Arbitrary Architectures. (arXiv:2205.12904v1 [cs.LG])
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although it can have various tree architectures, the theoretical properties of their impact are not well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Deletion and Insertion Tests in Regression Models. (arXiv:2205.12423v1 [cs.LG])
    A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of Petsiuk et al. (2018) are used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS). Exact computation of KS grows exponentially with dimension, while that of IG grows linearly with dimension. In two data sets including binary variables we find that KS is superior to IG in insertion and deletion tests, but only by a very small amount. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels. We show that IG will match KS when $f$ is an additive function plus a multilinear function of the variables. This includes a multilinear interpolation over the binary variables that would cause IG to have exponential cost in a naive implementation.
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v3 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population into distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v2 [stat.ML] UPDATED)
    We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Learning Distributions by Generative Adversarial Networks: Approximation and Generalization. (arXiv:2205.12601v1 [cs.LG])
    We study how well generative adversarial networks (GANs) learn probability distributions from finite samples by analyzing the convergence rates of these models. Our analysis is based on a new oracle inequality that decomposes the estimation error of GANs into the discriminator and generator approximation errors, generalization error and optimization error. To estimate the discriminator approximation error, we establish error bounds on approximating Hölder functions by ReLU neural networks, with explicit upper bounds on the Lipschitz constant of the network or norm constraint on the weights. For generator approximation error, we show that neural networks can approximately transform a low-dimensional source distribution to a high-dimensional target distribution and bound such approximation error by the width and depth of the neural network. Combining the approximation results with generalization bounds of neural networks from statistical learning theory, we establish the convergence rates of GANs in various settings, when the error is measured by a collection of integral probability metrics defined through Hölder classes, including the Wasserstein distance as a special case. In particular, for distributions concentrated around a low-dimensional set, we show that the convergence rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension.
    Residuals-based distributionally robust optimization with covariate information. (arXiv:2012.01088v2 [math.OC] UPDATED)
    We consider data-driven approaches that integrate a machine learning prediction model within distributionally robust optimization (DRO) given limited joint observations of uncertain parameters and covariates. Our framework is flexible in the sense that it can accommodate a variety of regression setups and DRO ambiguity sets. We investigate asymptotic and finite sample properties of solutions obtained using Wasserstein, sample robust optimization, and phi-divergence-based ambiguity sets within our DRO formulations, and explore cross-validation approaches for sizing these ambiguity sets. Through numerical experiments, we validate our theoretical results, study the effectiveness of our approaches for sizing ambiguity sets, and illustrate the benefits of our DRO formulations in the limited data regime even when the prediction model is misspecified.
    A Kernel Stein Test for Comparing Latent Variable Models. (arXiv:1907.00586v4 [stat.ML] UPDATED)
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.
    Testing for Outliers with Conformal p-values. (arXiv:2104.08279v3 [stat.ME] UPDATED)
    This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are both valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Furthermore, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by numerical experiments on real and simulated data.
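    As background for the abstract above, the standard marginal conformal p-value can be computed in a few lines. The sketch assumes that nonconformity scores have already been produced by some outlier detector (larger meaning more outlying) and that the calibration scores come from held-out inlier data; the function name and the example scores are hypothetical, and the paper's conditionally calibrated p-values are not reproduced here.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Marginal conformal p-values; a larger score means a more outlying point."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    n = cal_scores.size
    # For each test point, count calibration scores at least as extreme.
    counts = (cal_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1.0 + counts) / (n + 1.0)

cal = [0.1, 0.2, 0.3, 0.4]      # hypothetical scores of held-out inliers
test = [0.05, 0.35, 1.0]        # hypothetical scores of new points
print(conformal_pvalues(cal, test))  # → [1.0, 0.4, 0.2]
```

    Because every test point is compared against the same calibration set, the resulting p-values are mutually dependent, which is the starting point for the multiple-testing analysis described in the abstract.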
    Probabilistic model-error assessment of deep learning proxies: an application to real-time inversion of borehole electromagnetic measurements. (arXiv:2205.12684v1 [physics.geo-ph])
    The advent of fast sensing technologies allows for real-time model updates in many applications where the model parameters are uncertain. Bayesian algorithms, such as ensemble smoothers, offer a real-time probabilistic inversion accounting for uncertainties. However, they rely on the repeated evaluation of the computational models, and deep neural network (DNN) based proxies can be useful to address this computational bottleneck. This paper studies the effects of the approximate nature of the deep learned models and associated model errors during the inversion of extra-deep borehole electromagnetic (EM) measurements, which are critical for geosteering. Using a deep neural network (DNN) as a forward model allows us to perform thousands of model evaluations within seconds, which is very useful for quantifying uncertainties and non-uniqueness in real-time. While significant efforts are usually made to ensure the accuracy of the DNN models, it is known that they contain unknown model errors in the regions not covered by the training data. When DNNs are utilized during inversion of EM measurements, the effects of the model errors could manifest themselves as a bias in the estimated input parameters and, consequently, might result in a low-quality geosteering decision. We present numerical results highlighting the challenges associated with the inversion of EM measurements while neglecting model error. We further demonstrate the utility of a recently proposed flexible iterative ensemble smoother in reducing the effect of model bias by capturing the unknown model errors, thus improving the quality of the estimated subsurface properties for geosteering operation. Moreover, we describe a procedure for identifying inversion multimodality and propose possible solutions to alleviate it in real-time.
    EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures. (arXiv:2205.10390v2 [cs.LG] UPDATED)
    Protein complexes are macromolecules essential to the functioning and well-being of all living organisms. As the structure of a protein complex, in particular its region of interaction between multiple protein subunits (i.e., chains), has a notable influence on the biological function of the complex, computational methods that can quickly and effectively be used to refine and assess the quality of a protein complex's 3D structure can directly be used within a drug discovery pipeline to accelerate the development of new therapeutics and improve the efficacy of future vaccines. In this work, we introduce the Equivariant Graph Refiner (EGR), a novel E(3)-equivariant graph neural network (GNN) for multi-task structure refinement and assessment of protein complexes. Our experiments on new, diverse protein complex datasets, all of which we make publicly available in this work, demonstrate the state-of-the-art effectiveness of EGR for atomistic refinement and assessment of protein complexes and outline directions for future work in the field. In doing so, we establish a baseline for future studies in macromolecular refinement and structure analysis.
    Additive Logistic Mechanism for Privacy-Preserving Self-Supervised Learning. (arXiv:2205.12430v1 [cs.LG])
    We study the privacy risks that are associated with training a neural network's weights with self-supervised learning algorithms. Through empirical evidence, we show that the fine-tuning stage, in which the network weights are updated with an informative and often private dataset, is vulnerable to privacy attacks. To address the vulnerabilities, we design a post-training privacy-protection algorithm that adds noise to the fine-tuned weights and propose a novel differential privacy mechanism that samples noise from the logistic distribution. Compared to the two conventional additive noise mechanisms, namely the Laplace and the Gaussian mechanisms, the proposed mechanism uses a bell-shaped distribution that resembles the distribution of the Gaussian mechanism, and it satisfies pure $\epsilon$-differential privacy similar to the Laplace mechanism. We apply membership inference attacks on both unprotected and protected models to quantify the trade-off between the models' privacy and performance. We show that the proposed protection algorithm can effectively reduce the attack accuracy to roughly 50\% (equivalent to random guessing) while maintaining a performance loss below 5\%.
    When Is Partially Observable Reinforcement Learning Not Scary?. (arXiv:2204.08967v2 [cs.LG] UPDATED)
    Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.
    Low-rank Optimal Transport: Approximation, Statistics and Debiasing. (arXiv:2205.12365v1 [stat.ML])
    The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e.g. single-cell genomics) or used to improve more complex methods (e.g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in Scetbon et al. (2021) holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical bounds, relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.
    Fast calculation of Gaussian Process multiple-fold cross-validation residuals and their covariances. (arXiv:2101.03108v2 [stat.ME] UPDATED)
    We generalize fast Gaussian process leave-one-out formulae to multiple-fold cross-validation, highlighting in turn in broad settings the covariance structure of cross-validation residuals. The employed approach, which relies on block matrix inversion via Schur complements, is applied to both Simple and Universal Kriging frameworks. We illustrate how resulting covariances affect model diagnostics and how to properly transform residuals in the first place. Beyond that, we examine how accounting for dependency between such residuals affects cross-validation-based estimation of the scale parameter. It is found in two distinct cases, namely in scale estimation and in broader covariance parameter estimation via pseudo-likelihood, that correcting for covariances between cross-validation residuals leads back to maximum likelihood estimation or to an original variation thereof. The proposed fast calculation of Gaussian Process multiple-fold cross-validation residuals is implemented and benchmarked against a naive implementation, all in the R language. Numerical experiments highlight the accuracy of our approach as well as the substantial speed-ups that it enables. It is noticeable however, as supported by a discussion on the main drivers of computational costs and by a dedicated numerical benchmark, that speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. Overall, our results enable fast multiple-fold cross-validation, have direct consequences in GP model diagnostics, and pave the way to future work on hyperparameter fitting as well as on the promising field of goal-oriented fold design.
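The block-inversion idea behind fast multiple-fold residuals can be sketched for the simplest case, a zero-mean GP (Simple Kriging with known mean) and a fixed covariance matrix. Writing Q = K^{-1}, the Schur complement identity gives the residual on a fold I as e_I = (Q_{II})^{-1} (Q y)_I, so K is inverted only once. The sketch below is written in Python with NumPy rather than the R used by the authors; the function names, the squared-exponential kernel, the nugget value, and the toy data are all assumptions made for illustration, and Universal Kriging is not covered.

```python
import numpy as np


def fast_fold_residuals(K, y, folds):
    """Cross-validation residuals for a zero-mean GP with covariance matrix K."""
    Q = np.linalg.inv(K)  # precision matrix, computed once for all folds
    Qy = Q @ y
    # For a fold I, the residual is e_I = (Q[I, I])^{-1} (Q y)[I].
    return [np.linalg.solve(Q[np.ix_(I, I)], Qy[I]) for I in folds]


def naive_fold_residuals(K, y, folds):
    """Reference implementation: re-solve on the complement of each fold."""
    n = len(y)
    residuals = []
    for I in folds:
        rest = np.setdiff1d(np.arange(n), I)
        pred = K[np.ix_(I, rest)] @ np.linalg.solve(K[np.ix_(rest, rest)], y[rest])
        residuals.append(y[I] - pred)
    return residuals


rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 5.0, size=12))
# Squared-exponential covariance plus a small nugget (illustrative choice).
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-2 * np.eye(12)
y = np.sin(x) + 0.1 * rng.standard_normal(12)
folds = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

fast = fast_fold_residuals(K, y, folds)
naive = naive_fold_residuals(K, y, folds)
# Largest absolute disagreement between the two computations.
max_gap = max(np.max(np.abs(f - g)) for f, g in zip(fast, naive))
```

The fast version solves one small system per fold, whose size is the fold size, whereas the naive version solves a system on the complement of each fold.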
    Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit. (arXiv:1905.13654v11 [stat.ML] UPDATED)
    Recent work by Jacot et al. (2018) has shown that training a neural network using gradient descent in parameter space is related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model when the network width is large. Indeed, under regularity conditions, the NTK converges to a time-independent kernel in the infinite-width limit. This regime is often called the NTK regime. In parallel, recent works on signal propagation (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) studied the impact of the initialization and the activation function on signal propagation in deep neural networks. In this paper, we connect these two theories by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we provide a comprehensive analysis of the convergence rates of the NTK regime to the infinite depth regime.
    Clustering consistency with Dirichlet process mixtures. (arXiv:2205.12924v1 [math.ST])
    Dirichlet process mixtures are flexible non-parametric models, particularly suited to density estimation and probabilistic clustering. In this work we study the posterior distribution induced by Dirichlet process mixtures as the sample size increases, and more specifically focus on consistency for the unknown number of clusters when the observed data are generated from a finite mixture. Crucially, we consider the situation where a prior is placed on the concentration parameter of the underlying Dirichlet process. Previous findings in the literature suggest that Dirichlet process mixtures are typically not consistent for the number of clusters if the concentration parameter is held fixed and data come from a finite mixture. Here we show that consistency for the number of clusters can be achieved if the concentration parameter is adapted in a fully Bayesian way, as commonly done in practice. Our results are derived for data coming from a class of finite mixtures, with mild assumptions on the prior for the concentration parameter and for a variety of choices of likelihood kernels for the mixture.
    Tell me why! Explanations support learning relational and causal structure. (arXiv:2112.03753v3 [cs.LG] UPDATED)
    Inferring the abstract relational and causal structure of the world is a major challenge for reinforcement-learning (RL) agents. For humans, language--particularly in the form of explanations--plays a considerable role in overcoming this challenge. Here, we show that language can play a similar role for deep RL agents in complex environments. While agents typically struggle to acquire relational and causal knowledge, augmenting their experience by training them to predict language descriptions and explanations can overcome these limitations. We show that language can help agents learn challenging relational tasks, and examine which aspects of language contribute to its benefits. We then show that explanations can help agents to infer not only relational but also causal structure. Language can shape the way that agents generalize out-of-distribution from ambiguous, causally-confounded training, and explanations even allow agents to learn to perform experimental interventions to identify causal relationships. Our results suggest that language description and explanation may be powerful tools for improving agent learning and generalization.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v1 [cs.LG])
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume a certain amount of resource from each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that, due to the presence of the constraints, the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v1 [stat.ML])
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on public data, then using private data only for "transfer learning." In particular, we minimize the maximum mean discrepancy (MMD) between private target data and the generated distribution, using a kernel based on perceptual features from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images faithfully with $\varepsilon \approx 2$, far surpassing the current state of the art, which only models MNIST and FashionMNIST at $\varepsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.
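For readers unfamiliar with the maximum mean discrepancy, the following sketch computes the standard biased (V-statistic) estimate of squared MMD between two samples. It is only an illustration of the quantity being minimized: it uses a Gaussian kernel on raw inputs rather than the perceptual-feature kernel described in the abstract, it adds no privacy noise, and the function names, bandwidth, and toy data are assumptions.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


rng = np.random.default_rng(0)
target = rng.standard_normal((200, 2))           # stand-in for target data
close = rng.standard_normal((200, 2))            # same distribution as target
shifted = rng.standard_normal((200, 2)) + 3.0    # clearly different distribution

mmd_close = mmd_squared(target, close)
mmd_shifted = mmd_squared(target, shifted)
```

In this toy setup the estimate is near zero for the matching sample and much larger for the shifted one, which is the signal a generator trained by MMD minimization would follow.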
    Imposing Gaussian Pre-Activations in a Neural Network. (arXiv:2205.12379v1 [cs.LG])
    The goal of the present work is to propose a way to modify both the initialization distribution of the weights of a neural network and its activation function, such that all pre-activations are Gaussian. We propose a family of pairs initialization/activation, where the activation functions span a continuum from bounded functions (such as Heaviside or tanh) to the identity function. This work is motivated by the contradiction between existing works dealing with Gaussian pre-activations: on one side, the works in the line of the Neural Tangent Kernels and the Edge of Chaos are assuming it, while on the other side, theoretical and experimental results challenge this hypothesis. The family of pairs initialization/activation we are proposing will help us to answer this hot question: is it desirable to have Gaussian pre-activations in a neural network?
    Understanding Programmatic Weak Supervision via Source-aware Influence Function. (arXiv:2205.12879v1 [cs.LG])
    Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. To achieve this, we build on Influence Function (IF) and propose source-aware IF, which leverages the generation process of the probabilistic labels to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual components of PWS, such as source vote, supervision source, and training data. On datasets of diverse domains, we demonstrate multiple use cases: (1) interpreting incorrect predictions from multiple angles, which reveals insights for debugging the PWS pipeline, (2) identifying mislabeling of sources with a gain of 9%-37% over baselines, and (3) improving the end model's generalization performance by removing harmful components in the training objective (13%-24% better than ordinary IF).
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v1 [stat.ML])
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that the existence of an exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via a \emph{transport-entropy inequality} (T-E). These results have significant impact in \emph{reinforcement learning} (RL) and \emph{controls}, leading to exponential concentration inequalities even for unbounded observables, while assuming neither reversibility nor exact knowledge of the random dynamical system (assumptions at the heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    Learning from time-dependent streaming data with online stochastic algorithms. (arXiv:2205.12549v1 [cs.LG])
    We study stochastic algorithms in a streaming framework, trained on samples coming from a dependent data source. In this streaming framework, we analyze the convergence of Stochastic Gradient (SG) methods in a non-asymptotic manner; this includes various SG methods such as the well-known stochastic gradient descent (i.e., Robbins-Monro algorithm), mini-batch SG methods, together with their averaged estimates (i.e., Polyak-Ruppert averaged). Our results form a heuristic by linking the level of dependency and convexity to the rest of the model parameters. This heuristic provides new insights into choosing the optimal learning rate, which can help increase the stability of SG-based methods; these investigations suggest large streaming batches with slow decaying learning rates for highly dependent data sources.
    Conformal Prediction Intervals with Temporal Dependence. (arXiv:2205.12940v1 [stat.ML])
    Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time-series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure which is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time-series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
    Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models. (arXiv:2205.12746v1 [q-fin.CP])
    This article aims to propose and apply a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of their components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from Brazilian and American markets, in addition to algorithmic error metrics. The results were analyzed and compared to those of the naive forecast and the returns obtained by the buy & hold technique in the same period of time. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors, where in certain datasets the returns exceeded by two times and the Sharpe ratio by up to four times those of the buy & hold control model.
    Mitigating multiple descents: A model-agnostic framework for risk monotonization. (arXiv:2205.12937v1 [math.ST])
    Recent empirical and theoretical analyses of several commonly used prediction procedures reveal a peculiar risk behavior in high dimensions, referred to as double/multiple descent, in which the asymptotic risk is a non-monotonic function of the limiting aspect ratio of the number of features or parameters to the sample size. To mitigate this undesirable behavior, we develop a general framework for risk monotonization based on cross-validation that takes as input a generic prediction procedure and returns a modified procedure whose out-of-sample prediction risk is, asymptotically, monotonic in the limiting aspect ratio. As part of our framework, we propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting, respectively, and show that, under very mild assumptions, they provably achieve monotonic asymptotic risk behavior. Our results are applicable to a broad variety of prediction procedures and loss functions, and do not require a well-specified (parametric) model. We exemplify our framework with concrete analyses of the minimum $\ell_2$, $\ell_1$-norm least squares prediction procedures. As one of the ingredients in our analysis, we also derive novel additive and multiplicative forms of oracle risk inequalities for split cross-validation that are of independent interest.
    Linear Algorithms for Nonparametric Multiclass Probability Estimation. (arXiv:2205.12460v1 [stat.ME])
    Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wang, Shen and Liu, 2008; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v3 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework targets the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where exact inference of latent variables is typically intractable. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space often lacks physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification of nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential for enhancing and generalizing the predictive capabilities of deep learning-based models.
    Multi-Agent Low-Dimensional Linear Bandits. (arXiv:2007.01442v4 [cs.LG] UPDATED)
    We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector $\theta^* \in \mathbb{R}^d$. The side information consists of a finite collection of low-dimensional subspaces, one of which contains $\theta^*$. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. By distributing the search for the optimal subspace across users and learning of the unknown vector by each agent in the corresponding low-dimensional subspace, we show that the per-agent finite-time regret is much smaller than the case when agents do not communicate. We finally complement these results through simulations.
    Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel. (arXiv:2106.09387v3 [math.ST] UPDATED)
    Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels), do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.
    Learning the Travelling Salesperson Problem Requires Rethinking Generalization. (arXiv:2006.07054v6 [cs.LG] UPDATED)
    End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zero-shot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v2 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity. (arXiv:2104.13790v3 [cs.LG] UPDATED)
    AdaBelief, one of the current best optimizers, demonstrates superior generalization ability compared to the popular Adam algorithm by viewing the exponential moving average of observed gradients. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains however an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that aims to exploit strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size so that it better accounts for strong convexity and prevents fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependent $O(\log T)$ regret bound, which is substantially lower than that of AdaBelief. On the empirical side, we validate our theoretical analysis with extensive experiments in both scenarios of strong and non-strong convexity on three popular baseline models. Experimental results are very encouraging: FastAdaBelief converges the quickest in comparison to all mainstream algorithms while maintaining an excellent generalization ability, in cases of both strong and non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v1 [cs.LG])
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v1 [cs.LG])
    Learning causal structure poses a combinatorial search problem that typically involves evaluating structures using a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize the process of causal structure learning. Rather than searching over causal structures directly, we train a variational inference model to predict the causal structure from observational/interventional data. Our inference model acquires domain-specific inductive bias for causal discovery solely from data generated by a simulator. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Moreover, the architecture of our inference model is permutation invariant w.r.t. the data points and permutation equivariant w.r.t. the variables, facilitating generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities under substantial distribution shift and significantly outperform existing algorithms, especially in the challenging genomics domain.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v2 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other.
    A scalable multi-step least squares method for network identification with unknown disturbance topology. (arXiv:2106.07548v3 [eess.SY] UPDATED)
    Identification methods for dynamic networks typically require prior knowledge of the network and disturbance topology, and often rely on solving poorly scalable non-convex optimization problems. While methods for estimating network topology are available in the literature, less attention has been paid to estimating the disturbance topology, i.e., the (spatial) noise correlation structure and the noise rank in a filtered white noise representation of the disturbance signal. In this work we present an identification method for dynamic networks, in which an estimation of the disturbance topology precedes the identification of the full dynamic network with known network topology. To this end we extend the multi-step Sequential Linear Regression and Weighted Null Space Fitting methods to deal with reduced rank noise, and use these methods to estimate the disturbance topology and the network dynamics in the full measurement situation. As a result, we provide a multi-step least squares algorithm with parallel computation capabilities that relies only on explicit analytical solutions, thereby avoiding the usual non-convex optimizations involved. Consequently we consistently estimate dynamic networks of Box-Jenkins model structure, while keeping the computational burden low. We provide a consistency proof that includes path-based data informativity conditions for allocation of excitation signals in the experimental design. Numerical simulations performed on a dynamic network with reduced rank noise clearly illustrate the potential of this method.
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v1 [stat.ML])
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is one of the most effective approaches to defend against such examples. We show that for linear regression problems, adversarial training can be formulated as a convex problem. This fact is then used to show that $\ell_\infty$-adversarial training produces sparse solutions and has many similarities to the lasso method. Similarly, $\ell_2$-adversarial training has similarities with ridge regression. We use a robust regression framework to analyze and understand these similarities and also point to some differences. Finally, we show how adversarial training behaves differently from other regularization methods when estimating overparameterized models (i.e., models with more parameters than datapoints). It minimizes a sum of three terms which regularizes the solution, but unlike lasso and ridge regression, it can sharply transition into an interpolation mode. We show that for sufficiently many features or sufficiently small regularization parameters, the learned model perfectly interpolates the training data while still exhibiting good out-of-sample performance.  ( 2 min )
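    A short worked identity may help make the lasso and ridge analogies above concrete. It is the standard dual-norm argument for a single sample $(x, y)$, not necessarily the paper's own derivation; the symbols $\beta$ (regression coefficients), $\delta$ (input perturbation) and $\varepsilon$ (perturbation budget) are notation assumed here for illustration.

```latex
% Inner maximization of the squared loss over a norm-bounded perturbation.
% The worst-case delta aligns -\delta^\top\beta with the sign of the residual.
\max_{\|\delta\|_\infty \le \varepsilon}
  \bigl(y - (x + \delta)^\top \beta\bigr)^2
  = \bigl(\lvert y - x^\top \beta \rvert + \varepsilon \,\|\beta\|_1\bigr)^2 ,
\qquad
\max_{\|\delta\|_2 \le \varepsilon}
  \bigl(y - (x + \delta)^\top \beta\bigr)^2
  = \bigl(\lvert y - x^\top \beta \rvert + \varepsilon \,\|\beta\|_2\bigr)^2 .
```

    Under this reading, the $\ell_\infty$ budget introduces an $\ell_1$ penalty on $\beta$ (as in the lasso) and the $\ell_2$ budget an $\ell_2$ penalty (as in ridge regression), but the penalty sits inside the square together with the residual rather than being added to the loss.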
    On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity. (arXiv:2205.12642v1 [stat.ML])
    Most complex machine learning and modelling techniques are prone to over-fitting and may subsequently generalise poorly to future data. Artificial neural networks are no different in this regard and, despite having a level of implicit regularisation when trained with gradient descent, often require the aid of explicit regularisers. We introduce a new framework, Model Gradient Similarity (MGS), that (1) serves as a metric of regularisation, which can be used to monitor neural network training, (2) adds insight into how explicit regularisers, while derived from widely different principles, operate via the same mechanism underneath by increasing MGS, and (3) provides the basis for a new regularisation scheme which exhibits excellent performance, especially in challenging settings such as high levels of label noise or limited sample sizes.  ( 2 min )
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v1 [cs.LG])
    We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.  ( 2 min )
    Gradient-based explanations for Gaussian Process regression and classification models. (arXiv:2205.12797v1 [cs.LG])
    Gaussian Processes (GPs) have proven themselves as a reliable and effective method in probabilistic Machine Learning. Thanks to recent and current advances, modeling complex data with GPs is becoming more and more feasible. Thus, these types of models are, nowadays, an interesting alternative to Neural and Deep Learning methods, which are arguably the current state-of-the-art in Machine Learning. For the latter, we see an increasing interest in so-called explainable approaches - in essence methods that aim to make a Machine Learning model's decision process transparent to humans. Such methods are particularly needed when illogical or biased reasoning can lead to actual disadvantageous consequences for humans. Ideally, explainable Machine Learning should help detect such flaws in a model and aid a subsequent debugging process. One active line of research in Machine Learning explainability is gradient-based methods, which have been successfully applied to complex neural networks. Given that GPs are closed under differentiation, gradient-based explainability for GPs appears as a promising field of research. This paper is primarily focused on explaining GP classifiers via gradients where, contrary to GP regression, derivative GPs are not straightforward to obtain.  ( 2 min )
    Deep interpretable ensembles. (arXiv:2205.12729v1 [stat.ML])
    Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.  ( 2 min )
    Multimodal active speaker detection and virtual cinematography for video conferencing. (arXiv:2002.03977v3 [eess.AS] UPDATED)
    Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting and zooming of a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC that performs within 0.3 MOS of an expert cinematographer based on subjective ratings with a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and reduce switching latency the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques and evaluated on a dataset with N=100 meetings, each 2-5 minutes in length.  ( 2 min )

  • Open

    How to start reinforcement learning as an electrical and electronic control engineer?
    submitted by /u/Ibrahim_Attawil [link] [comments]  ( 1 min )
    Sneak Peek of Animo Island, a PC game that empowers players to explore reinforcement learning as a game mechanic - Currently looking for play testers for an upcoming beta release!
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    What universities are hubs for reinforcement learning research?
    I am currently working through Reinforcement Learning by Sutton and Barto. It is a great book, but since it is such a sub-speciality of machine learning I am wondering what universities have research being done in the field and not just machine learning in general? Where are top papers coming from? submitted by /u/caedin8 [link] [comments]  ( 1 min )
    [D] confusion about KL divergence in PPO (implementation confusion)
    Hi all, I am following this workflow for PPO using Keras (the Keras PPO example) and I have come across something that doesn't make sense to me; any help and clarification is appreciated. The confusion is in this code, where the ratio is calculated:

        def train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        ):
            with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
                ratio = tf.exp(
                    logprobabilities(actor(observation_buffer), action_buffer)
                    - logprobability_buffer
                )

    I am confused about the subtraction of the log-probability buffer. The code that precedes this collects the information from the buffer:

        # Get values from the buffer
        (
            observation_buffer,
            action_buffer,
            advantage_buffer,
            return_buffer,
            logprobability_buffer,
        ) = buffer.get()

    I feel like there is some tautology here, although I am probably missing something very obvious (with some misunderstanding of KL divergence/importance sampling, I'm sure). The log-probabilities function takes the logits from the actor model, carries out a softmax over those logits, takes the action that was selected, and returns the log-probability of that action. But in the train step, we take the observations and actions from the buffer (the same ones for the epoch that has just been run), call the log-probabilities function to get a batch of log-probabilities for each state-action pair, and then subtract the log-probabilities from the buffer. However, the log-probabilities from the buffer ARE the same log-probabilities that the actor model will spit out, as they come from the same observations and actions that led to the log-probabilities being inserted into the buffer, and the weights have not been updated by this point, so the softmax will be identical. What is the point of subtracting the same log-probabilities from the log-probability buffer? I don't see the traditional 'old' vs 'new' policy here. Thank you! submitted by /u/amjass12 [link] [comments]  ( 2 min )
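    The ratio computation described in the post can be sketched in plain Python (no TensorFlow). This is a toy illustration under stated assumptions, not the Keras example itself: the single state, the logits, the learning rate and the helper names `log_softmax` and `ppo_ratio` are all hypothetical. It shows that the ratio is exactly 1 on the first pass over a freshly collected batch, and moves away from 1 only once the policy has been updated while the same buffered log-probabilities are reused, for instance if the training function is called several times per batch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def ppo_ratio(logits, action, old_logprob):
    """pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(log_softmax(logits)[action] - old_logprob)


# Hypothetical single state with three actions (assumed values).
logits = [0.2, -0.1, 0.5]
action = 2
advantage = 1.0

# Log-probability stored in the buffer at data-collection time.
old_logprob = log_softmax(logits)[action]

# First pass: the weights are unchanged, so new and old policies coincide.
ratio_first = ppo_ratio(logits, action, old_logprob)

# One plain gradient-ascent step on ratio * advantage w.r.t. the logits,
# using d log pi(a) / d logit_k = 1[k == a] - pi(k).
probs = [math.exp(lp) for lp in log_softmax(logits)]
lr = 0.1
grad = [
    ratio_first * advantage * ((1.0 if k == action else 0.0) - probs[k])
    for k in range(len(logits))
]
new_logits = [l + lr * g for l, g in zip(logits, grad)]

# Second pass over the same buffer: the policy has moved, the buffer has not.
ratio_second = ppo_ratio(new_logits, action, old_logprob)

print(ratio_first, ratio_second)
```

    Even when the ratio equals 1 in value, its gradient with respect to the policy parameters is not zero, which is why the first update still changes the policy.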
    Do you guys have any resources/tutorials on how to implement your own RL environment/agents?
    I'm new to RL and I'm trying to use it for a classification task, yet I'm not quite sure how I should initialize the environment and agents. Any recommendations? submitted by /u/Affectionate_Worth43 [link] [comments]  ( 1 min )
    What are the types of tasks we can perform given an environment in MA Reinforcement Learning?
    Hello, I am new to reinforcement learning and I need to find a game environment and make agents cooperate in an optimal way. I found some environments (PettingZoo, ML-Agents SoccerTwos, ...). I read about RL algorithms such as MADDPG, and my intuition was to integrate such algorithms into some of the environments I found. I want to know if this task is interesting, and I would welcome any suggestions for tasks we can execute on MARL environments. submitted by /u/Ok_Lab_2750 [link] [comments]  ( 1 min )
    "HyperTree Proof Search for Neural Theorem Proving", Lemple et al 2022 {FB} (56% -> 65% MetaMath proofs)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    [P] Scale ML experiments from JupyterLab to the cloud
    First Medium article is out! Come see how Optumi is thinking about the shifting workflow needs of data science and machine learning professionals. https://medium.com/@optumi/scale-ml-experiments-from-jupyterlab-to-the-cloud-141bd645d8e9 submitted by /u/chrismarrie [link] [comments]  ( 1 min )
    [D] Google Imagen authors now produce images based on your prompt!
    If you are interested in getting your text converted to an image by Google Brain Imagen use the following link: https://twitter.com/mo_norouzi/status/1529497457234780162?s=20&t=3K_M972bMeGRR2wG6kobHQ submitted by /u/aifordummies [link] [comments]
    [D] From classification to regression and some physics analogies
    Hello, Here is: https://www.researchgate.net/publication/360541388_From_Classification_to_Regression_A_Note_on_Deodata_Predictors The paper describes how to adapt a set of classification algorithms in order to perform nonlinear regression. The algorithms are described with simple numerical examples. In the "Field Sampling Density" section, the described operation is akin to estimating the strength of a field. I am interested in your opinions. Thanks. submitted by /u/crispub [link] [comments]  ( 1 min )
    [N] Pull Requests and Discussions on Hugging Face
    Hey, it's Merve from Hugging Face 👋 I wanted to share some big news that I hope you find useful. The 🤗 Hub now has pull requests (PR) and discussions in repositories to improve collaboration in machine learning 🥳✨ What does PR really mean here? Let’s assume you have a big PyTorch model and someone else ported it to TensorFlow; that person can contribute it to your model repository. Someone else can open a PR to improve your model, fix your machine learning demo in a Space or change anything in the dataset. This applies to any repository on the Hub (model, Space, or dataset). You might say this sounds similar to GitHub. For code, GitHub works super well and we don’t want to (and it would be very inefficient to) recreate the feature set of GitHub. What we want to focus on is creating the collaboration toolset that’s optimized for ML. You can learn more about these new features here: https://huggingface.co/blog/community-update. Looking forward to your feedback and suggestions! ✨ Hope this is useful 🙂 submitted by /u/unofficialmerve [link] [comments]  ( 2 min )
    [D] PyTorch processes taking up tons of GPU memory - any way to reduce this?
    I am running on Arch Linux 5.17.9-arch1-1 with an NVIDIA GeForce RTX 3090 GPU. I need to run multiple processes for a reinforcement learning task, where each subprocess runs the data collection (and inference) and all the samples from that are then retrieved via queues in the main process and optimized (e.g. think of PPO but distributed, like IMPALA). I am using torch.multiprocessing for this. Unfortunately, the multiple spawned subprocesses cause A LOT of overhead in terms of GPU memory being used. See below for my nvidia-smi output:

        | 0 N/A N/A 559025 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559026 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559027 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559028 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559029 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559030 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559031 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559032 C ...3/envs/ml/bin/python 1873MiB |
        | 0 N/A N/A 559033 C ...3/envs/ml/bin/python 1873MiB |

    So it seems like each subprocess loads the entirety of all of PyTorch into GPU memory, which seems incredibly inefficient. Is there a way to get the subprocesses to only load this once and then share it? How can I reduce the GPU footprint for each process? EDIT: Even using a basic example from the PyTorch repo I can see the same problem: because it's not forking, it seems to be using up tons of GPU memory for each process. Can this not be fixed? submitted by /u/tmuxed [link] [comments]  ( 2 min )
    [P] ZenML: Build vendor-agnostic, production-ready MLOps pipelines
    Hello r/MachineLearning! Some here might remember we open-sourced ZenML, a year or so ago, and started building it out in the open. Today, we're re-launching it to the world, with a brand-new look, and a sharper focus. We've spoken to hundreds of ML teams in the last year, and here is what we've found: 🐘 Getting ML into production reliably is still hard today. It takes too long, is too complicated, and not enough people know how to do it. 🦡 MLOps platforms are not the answer because they are opinionated, rigid, and slow to change. It's time for MLOps frameworks to shine, and bring structure to the ML ecosystem that is ripe for standardization. 🐼 Well-thought-out abstractions that make sense and are flexible are what the industry needs. Our launch blog post, "The Framework Way is t…  ( 2 min )
    [P] Looking for advice how to apply clustering to learned embeddings of user-item interactions
    Hi, I'm working on a consumer segmentation job, where the goal is to understand if there are subgroups of consumers who behave in a similar way to each other, but different from the rest of the population. My dataset contains a user interacting with an item for a period of time, with a reasonable assumption that more time spent means a more favorable view of the item (so we can take the time spent as a proxy for the user's taste). So far, I've created an ML model to learn embeddings of size 64 to represent my approximately 950k users and their interactions with the ~10k items. My original plan was to apply k-means clustering to those 64-dimensional user embeddings. However, this approach isn't yielding the degree of separation I require (e.g. the most popular items to interact with are the same ones for each cluster). Trying different values for k, I also get a basically entirely flat elbow graph. How should I proceed from here? I've thought of 2 options:

    - retraining my embeddings with a smaller size and retrying with k-means
    - researching an alternative clustering algorithm

    Is there anything I haven't considered yet, but should? If not, which of these 2 approaches would you explore first, and if you prefer the latter, which algorithm(s) would you test out first? Thanks for your help! submitted by /u/the_Wallie [link] [comments]  ( 2 min )
    [D] Different input image size when using Visual Transformers
    I have an image classification problem, and have been using ResNet. The dense layers at the end are replaced with 1x1 convolutions, making the model fully convolutional. Classification is done on 128 x 128 patches, so if the input image is 128 x 128, I'll get an output of size 1x1. If the image is 512 x 512, the output will be 4x4. Each output element holds the predicted class for the patch at that position. Now I'd like to try using a transformer instead of ResNet. Can a similar thing be done with Vision Transformers? Are there any examples of that being done? submitted by /u/alkibijad [link] [comments]  ( 1 min )
    [P] Second-tier Recommender System in FunCorp
    Matrix decomposition alone is not ideal for improving recommendation systems. For example, it is hard to incorporate a user's gender and age. In this article, we describe how we implemented a second, ranking level of the model on top of the collaborative one, and how two-stage recommendation systems help us apply more complex algorithms: https://medium.com/@FunCorp/putting-a-two-layered-recommendation-system-into-production-b8caaf61393d submitted by /u/Puzzleheaded_Egg_396 [link] [comments]
    [D] Does TensorFlow Lite use the DropIT method to handle intermediate tensors?
    This blog post (Optimizing TensorFlow Lite Runtime Memory) says that TensorFlow Lite employs different approaches to handle intermediate tensors, which occupy large amounts of memory. Is one of them the method from "DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training"? submitted by /u/teraRockstar [link] [comments]  ( 1 min )
    [D] ISTM results are out - First International Symposium on the Tsetlin Machine
    ISTM Technical Program: you can find the full technical program here: https://istm.no/program/ submitted by /u/olegranmo [link] [comments]  ( 1 min )
    [P] Image Background Changer : You can change background to whatever you want.
    This project was made using the rembg package, which performs image segmentation with U^2-Net. https://reddit.com/link/uxathe/video/chhv2eewbk191/player submitted by /u/supercornson [link] [comments]  ( 1 min )
    [Discussion] Best Affordable way for doing ML online (Colab Pro, etc)
    I have exhausted my free GCP credits and I was wondering what the best (affordable) ways to do machine learning online are. Colab Pro (10 USD per month), Kaggle, Paperspace Gradient, etc. come to mind. Any thoughts on which is the best? PS: I have used the free version of Colab and Kaggle before; the session timeouts that force you to re-run the notebook from the beginning are the worst experience ever. Also, the only information I could find about Gradient was from the founder, which obviously is not very reliable. Has anybody else used it? submitted by /u/BigNet1356 [link] [comments]  ( 4 min )
    [P] Serverless model training platform for any cloud. Alternative to SageMaker
    Hi all, we've been working on building a serverless model training platform that can run on any cloud provider. We have a beta version of the platform currently working on AWS and Azure. We are looking forward to your feedback, and would like to know whether this is something enterprise users would use. We are at https://cloud.netbook.ai/ submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [P] fastdup: tool for curating computer vision datasets at scale
    https://github.com/visualdatabase/fastdup submitted by /u/gradientflow [link] [comments]
  • Open

    Deep Learning with Label Differential Privacy
    Posted by Pasin Manurangsi and Chiyuan Zhang, Research Scientists, Google Research Over the last several years, there has been an increased focus on developing differential privacy (DP) machine learning (ML) algorithms. DP has been the basis of several practical deployments in industry — and has even been employed by the U.S. Census — because it enables the understanding of system and algorithm privacy guarantees. The underlying assumption of DP is that changing a single user’s contribution to an algorithm should not significantly change its output distribution. In the standard supervised learning setting, a model is trained to make a prediction of the label for each input given a training set of example pairs {[input_1, label_1], …, [input_n, label_n]}. In the case of deep learning, previ…  ( 8 min )
  • Open

    GA / Evolution noobee question
    Say I have hundred NNs of which 90 perform badly and ten perform "ok" to play a game. Now how do I seed new NNs from these ten? What "makes" a particular NN is its weights. So how do I now generate new NNs from these ten with random mutations? How do I even know that there is something like an "optimal" NN that can actually play the game? With a math background, I think there must be something like a "solution space", and it's not clear to me that a NN obtained through random mutations of the weights is even in a solution space. Thanks! submitted by /u/CantFixMoronic [link] [comments]  ( 1 min )
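    The seeding step asked about above can be sketched as follows. This is one common scheme (keep the best networks unchanged and fill the rest of the population with Gaussian-perturbed copies of them), offered as an illustration rather than the only answer. Networks are represented as flat weight lists, and the names `mutate`, `next_generation`, the toy `fitness` function and all numeric settings are assumptions.

```python
import random


def mutate(weights, sigma, rng):
    """Return a copy of the weights with Gaussian noise added to each one."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def next_generation(population, fitness, rng, n_elite=10, size=100, sigma=0.1):
    """Keep the n_elite best individuals and refill with mutated copies."""
    elite = sorted(population, key=fitness, reverse=True)[:n_elite]
    children = [mutate(rng.choice(elite), sigma, rng)
                for _ in range(size - n_elite)]
    return elite + children


# Toy stand-in for "how well does this network play the game":
# negative squared distance to a hidden target weight vector (assumed).
target = [0.5, -1.0, 2.0]


def fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


rng = random.Random(0)
population = [[rng.uniform(-3.0, 3.0) for _ in target] for _ in range(100)]
best_before = max(fitness(w) for w in population)

for _ in range(50):
    population = next_generation(population, fitness, rng)

best_after = max(fitness(w) for w in population)
print(best_before, best_after)
```

    Because the elite are carried over unchanged, the best fitness in the population never decreases from one generation to the next; whether it approaches a good solution depends on the problem and on the mutation scale.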
    Which type of learning is used in an evolutionary simulation?
    I'm creating an evolutionary simulator. I have creatures that I want to teach how to gather food. My creatures have a neural network brain, they have various input values for things such as location, proximity to food etc. They have 2 output neurons - moveX and moveY. Their brains are represented using a genetic code. Upon procreation with another creature, the genes from both parents are randomly inherited and mutations can occur. After a few thousand cycles, the creatures successfully learn how to gather food in the most efficient way possible. What type of learning would this fall into? I wouldn't say it is supervised learning since nobody is labeling the data, nor am I calculating some fitness value. Their survival is determined purely by the simulation conditions. I would guess that this falls under reinforcement learning, but I'm not sure, it might also be unsupervised learning. Can you help me find the proper terminology? submitted by /u/Log_Dogg [link] [comments]  ( 1 min )
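    The procreation step described in the post (genes randomly inherited from both parents, with occasional mutations) can be sketched like this. The flat list-of-floats genome, the function names `crossover` and `mutate`, and the mutation rate and scale are assumptions for illustration; the actual simulator may encode genes differently.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from a randomly chosen parent."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def mutate(genome, rate, sigma, rng):
    """With probability `rate`, perturb a gene with Gaussian noise."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < rate else g
            for g in genome]


rng = random.Random(42)
mother = [0.0] * 8   # hypothetical genome: 8 connection weights
father = [1.0] * 8

child = crossover(mother, father, rng)
mutant = mutate(child, rate=0.25, sigma=0.05, rng=rng)
print(child)
print(mutant)
```

    No explicit fitness value appears here: in the setup described, selection happens implicitly through which creatures survive long enough to procreate.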
    How to Optimize your HuggingFace Transformers
    submitted by /u/aidev2040 [link] [comments]
    A grounded deep symbolic neural network for perception
    Here is a paper titled "A Grounded Deep Symbolic Neural Network for Perception" that I am planning to submit to the NeSy 2022 workshop. It is one of a number of workshops in the Second International Conference on Learning and Reasoning (IJCLR 2022) in Windsor, UK on 28th – 30th September. Papers are due by May 31st. I would appreciate any constructive feedback you would like to make. https://www.adaptroninc.com/sites/default/files/inline-files/Grounded_Deep_Symbolic_Neural_Network_for_comments.pdf Abstract Both animals and artificial intelligent agents rely upon the identification of types of objects and events during perception. It is a categorization process, which senses, recognizes and encodes invariant features. Binary neurons (binons) are general-purpose artificial neural nodes for representing properties, objects, events and relationships between them. Non-symbolic binons are used in short-term memory to represent core sensory properties such as position, intensity and time and ones derived from them. Ratios derived from these properties are converted into invariant symbolic categories using a novel discretization algorithm based on the Weber-Fechner psychophysical laws. Symbolic binons are combined to form deep hierarchical neural networks that comprise long-term memory. It contains spatial and temporal binons representing the shape and contrast patterns for categories of objects and events. They are grounded on the core and derived properties. Empirical evidence of their successful use in classifying handwritten digits was provided by Martensen in 2013[1]. The neural networks are 100% symbolic, transparent, compositional, scalable and sparse. Learning is continuous and unsupervised. submitted by /u/BrettNMartensen [link] [comments]  ( 1 min )
  • Open

    Is diversity the key to collaboration? New AI research suggests so
    A new training approach yields artificial intelligence that adapts to diverse play-styles in a cooperative game, in what could be a win for human-AI teaming.  ( 6 min )
    Early sound exposure in the womb shapes the auditory system
    Modeling study suggests that the muffled environment in utero primes the brain’s ability to interpret some types of sound.  ( 7 min )
  • Open

    Hugging Face now allows Pull Requests and Discussions on repositories
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Microsoft announces 3 months of unlimited Codex access and free tokens for OpenAI's API
    submitted by /u/Wireless_Life [link] [comments]  ( 1 min )
    How to Optimize your HuggingFace Transformers
    submitted by /u/aidev2040 [link] [comments]
    Artificial Intelligence Implications: The Future of Digital Marketing | Hey Everyone, This is my final post in a university blog series surrounding AI, The Future, and a personal area of interest! Would love some feedback and advice for future blog-style posts!
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
    GitHub - A complete guide to start and improve in machine learning (ML), artificial intelligence (AI) in 2022 without ANY background in the field and stay up-to-date with the latest news and state-of-the-art techniques!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    I got GPT-3 to write a novel
    submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    DeepMind Researchers Develop A Machine Learning Technique For Accurate Sampling And Free-Energy Estimate Of Solid Materials Using Normalizing Flows
    A significant challenge of computational statistical mechanics is the accurate estimation of equilibrium parameters of a thermodynamic system. For decades, the methods of choice for sampling such systems at large have been molecular dynamics (MD) and hybrid Monte Carlo. Strategies for sampling probability distributions have multiplied, and most try to leverage normalizing flows. Normalizing flows are a technique for creating complicated distributions by transforming a probability density through a sequence of invertible mappings. They are desirable because of two characteristics: first, they can create independent samples rapidly and in parallel, and second, they can offer the exact probability density of each generated sample. Training a flow-based model to approximate a target distribution yields an efficient but approximate sampler, and re-weighting the samples by their probability density can then be used to remove estimation bias for free energy estimation. The exciting part about flows is that they allow us to obtain accurate estimates even without samples from thermodynamic states. Paper: https://iopscience.iop.org/article/10.1088/2632-2153/ac6b16/pdf Github: https://github.com/deepmind/flows_for_atomic_solids submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
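    The re-weighting idea mentioned in this post can be illustrated with a toy one-dimensional example. This is a sketch under loud assumptions, not the DeepMind method: a fixed Gaussian stands in for a trained flow, the target is a harmonic potential U(x) = x^2/2, energies are in units of kT, and all function names are hypothetical. Because both the sample and its density under the sampler are available, the importance weights exp(-U(x)) / q(x) give an estimate of the partition function Z, and F = -log Z.

```python
import math
import random

def potential(x):
    """Toy target energy U(x) = x^2 / 2, so Z = sqrt(2*pi) exactly."""
    return 0.5 * x * x

def proposal_logpdf(x, sigma):
    """Log-density of the Gaussian sampler standing in for a trained flow."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def free_energy_estimate(n, sigma=1.5, seed=0):
    """Estimate F = -log Z by importance re-weighting of sampler draws."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        # log importance weight: log of exp(-U(x)) / q(x)
        log_w.append(-potential(x) - proposal_logpdf(x, sigma))
    # log-mean-exp of the weights, shifted by the maximum for stability
    m = max(log_w)
    log_z = m + math.log(sum(math.exp(lw - m) for lw in log_w) / n)
    return -log_z

estimate = free_energy_estimate(20000)
exact = -0.5 * math.log(2 * math.pi)
print(estimate, exact)
```

    The sampler here is deliberately wider than the target so the weights stay bounded; a sampler that misses regions of the target would give a poor estimate regardless of sample count.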
    [Project] PaddleSpeech: An Easy-to-use Speech Toolkit including SOTA/Streaming ASR with punctuation, influential TTS with text frontend and the VPR System.
    Hi, all, Glad to share an open source repository PaddleSpeech, which provides SOTA/Streaming ASR with punctuation, influential TTS with text frontend and a product-ready VPR System. Code: https://github.com/PaddlePaddle/PaddleSpeech Features Set: 📦 Ease of Use: low barriers to install. The CLIs are available to quick-start your project. 🔬 Align to the State-of-the-Art: provide high-speed and ultra-lightweight models, and also cutting-edge technology. 🏆 Streaming ASR and TTS System: provide production-ready streaming ASR and streaming TTS systems. 💯 Rule-based frontend: the frontend contains Text Normalization and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). 🛎️ Multi-language: both English and Chinese are supported. Examples: Speech Recognition Input wav: Input.wav Output text: I knocked at the door on the ancient side of the building. Text-to-Speech Input text: Life was like a box of chocolates, you never know what you're gonna get. Output wav: Output.wav submitted by /u/Aha_IamDaniel [link] [comments]  ( 1 min )
    Watch how AI makes realistic photos from anime
    submitted by /u/Due-Ad9795 [link] [comments]
    AI to track social media behavior of mass shooters
    I know next to nothing about AI, but was wondering if you could use AI to track the behavior or "search" for a pattern of behavior similar to mass shooters. Maybe a more broad question would be, could you develop AI to study human behavior and predict mental state through social media posts. submitted by /u/Lvl-Up-Candy [link] [comments]  ( 1 min )
    Last Week in AI: Autonomous cargo ships, how AI is used in Hollywood, AI to search for guns in public, and more!
    submitted by /u/regalalgorithm [link] [comments]
  • Open

    10 Best Practices For Data Science
    For quite some time now, data science has enjoyed a reputation as the next big revolution in the tech and business landscape. The number of businesses employing the applications of data science has only increased in the recent few years. According to Statista, as of 2021, nearly 60 percent of companies are housing at least… Read More »10 Best Practices For Data Science The post 10 Best Practices For Data Science appeared first on Data Science Central.  ( 7 min )
    Python Book Goodies and Apache Arrow
    In my rundown this week, I cover two distinct topics – a new Python analytics book and the rise of Apache Arrow. A New Python Data Analytics Book Published One of the best-known books on data analysis now has a new edition (3rd edition) available as an early open access The creator of Pandas library,… Read More »Python Book Goodies and Apache Arrow The post Python Book Goodies and Apache Arrow appeared first on Data Science Central.  ( 3 min )
    How Valid-Page Metadata Helps Businesses Grow
    The Internet has added a new document known as valid-page metadata that explores how the Internet processes inconsistent and invalid HTML and deals with issues of invalid mark-up. The Internet updates the help document as the title link with a new section on these headlines and troubleshoots the area.  However, most businesses are not aware… Read More »How Valid-Page Metadata Helps Businesses Grow The post How Valid-Page Metadata Helps Businesses Grow appeared first on Data Science Central.  ( 4 min )
    Why You Need an Augmented Data Integration Tool
    Introduction We are in a time when information is the core element of business success for companies in almost any industry. As technologies emerge and find large-scale adoption, there is an influx of massive amounts of data within enterprises. Two primary challenges need to be solved to obtain the necessary information. First is trustable information… Read More »Why You Need an Augmented Data Integration Tool The post Why You Need an Augmented Data Integration Tool appeared first on Data Science Central.  ( 4 min )
    Pros And Cons of AI In Manufacturing
    The fourth industrial revolution has been a game-changer, with the global economy’s expansion driving the adoption of new technologies across sectors. Manufacturers are using AI software in product design, production, supply chain, and logistics. AI analytics and data are helping in improving product quality and efficiency. Advances in machine learning, artificial intelligence (AI), and Big… Read More »Pros And Cons of AI In Manufacturing The post Pros And Cons of AI In Manufacturing appeared first on Data Science Central.  ( 5 min )
  • Open

    Harmonic e
    Douglas Hofstadter discovered that the 8th harmonic number equals e. OK, not really. The following equation cannot possibly be true because the left side is rational and the right side is irrational. However, Hofstadter showed that the equation does hold if you carry all calculations out to three decimal places. 1.000 0.500 0.333 0.250 0.200 […] Harmonic e first appeared on John D. Cook.  ( 1 min )
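    The three-decimal claim is easy to check numerically. The sketch below assumes that "carrying calculations out to three decimal places" means rounding each term 1/k to three decimals before summing; the excerpt is truncated, so that reading is an assumption. Terms are held as integer thousandths to avoid floating-point noise.

```python
import math
from fractions import Fraction

# Each term 1/k for k = 1..8, rounded to three decimals (as integer thousandths).
thousandths = [round(1000 / k) for k in range(1, 9)]
rounded_sum = sum(thousandths)

# e rounded to three decimals, in the same units.
e_thousandths = round(1000 * math.e)

# Exact 8th harmonic number as a rational, for comparison.
exact_h8 = sum(Fraction(1, k) for k in range(1, 9))

print([t / 1000 for t in thousandths])
print(rounded_sum / 1000, e_thousandths / 1000)
print(exact_h8, float(exact_h8))
```

    Under this reading the rounded terms sum to 2.718, matching e to three decimals, while the exact harmonic number stays a rational that only approximates e.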
  • Open

    Optimality Conditions and Algorithms for Top-K Arm Identification. (arXiv:2205.12086v1 [stat.ML])
    We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling efforts. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence, fixed-budget settings, and the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality. We provide the first provably optimal algorithm in the fixed-confidence setting for k>1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared to existing methods in all three settings.
    Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width. (arXiv:2205.12101v1 [cs.LG])
    Substantial work indicates that the dynamics of neural networks (NNs) is closely related to their initialization of parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we make a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradient flow for three-layer ReLU NNs and obtain two key independent quantities to distinguish different dynamical regimes for common initialization methods. With carefully designed experiments and a large computation cost, for both synthetic datasets and real datasets, we find that the dynamics of each layer also could be divided into a linear regime and a condensed regime, separated by a critical regime. The criterion is the relative change of input weights (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) as the width approaches infinity during the training, which tends to $0$, $+\infty$ and $O(1)$, respectively. In addition, we also demonstrate that different layers can lie in different dynamical regimes in a training process within a deep NN. In the condensed regime, we also observe the condensation of weights in isolated orientations with low complexity. Through experiments under the three-layer condition, our phase diagram suggests complicated dynamical regimes consisting of three possible regimes, together with their mixture, for deep NNs and provides guidance for studying deep NNs in different initialization regimes, which reveals the possibility of completely different dynamics emerging within a deep NN for its different layers.
    Graph Convolutional Reinforcement Learning for Collaborative Queuing Agents. (arXiv:2205.12009v1 [cs.NI])
    In this paper, we explore the use of multi-agent deep learning as well as learning to cooperate principles to meet stringent service level agreements, in terms of throughput and end-to-end delay, for a set of classified network flows. We consider agents built on top of a weighted fair queuing algorithm that continuously set weights for three flow groups: gold, silver, and bronze. We rely on a novel graph-convolution based, multi-agent reinforcement learning approach known as DGN. As benchmarks, we propose centralized and distributed deep Q-network approaches and evaluate their performances in different network, traffic, and routing scenarios, highlighting the effectiveness of our proposals and the importance of agent cooperation. We show that our DGN-based approach meets stringent throughput and delay requirements across all scenarios.
    DNNAbacus: Toward Accurate Computational Cost Prediction for Deep Neural Networks. (arXiv:2205.12095v1 [cs.LG])
    Deep learning is attracting interest across a variety of domains, including natural language processing, speech recognition, and computer vision. However, model training is time-consuming and requires huge computational resources. Existing works on the performance prediction of deep neural networks, which mostly focus on the training time prediction of a few models, rely on analytical models and result in high relative errors. This paper investigates the computational resource demands of 29 classical deep neural networks and builds accurate models for predicting computational costs. We first analyze the profiling results of typical networks and demonstrate that the computational resource demands of models with different inputs and hyperparameters are not obvious and intuitive. We then propose a lightweight prediction approach DNNAbacus with a novel network structural matrix for network representation. DNNAbacus can accurately predict both memory and time cost for PyTorch and TensorFlow models; it also generalizes to different hardware architectures and shows zero-shot capability for unseen networks. Our experimental results show that the mean relative error (MRE) is 0.9% with respect to time and 2.8% with respect to memory for 29 classic models, which is much lower than the state-of-the-art works.
    Training Efficient CNNS: Tweaking the Nuts and Bolts of Neural Networks for Lighter, Faster and Robust Models. (arXiv:2205.12050v1 [cs.LG])
    Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. Many techniques have evolved over the past decade that made models lighter, faster, and more robust with better generalization. However, many deep learning practitioners persist with pre-trained models and architectures trained mostly on standard datasets such as Imagenet, MS-COCO, IMDB-Wiki Dataset, and Kinetics-700 and are either hesitant about or unaware of redesigning the architecture from scratch, which will lead to better performance. This scenario leads to inefficient models that are not suitable for various devices such as mobile, edge, and fog. In addition, these conventional training methods are of concern as they consume a lot of computing power. In this paper, we revisit various SOTA techniques that deal with architecture efficiency (Global Average Pooling, depth-wise convolutions & squeeze and excitation, Blurpool), learning rate (Cyclical Learning Rate), data augmentation (Mixup, Cutout), label manipulation (label smoothing), weight space manipulation (stochastic weight averaging), and optimizer (sharpness aware minimization). We demonstrate how an efficient deep convolution network can be built in a phased manner by sequentially reducing the number of training parameters and using the techniques mentioned above. We achieved a SOTA accuracy of 99.2% on MNIST data with just 1500 parameters and an accuracy of 86.01% with just over 140K parameters on the CIFAR-10 dataset.
    On statistic alignment for domain adaptation in structural health monitoring. (arXiv:2205.12052v1 [cs.LG])
    The practical application of structural health monitoring (SHM) is often limited by the availability of labelled data. Transfer learning - specifically in the form of domain adaptation (DA) - gives rise to the possibility of leveraging information from a population of physical or numerical structures, by inferring a mapping that aligns the feature spaces. Typical DA methods rely on nonparametric distance metrics, which require sufficient data to perform density estimation. In addition, these methods can be prone to performance degradation under class imbalance. To address these issues, statistic alignment (SA) is discussed, with a demonstration of how these methods can be made robust to class imbalance, including a special case of class imbalance called a partial DA scenario. SA is demonstrated to facilitate damage localisation with no target labels in a numerical case study, outperforming other state-of-the-art DA methods. It is then shown to be capable of aligning the feature spaces of a real heterogeneous population, the Z24 and KW51 bridges, with only 220 samples used from the KW51 bridge. Finally, in scenarios where more complex mappings are required for knowledge transfer, SA is shown to be a vital pre-processing tool, increasing the performance of established DA methods.
    Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation. (arXiv:2205.11140v2 [cs.LG] UPDATED)
    We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
    Logarithmic regret bounds for continuous-time average-reward Markov decision processes. (arXiv:2205.11168v2 [cs.LG] UPDATED)
    We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
    Deep Neural Network approaches for Analysing Videos of Music Performances. (arXiv:2205.11232v2 [cs.CV] UPDATED)
    This paper presents a framework to automate the labelling process for gestures in musical performance videos with a 3D Convolutional Neural Network (CNN). While this idea was proposed in a previous study, this paper introduces several novelties: (i) Presents a novel method to overcome the class imbalance challenge and make learning possible for co-existent gestures by a batch balancing approach and spatial-temporal representations of gestures. (ii) Performs a detailed study on 7 and 18 categories of gestures generated during the performance (guitar play) of musical pieces that have been video-recorded. (iii) Investigates the possibility of using audio features. (iv) Extends the analysis to multiple videos. The novel methods significantly improve the performance of gesture identification by 12%, when compared to the previous work (51% in this study over 39% in previous work). We successfully validate the proposed methods on 7 super classes (72%), an ensemble of the 18 gestures/classes, and additional videos (75%).
    GraphMAE: Self-Supervised Masked Graph Autoencoders. (arXiv:2205.10803v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) has been extensively explored in recent years. Particularly, generative SSL has seen emerging success in natural language processing and other fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning--which heavily relies on structural data augmentation and complicated training strategies--has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential as promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present a masked graph autoencoder GraphMAE that mitigates these issues for generative self-supervised graph learning. Instead of reconstructing structures, we propose to focus on feature reconstruction with both a masking strategy and scaled cosine error that benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results show that GraphMAE--a simple graph autoencoder with our careful designs--consistently outperforms both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised learning on graphs.
    Bayesian Active Meta-Learning for Black-Box Optimization. (arXiv:2110.09943v2 [cs.LG] CROSS LISTED)
    Data-efficient learning algorithms are essential in many practical applications for which data collection is expensive, e.g., for the optimal deployment of wireless systems in unknown propagation scenarios. Meta-learning can address this problem by leveraging data from a set of related learning tasks, e.g., from similar deployment settings. In practice, one may have available only unlabeled data sets from the related tasks, requiring a costly labeling procedure to be carried out before use in meta-learning. For instance, one may know the possible positions of base stations in a given area, but not the performance indicators achievable with each deployment. To decrease the number of labeling steps required for meta-learning, this paper introduces an information-theoretic active task selection mechanism, and evaluates an instantiation of the approach for Bayesian optimization of black-box models.
    Stack operation of tensor networks. (arXiv:2203.16338v2 [cs.LG] UPDATED)
    The tensor network, as a factorization of tensors, aims at performing the operations that are common for normal tensors, such as addition, contraction and stacking. However, due to its non-unique network structure, only the tensor network contraction is so far well defined. In this paper, we propose a mathematically rigorous definition for the tensor network stack approach, which compresses a large number of tensor networks into a single one without changing their structures and configurations. We illustrate the main ideas with the matrix product states based machine learning as an example. Our results are compared with the for loop and the efficient coding method on both CPU and GPU.
    Towards Practical Physics-Informed ML Design and Evaluation for Power Grid. (arXiv:2205.03673v2 [cs.LG] UPDATED)
    When applied to a real-world safety critical system like the power grid, general machine learning methods suffer from expensive training, non-physical solutions, and limited interpretability. To address these challenges for power grids, many recent works have explored the inclusion of grid physics (i.e., domain expertise) into their method design, primarily through including system constraints and technical limits, reducing search space and defining meaningful features in latent space. Yet, there is no general methodology to evaluate the practicality of these approaches in power grid tasks, and limitations exist regarding scalability, generalization, interpretability, etc. This work formalizes a new concept of physical interpretability which assesses how an ML model makes predictions in a physically meaningful way and introduces an evaluation methodology that identifies a set of attributes that a practical method should satisfy. Inspired by the evaluation attributes, the paper further develops a novel contingency analysis warm starter for the MadIoT cyberattack, based on a conditional Gaussian random field. This method serves as an instance of an ML model that can incorporate diverse domain knowledge and improve on these identified attributes. Experiments validate that the warm starter significantly boosts the efficiency of contingency analysis for the MadIoT attack even with shallow NN architectures.
    On Understanding and Mitigating the Dimensional Collapse of Graph Contrastive Learning: a Non-Maximum Removal Approach. (arXiv:2203.12821v2 [cs.LG] UPDATED)
    Graph Contrastive Learning (GCL) has shown promising performance in graph representation learning (GRL) without the supervision of manual annotations. GCL can generate graph-level embeddings by maximizing the Mutual Information (MI) between different augmented views of the same graph (positive pairs). However, GCL is limited by dimensional collapse, i.e., embedding vectors only occupy a low-dimensional subspace. In this paper, we show that the smoothing effect of the graph pooling and the implicit regularization of the graph convolution are two causes of the dimensional collapse in GCL. To mitigate the above issue, we propose a non-maximum removal graph contrastive learning approach (nmrGCL), which removes "prominent" dimensions (i.e., those that contribute most to the similarity measurement) for positive pairs in the pre-text task. Comprehensive experiments on various benchmark datasets are conducted to demonstrate the effectiveness of nmrGCL, and the results show that our model outperforms the state-of-the-art methods. Source code will be made publicly available.
    Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers. (arXiv:2204.09481v2 [cs.CL] UPDATED)
    Zero-shot text classifiers based on label descriptions embed an input text and a set of labels into the same space: measures such as cosine similarity can then be used to select the most similar label description to the input text as the predicted label. In a true zero-shot setup, designing good label descriptions is challenging because no development set is available. Inspired by the literature on Learning with Disagreements, we look at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion. We evaluate our method on a set of diverse datasets and tasks (sentiment, topic and stance). Furthermore, we show that multiple, noisy label descriptions can be aggregated to boost the performance.
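    The label-selection step described in this abstract (embed the text and the label descriptions in one space, then pick the most similar description by cosine similarity) can be sketched as follows. The three-dimensional vectors and label names are made-up stand-ins for the output of a real text encoder, and the function names are hypothetical; the paper's ranking and aggregation method is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def predict(text_vec, label_vecs):
    """Return the label whose description embedding is closest to the text."""
    return max(label_vecs, key=lambda name: cosine(text_vec, label_vecs[name]))

# Toy embeddings (assumed): one vector per label description.
label_vecs = {
    "positive": [0.9, 0.1, 0.0],
    "negative": [-0.8, 0.2, 0.1],
    "neutral": [0.0, 0.1, 0.9],
}
text_vec = [0.7, 0.3, 0.1]  # assumed embedding of the input text

print(predict(text_vec, label_vecs))
```

    Because the prediction depends entirely on where each description lands in the embedding space, a poorly worded description shifts the decision, which is why choosing descriptions without a development set is hard.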
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v2 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Rethinking Attention-Model Explainability through Faithfulness Violation Test. (arXiv:2201.12114v2 [cs.LG] UPDATED)
    Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed as feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: weakness in identifying the polarity of feature impact. This can be misleading -- features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attention$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth faithfulness violation test) to measure the consistency between explanation weights and the impact polarity. Through extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially the raw attention. Empirical analyses on the factors affecting violation issues further provide useful observations for adopting explanation methods in attention models.
    Retrieval-Augmented Reinforcement Learning. (arXiv:2202.08417v4 [cs.LG] UPDATED)
    Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that at test-time can condition their behavior on the entire dataset and not only the current state, or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.
    Logical Fallacy Detection. (arXiv:2202.13758v2 [cs.CL] UPDATED)
    Reasoning is central to human intelligence. However, fallacious arguments are common, and some exacerbate problems such as spreading misinformation about climate change. In this paper, we propose the task of logical fallacy detection, and provide a new dataset (Logic) of logical fallacies generally found in text, together with an additional challenge set for detecting logical fallacies in climate change claims (LogicClimate). Detecting logical fallacies is a hard problem as the model must understand the underlying logical structure of the argument. We find that existing pretrained large language models perform poorly on this task. In contrast, we show that a simple structure-aware classifier outperforms the best language model by 5.46% on Logic and 4.51% on LogicClimate. We encourage future work to explore this task as (a) it can serve as a new reasoning challenge for language models, and (b) it can have potential applications in tackling the spread of misinformation. Our dataset and code are available at https://github.com/causalNLP/logical-fallacy.
    MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data. (arXiv:2203.12369v4 [cs.SD] UPDATED)
    Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results. Incorporating psychoacoustically motivated speech perception metrics as part of model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system) which introduces an additional network - a "de-generator" which attempts to improve the robustness of the prediction network (and by extension of the generator) by ensuring observation of a wider range of metric scores in training. Experimental results on the VoiceBank-DEMAND dataset show relative improvement in PESQ score of 3.8% (3.05 vs 3.22 PESQ score), as well as better generalisation to unseen noise and speech.
    Convolutional Neural Networks on Graphs with Chebyshev Approximation, Revisited. (arXiv:2202.03580v3 [cs.LG] UPDATED)
    Designing spectral convolutional networks is a challenging problem in graph learning. ChebNet, one of the early attempts, approximates the spectral graph convolutions using Chebyshev polynomials. GCN simplifies ChebNet by utilizing only the first two Chebyshev polynomials while still outperforming it on real-world datasets. GPR-GNN and BernNet demonstrate that the Monomial and Bernstein bases also outperform the Chebyshev basis in terms of learning the spectral graph convolutions. Such conclusions are counter-intuitive in the field of approximation theory, where it is established that the Chebyshev polynomial achieves the optimal convergence rate for approximating a function. In this paper, we revisit the problem of approximating the spectral graph convolutions with Chebyshev polynomials. We show that ChebNet's inferior performance is primarily due to illegal coefficients learnt by ChebNet approximating analytic filter functions, which leads to over-fitting. We then propose ChebNetII, a new GNN model based on Chebyshev interpolation, which enhances the original Chebyshev polynomial approximation while reducing the Runge phenomenon. We conducted an extensive experimental study to demonstrate that ChebNetII can learn arbitrary graph convolutions and achieve superior performance in both full- and semi-supervised node classification tasks. Most notably, we scale ChebNetII to the billion-scale graph ogbn-papers100M, showing that spectral-based GNNs have superior performance.
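    As a rough illustration of the Chebyshev basis mentioned in this abstract, the sketch below evaluates a polynomial filter sum_k c_k T_k(x) at a scalar x in [-1, 1] using the standard three-term recurrence T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x). It is not ChebNet or ChebNetII: in a spectral GNN, x would play the role of an eigenvalue of a rescaled graph Laplacian, and the function names and coefficients here are illustrative assumptions.

```python
def chebyshev_values(x, order):
    """Return [T_0(x), ..., T_order(x)] using the three-term recurrence."""
    # T_0(x) = 1 and T_1(x) = x; the slice handles order == 0.
    values = [1.0, x][: order + 1]
    for _ in range(2, order + 1):
        # T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)
        values.append(2.0 * x * values[-1] - values[-2])
    return values


def chebyshev_filter(x, coeffs):
    """Evaluate sum_k coeffs[k] * T_k(x) for a scalar x in [-1, 1].

    `coeffs` are hypothetical filter coefficients, not learned values.
    """
    basis = chebyshev_values(x, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))


if __name__ == "__main__":
    print(chebyshev_values(0.5, 3))
    print(chebyshev_filter(0.5, [1.0, 2.0, 3.0]))
```

    The recurrence avoids forming powers of x explicitly, which is the usual reason Chebyshev filters are cheap to apply.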
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v2 [cs.LG] UPDATED)
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
    Training Differentially Private Models with Secure Multiparty Computation. (arXiv:2202.02625v2 [cs.CR] UPDATED)
    We address the problem of learning a machine learning model from training data that originates at multiple data owners while providing formal privacy guarantees regarding the protection of each owner's data. Existing solutions based on Differential Privacy (DP) achieve this at the cost of a drop in accuracy. Solutions based on Secure Multiparty Computation (MPC) do not incur such accuracy loss but leak information when the trained model is made publicly available. We propose an MPC solution for training DP models. Our solution relies on an MPC protocol for model training, and an MPC protocol for perturbing the trained model coefficients with Laplace noise in a privacy-preserving manner. The resulting MPC+DP approach achieves higher accuracy than a pure DP approach while providing the same formal privacy guarantees. Our work obtained first place in the iDASH2021 Track III competition on confidential computing for secure genome analysis.
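    The perturbation step described in this abstract can be sketched in plaintext as adding Laplace noise with scale sensitivity/epsilon to each trained coefficient. This is only an assumed, non-MPC illustration of output perturbation: the paper performs the step inside an MPC protocol, and the function name, the `sensitivity` input, and the use of Python's non-cryptographic `random` module are assumptions made here.

```python
import random


def perturb_coefficients(coeffs, sensitivity, epsilon, seed=None):
    """Add independent Laplace(0, sensitivity / epsilon) noise to each coefficient.

    Plaintext sketch only: `sensitivity` is assumed to be supplied by the
    caller, and `random` is not a cryptographically secure noise source.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    rate = 1.0 / scale
    # The difference of two i.i.d. Exponential(rate) draws is Laplace(0, 1/rate).
    return [c + rng.expovariate(rate) - rng.expovariate(rate) for c in coeffs]


if __name__ == "__main__":
    print(perturb_coefficients([0.5, -1.2, 3.0], sensitivity=1.0, epsilon=0.5, seed=0))
```

    A smaller epsilon gives a larger noise scale, which is the accuracy cost of a stronger privacy guarantee.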
    PrivFair: a Library for Privacy-Preserving Fairness Auditing. (arXiv:2202.04058v3 [cs.LG] UPDATED)
    Machine learning (ML) has become prominent in applications that directly affect people's quality of life, including in healthcare, justice, and finance. ML models have been found to exhibit discrimination based on sensitive attributes such as gender, race, or disability. Assessing if an ML model is free of bias remains challenging to date, and by definition has to be done with sensitive user characteristics that are subject of anti-discrimination and data protection law. Existing libraries for fairness auditing of ML models offer no mechanism to protect the privacy of the audit data. We present PrivFair, a library for privacy-preserving fairness audits of ML models. Through the use of Secure Multiparty Computation (MPC), PrivFair protects the confidentiality of the model under audit and the sensitive data used for the audit, hence it supports scenarios in which a proprietary classifier owned by a company is audited using sensitive audit data from an external investigator. We demonstrate the use of PrivFair for group fairness auditing with tabular data or image data, without requiring the investigator to disclose their data to anyone in an unencrypted manner, or the model owner to reveal their model parameters to anyone in plaintext.
    On the identifiability of mixtures of ranking models. (arXiv:2201.13132v2 [cs.LG] UPDATED)
    Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (BTL, multinomial logistic models with slates of size 3, or Plackett-Luce) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models to establish generic identifiability. The framework can be applied more broadly to other learning models and may be of independent interest.
    Identifying Dementia Subtypes with Electronic Health Records. (arXiv:2202.00009v2 [cs.LG] UPDATED)
    Dementia is characterized by a decline in memory and thinking that is significant enough to impair function in activities of daily living. Patients seen in dementia specialty clinics are highly heterogeneous with a variety of different symptoms that progress at different rates. In this work, we used an unsupervised data-driven K-Means clustering approach on the component scores of the Clinical Dementia Rating (CDR) score to identify dementia subtypes and used the gap-statistic to identify the optimal number of clusters. Our goal was to characterize the identified dementia subtypes in terms of their cognitive performance and analyze how patient transitions between subtypes relate to disease progression. Our results indicate both (i) inter-subtype variability, which indicates the variability amongst dementia subtypes for a particular component score even with the same CDR, and (ii) intra-subtype variability, which indicates the variation in the 6 component scores within a particular dementia subtype. We observed that dementia subtypes that represented individuals with very mild dementia (CDR 0.5) had widely varying rates of transition to other subtypes. Future work includes testing the generalizability of our proposed pipeline on additional datasets, and using a larger volume of EHR data to estimate probabilistic estimates of the variability between dementia subtypes both in terms of cognitive profile and disease progression.
    Efficient Strong Scaling Through Burst Parallel Training. (arXiv:2112.10065v3 [cs.DC] UPDATED)
    As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2 - 2.3x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
    Combining optimal path search with task-dependent learning in a neural network. (arXiv:2201.11104v3 [cs.LG] UPDATED)
    Finding optimal paths in connected graphs requires determining the smallest total cost for traveling along the graph's edges. This problem can be solved by several classical algorithms where, usually, costs are predefined for all edges. Conventional planning methods thus normally cannot be used when costs must change adaptively to follow the requirements of some task. Here we show that one can define a neural network representation of path finding problems by transforming cost values into synaptic weights, which allows for online weight adaptation using network learning mechanisms. When starting with an initial activity value of one, activity propagation in this network will lead to solutions that are identical to those found by the Bellman-Ford algorithm. The neural network has the same algorithmic complexity as Bellman-Ford and, in addition, we can show that network learning mechanisms (such as Hebbian learning) can adapt the weights in the network, augmenting the resulting paths according to some task at hand. We demonstrate this by learning to navigate in an environment with obstacles as well as by learning to follow certain sequences of path nodes. Hence, the novel algorithm presented here may open up a different regime of applications where path-augmentation (by learning) is directly coupled with path finding in a natural way.
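    For reference, the classical Bellman-Ford algorithm that this abstract compares against can be sketched as follows. This is the textbook edge-relaxation procedure with fixed, predefined costs, not the neural network formulation proposed in the paper; the edge-list representation and function name are choices made for this sketch.

```python
import math


def bellman_ford(num_nodes, edges, source):
    """Return the list of minimum path costs from `source` to every node.

    `edges` is a list of (u, v, cost) tuples for directed edges u -> v.
    Unreachable nodes keep a cost of math.inf.
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    # At most num_nodes - 1 rounds of relaxing every edge are needed.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                changed = True
        if not changed:
            break  # Early exit: no edge can be relaxed further.
    # One extra pass: any remaining improvement means a negative cycle.
    for u, v, cost in edges:
        if dist[u] + cost < dist[v]:
            raise ValueError("graph contains a negative-cost cycle")
    return dist


if __name__ == "__main__":
    example_edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
    print(bellman_ford(4, example_edges, source=0))
```

    Each round touches every edge once, so the running time is proportional to the number of nodes times the number of edges.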
    Balanced Graph Structure Learning for Multivariate Time Series Forecasting. (arXiv:2201.09686v2 [cs.LG] UPDATED)
    Accurate forecasting of multivariate time series is an extensively studied subject in finance, transportation, and computer science. Fully mining the correlation and causation between the variables in a multivariate time series exhibits noticeable results in improving the performance of a time series model. Recently, some models have explored the dependencies between variables through end-to-end graph structure learning without the need for predefined graphs. However, current models do not incorporate the trade-off between efficiency and flexibility and lack the guidance of domain knowledge in the design of graph structure learning algorithms. This paper alleviates the above issues by proposing Balanced Graph Structure Learning for Forecasting (BGSLF), a novel deep learning model that joins graph structure learning and forecasting. Technically, BGSLF leverages the spatial information into convolutional operations and extracts temporal dynamics using the diffusion convolutional recurrent network. The proposed framework balances the trade-off between efficiency and flexibility by introducing Multi-Graph Generation Network (MGN) and Graph Selection Module. In addition, a method named Smooth Sparse Unit (SSU) is designed to sparsify the learned graph structures, which conforms to the sparse spatial correlations in the real world. Extensive experiments on four real-world datasets demonstrate that our model achieves state-of-the-art performance with few trainable parameters. Code will be made publicly available.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v2 [cs.LG] UPDATED)
    This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
    PowerGraph: Using neural networks and principal components to determine multivariate statistical power trade-offs. (arXiv:2201.00719v2 [stat.ME] UPDATED)
    Statistical power estimation for studies with multiple model parameters is inherently a multivariate problem. Power for individual parameters of interest cannot be reliably estimated univariately since correlation and variance explained relative to one parameter will impact the power for another parameter, all usual univariate considerations being equal. Explicit solutions in such cases, especially for models with many parameters, are either impractical or impossible to solve, leaving researchers to the prevailing method of simulating power. However, the point estimates for a vector of model parameters are uncertain, and the impact of inaccuracy is unknown. In such cases, sensitivity analysis is recommended such that multiple combinations of possible observable parameter vectors are simulated to understand power trade-offs. A limitation to this approach is that it is computationally expensive to generate sufficient sensitivity combinations to accurately map the power trade-off function in increasingly high-dimensional spaces for the models that social scientists estimate. This paper explores the efficient estimation and graphing of statistical power for a study over varying model parameter combinations. We propose a simple and generalizable machine learning inspired solution to cut the computational cost to less than 10% of the brute force method while providing F1 scores above 90%. We further motivate the impact of transfer learning in learning power manifolds across varying distributions.
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v2 [stat.ML] UPDATED)
    We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions, we use Hamiltonian Monte Carlo sampling. We demonstrate our approach in a simulation study with hemodynamical models, which are time-dependent differential equations. Data are simulated from a more complex model than our modelling choice, and the aim is to learn physical parameters according to known mathematical connections. To demonstrate the flexibility of our approach, an example using the Heat equation, a space-time dependent differential equation where we consider a case of a biased data-acquisition process, is also included. Finally, we fit the hemodynamic model using real data obtained in a medical trial.
    Swift and Sure: Hardness-aware Contrastive Learning for Low-dimensional Knowledge Graph Embeddings. (arXiv:2201.00565v2 [cs.LG] UPDATED)
    Knowledge graph embedding (KGE) has shown great potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training cost and large storage space, thus limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of Contrastive Learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of the traditional Negative Sampling, we design a new loss function based on query sampling that can balance two important training targets, Alignment and Uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism. The experimental results show that, within limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly-used datasets. After training for just a few minutes, the HaLE-trained models are competitive with the state-of-the-art models in both low- and high-dimensional conditions.
    OstrichRL: A Musculoskeletal Ostrich Simulation to Study Bio-mechanical Locomotion. (arXiv:2112.06061v2 [cs.RO] UPDATED)
    Muscle-actuated control is a research topic that spans multiple domains, including biomechanics, neuroscience, reinforcement learning, robotics, and graphics. This type of control is particularly challenging as bodies are often overactuated and dynamics are delayed and non-linear. It is however a very well tested and tuned actuation mechanism that has undergone millions of years of evolution with interesting properties exploiting passive forces and efficient energy storage of muscle-tendon units. To facilitate research on muscle-actuated simulation, we release a 3D musculoskeletal simulation of an ostrich based on the MuJoCo physics engine. The ostrich is one of the fastest bipeds on earth and therefore makes an excellent model for studying muscle-actuated bipedal locomotion. The model is based on CT scans and dissections used to collect actual muscle data, such as insertion sites, lengths, and pennation angles. Along with this model, we also provide a set of reinforcement learning tasks, including reference motion tracking, running, and neck control, used to infer muscle actuation patterns. The reference motion data is based on motion capture clips of various behaviors that we preprocessed and adapted to our model. This paper describes how the model was built and iteratively improved using the tasks. We also evaluate the accuracy of the muscle actuation patterns by comparing them to experimentally collected electromyographic data from locomoting birds. The results demonstrate the need for rich reward signals or regularization techniques to constrain muscle excitations and produce realistic movements. Overall, we believe that this work can provide a useful bridge between fields of research interested in muscle actuation.
    A Review of Indoor Millimeter Wave Device-based Localization and Device-free Sensing Technologies and Applications. (arXiv:2112.05593v2 [cs.NI] UPDATED)
    The commercial availability of low-cost millimeter wave (mmWave) communication and radar devices is starting to improve the penetration of such technologies in consumer markets, paving the way for large-scale and dense deployments in fifth-generation (5G)-and-beyond as well as 6G networks. At the same time, pervasive mmWave access will enable device localization and device-free sensing with unprecedented accuracy, especially with respect to sub-6 GHz commercial-grade devices. This paper surveys the state of the art in device-based localization and device-free sensing using mmWave communication and radar devices, with a focus on indoor deployments. We first overview key concepts about mmWave signal propagation and system design. Then, we provide a detailed account of approaches and algorithms for localization and sensing enabled by mmWaves. We consider several dimensions in our analysis, including the main objectives, techniques, and performance of each work, whether each research reached some degree of implementation, and which hardware platforms were used for this purpose. We conclude by discussing that better algorithms for consumer-grade devices, data fusion methods for dense deployments, as well as an educated application of machine learning methods are promising, relevant and timely research directions.
    Spatio-Temporal Modeling for Flash Memory Channels Using Conditional Generative Nets. (arXiv:2111.10039v2 [eess.SY] UPDATED)
    We propose a data-driven approach to modeling the spatio-temporal characteristics of NAND flash memory read voltages using conditional generative networks. The learned model reconstructs read voltages from an individual memory cell based on the program levels of the cell and its surrounding cells, as well as the specified program/erase (P/E) cycling time stamp. We evaluate the model over a range of time stamps using the cell read voltage distributions, the cell level error rates, and the relative frequency of errors for patterns most susceptible to inter-cell interference (ICI) effects. We conclude that the model accurately captures the spatial and temporal features of the flash memory channel.
    Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization. (arXiv:2110.10832v3 [cs.LG] UPDATED)
    In Domain Generalization (DG) settings, models trained independently on a given set of training domains have notoriously chaotic performance on distribution shifted test domains, and stochasticity in optimization (e.g. seed) plays a big role. This makes deep learning models unreliable in real world settings. We first show that this chaotic behavior exists even along the training optimization trajectory of a single model, and propose a simple model averaging protocol that both significantly boosts domain generalization and diminishes the impact of stochasticity by improving the rank correlation between the in-domain validation accuracy and out-domain test accuracy, which is crucial for reliable early stopping. Taking advantage of our observation, we show that instead of ensembling unaveraged models (as is typical in practice), ensembling moving average models (EoA) from independent runs further boosts performance. We theoretically explain the boost in performance of ensembling and model averaging by adapting the well known Bias-Variance trade-off to the domain generalization setting. On the DomainBed benchmark, when using a pre-trained ResNet-50, this ensemble of averages achieves an average of $68.0\%$, beating vanilla ERM (w/o averaging/ensembling) by $\sim 4\%$, and when using a pre-trained RegNetY-16GF, achieves an average of $76.6\%$, beating vanilla ERM by $6\%$. Our code is available at https://github.com/salesforce/ensemble-of-averages.
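    The two ingredients named in this abstract, averaging parameters along one training run and then ensembling the averaged models from independent runs, can be sketched as below. This is a toy illustration under stated assumptions: parameters are flat lists of floats, the average is a uniform mean over saved checkpoints (the paper's exact averaging protocol may differ), and `predict` is a hypothetical callable supplied by the user.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameter vectors saved along one training run."""
    n = len(checkpoints)
    return [sum(values) / n for values in zip(*checkpoints)]


def ensemble_predict(models, predict, x):
    """Average the output vectors that `predict(model, x)` gives for each model."""
    outputs = [predict(model, x) for model in models]
    return [sum(values) / len(outputs) for values in zip(*outputs)]


if __name__ == "__main__":
    # Two independent runs, each with two saved checkpoints (toy numbers).
    run_a = [[1.0, 3.0], [3.0, 5.0]]
    run_b = [[0.0, 2.0], [2.0, 2.0]]
    averaged = [average_checkpoints(run_a), average_checkpoints(run_b)]

    # Hypothetical "model": each parameter scales the input to one output.
    def toy_predict(params, x):
        return [p * x for p in params]

    print(ensemble_predict(averaged, toy_predict, 2.0))
```

    Averaging happens in parameter space within a run, while ensembling happens in output space across runs, which is why the two steps are kept as separate functions.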
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v6 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures and topologies, and even outperforms supervised methods.
    Joint Embedding of Structural and Functional Brain Networks with Graph Neural Networks for Mental Illness Diagnosis. (arXiv:2107.03220v2 [q-bio.NC] UPDATED)
    Multimodal brain networks characterize complex connectivities among different brain regions from both structural and functional aspects and provide a new means for mental disease analysis. Recently, Graph Neural Networks (GNNs) have become a de facto model for analyzing graph-structured data. However, how to employ GNNs to extract effective representations from brain networks in multiple modalities remains rarely explored. Moreover, as brain networks provide no initial node features, how to design informative node attributes and leverage edge weights for GNNs to learn is left unsolved. To this end, we develop a novel multiview GNN for multimodal brain networks. In particular, we regard each modality as a view for brain networks and employ contrastive learning for multimodal fusion. Then, we propose a GNN model which takes advantage of the message passing scheme by propagating messages based on degree statistics and brain region connectivities. Extensive experiments on two real-world disease datasets (HIV and Bipolar) demonstrate the effectiveness of our proposed method over state-of-the-art baselines.
    Nonnegative Tensor Completion via Integer Optimization. (arXiv:2111.04580v2 [cs.LG] UPDATED)
    Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.
    Learning Deep Representation with Energy-Based Self-Expressiveness for Subspace Clustering. (arXiv:2110.15037v2 [cs.LG] UPDATED)
    Deep subspace clustering has attracted increasing attention in recent years. Almost all the existing works are required to load the whole training data into one batch for learning the self-expressive coefficients in the framework of deep learning. Although these methods achieve promising results, such a learning fashion severely prevents the use of deeper neural network architectures (e.g., ResNet), limiting the representation abilities of the models. In this paper, we propose a new deep subspace clustering framework, motivated by the energy-based models. In contrast to previous approaches taking the weights of a fully connected layer as the self-expressive coefficients, we propose to learn an energy-based network to obtain the self-expressive coefficients by mini-batch training. By this means, it is no longer necessary to load all data into one batch for learning, and it thus becomes possible to utilize deeper neural network models for subspace clustering. Considering the powerful representation ability of the recently popular self-supervised learning, we attempt to leverage self-supervised representation learning to learn the dictionary. Finally, we propose a joint framework to learn both the self-expressive coefficients and dictionary simultaneously, and train the model in an end-to-end manner. The experiments are performed on three publicly available datasets, and extensive experimental results demonstrate our method can significantly outperform the other related approaches. For instance, on the three datasets, our method achieves average improvements of $13.8\%$, $15.4\%$, and $20.8\%$ in terms of Accuracy, NMI, and ARI, respectively, over SENet, which was proposed very recently and obtains the second-best results in the experiments.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v5 [cs.LG] UPDATED)
    The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    Approximated Multi-Agent Fitted Q Iteration. (arXiv:2104.09343v4 [cs.LG] UPDATED)
    We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximated multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, learned Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents, whereas the analogous number of computations increases exponentially for the fitted Q iteration (FQI), a commonly used approach in batch reinforcement learning. This property of AMAFQI is fundamental for the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. The simulations illustrate the significant computation time reduction when using AMAFQI instead of FQI in multi-agent problems and corroborate the similar performance of both approaches.
    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. (arXiv:2110.07205v3 [eess.AS] UPDATED)
    Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
    Differentiable Architecture Search for Reinforcement Learning. (arXiv:2106.02229v3 [cs.LG] UPDATED)
    In this paper, we investigate the fundamental question: To what extent are gradient-based neural architecture search (NAS) techniques applicable to RL? Using the original DARTS as a convenient baseline, we discover that the discrete architectures found can achieve up to 250% performance compared to manual architecture designs on both discrete and continuous action space environments across off-policy and on-policy RL algorithms, at only 3x more computation time. Furthermore, through numerous ablation studies, we systematically verify that not only does DARTS correctly upweight operations during its supernet phase, but it also gradually improves resulting discrete cells up to 30x more efficiently than random search, suggesting DARTS is a surprisingly effective tool for improving architectures in RL.
    A multi-stage machine learning model on diagnosis of esophageal manometry. (arXiv:2106.13869v2 [cs.LG] UPDATED)
    High-resolution manometry (HRM) is the primary procedure used to diagnose esophageal motility disorders. Its interpretation and classification include an initial evaluation of swallow-level outcomes and then derivation of a study-level diagnosis based on the Chicago Classification (CC), using a tree-like algorithm. This diagnostic approach to motility disorders using HRM was mirrored using a multi-stage modeling framework developed using a combination of various machine learning approaches. Specifically, the framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage. In the swallow-level stage, three models based on convolutional neural networks (CNNs) were developed to predict swallow type, swallow pressurization, and integrated relaxation pressure (IRP). At the study-level stage, model selection from families of expert-knowledge-based rule models, xgboost models, and artificial neural network (ANN) models was conducted, with the latter two models designed and augmented with motivation from the expert knowledge. A simple model-agnostic strategy of model balancing motivated by Bayesian principles was utilized, which gave rise to model averaging weighted by precision scores. The averaged (blended) models and individual models were compared and evaluated, of which the best performance on the test dataset is 0.81 in top-1 prediction and 0.92 in top-2 prediction. This is the first artificial-intelligence-style model to automatically predict the CC diagnosis of an HRM study from raw multi-swallow data. Moreover, the proposed modeling framework could be easily extended to multi-modal tasks, such as diagnosis of esophageal patients based on clinical data from both HRM and functional luminal imaging probe panometry (FLIP).
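    The blending step described above, model averaging weighted by precision scores, can be illustrated with a minimal sketch. The abstract does not specify the weighting formula, so the normalization of precisions into weights, the function names `blend_predictions` and `top_k_accuracy`, and the array shapes are all assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np

def blend_predictions(prob_list, precisions):
    """Average class-probability arrays, weighting each model by its precision.

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    precisions: one non-negative precision score per model (assumed weighting).
    """
    weights = np.asarray(precisions, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_list])
    # Contract the model axis: result has shape (n_samples, n_classes).
    return np.tensordot(weights, stacked, axes=1)

def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))
```

    With this sketch, a model with a higher precision score contributes proportionally more to the blended probabilities, and `top_k_accuracy` with `k=1` and `k=2` corresponds to the top-1 and top-2 figures the abstract reports.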
    Learning Time-Varying Graphs from Online Data. (arXiv:2110.11017v2 [cs.LG] UPDATED)
    This work proposes an algorithmic framework to learn time-varying graphs from online data. The generality offered by the framework renders it model-independent, i.e., it can be theoretically analyzed in its abstract formulation and then instantiated under a variety of model-dependent graph learning problems. This is possible by phrasing (time-varying) graph learning as a composite optimization problem, where different functions regulate different desiderata, e.g., data fidelity, sparsity or smoothness. Instrumental for the findings is recognizing that the dependence of the majority (if not all) of data-driven graph learning algorithms on the data is exerted through the empirical covariance matrix, representing a sufficient statistic for the estimation problem. Its user-defined recursive update enables the framework to work in non-stationary environments, while iterative algorithms building on novel time-varying optimization tools explicitly take into account the temporal dynamics, speeding up convergence and implicitly including a temporal regularization of the solution. We specialize the framework to three well-known graph learning models, namely, the Gaussian graphical model (GGM), the structural equation model (SEM), and the smoothness-based model (SBM), where we also introduce ad-hoc vectorization schemes for structured matrices (symmetric, hollow, etc.), which are crucial to perform correct gradient computations, in addition to enabling work in low-dimensional vector spaces and hence easing storage requirements. After discussing the theoretical guarantees of the proposed framework, we corroborate it with extensive numerical tests on synthetic and real data.
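    One common way to realize a recursive update of the empirical covariance matrix is an exponentially weighted rule with a forgetting factor. The sketch below assumes zero-mean samples and a hypothetical `forgetting` parameter; the abstract only says the update is user-defined, so this is one possible choice and not necessarily the one used in the paper.

```python
import numpy as np

def update_covariance(cov, sample, forgetting=0.9):
    """One recursive step: blend the old estimate with the new sample's outer product.

    Assumes zero-mean data; `forgetting` in (0, 1) controls how fast old data fades.
    """
    sample = np.asarray(sample, dtype=float)
    return forgetting * cov + (1.0 - forgetting) * np.outer(sample, sample)

def stream_covariance(samples, forgetting=0.9):
    """Run the recursive update over a sequence of samples, starting from zeros."""
    dim = len(samples[0])
    cov = np.zeros((dim, dim))
    for x in samples:
        cov = update_covariance(cov, x, forgetting)
    return cov
```

    A smaller `forgetting` value tracks non-stationary data more quickly at the cost of a noisier estimate, which is the trade-off a user-defined update lets the practitioner control.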
    Generalizing to Unseen Domains: A Survey on Domain Generalization. (arXiv:2103.03097v7 [cs.LG] UPDATED)
    Machine learning systems generally assume that the training and testing distributions are the same. To this end, a key requirement is to develop models that can generalize to unseen distributions. Domain generalization (DG), i.e., out-of-distribution generalization, has attracted increasing interest in recent years. Domain generalization deals with a challenging setting where one or several different but related domain(s) are given, and the goal is to learn a model that can generalize to an unseen test domain. Great progress has been made in the area of domain generalization over the years. This paper presents the first review of recent advances in this area. First, we provide a formal definition of domain generalization and discuss several related fields. We then thoroughly review the theories related to domain generalization and carefully analyze the theory behind generalization. Second, we categorize recent algorithms into three classes: data manipulation, representation learning, and learning strategy, and present several popular algorithms in detail for each category. Third, we introduce the commonly used datasets, applications, and our open-sourced codebase for fair evaluation. Finally, we summarize existing literature and present some potential research topics for the future.
    Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results. (arXiv:2107.04982v2 [cs.LG] UPDATED)
    We study the problem of out-of-distribution dynamics (OODD) detection, which involves detecting when the dynamics of a temporal process change compared to the training-distribution dynamics. This is relevant to applications in control, reinforcement learning (RL), and multi-variate time-series, where changes to test time dynamics can impact the performance of learning controllers/predictors in unknown ways. This problem is particularly important in the context of deep RL, where learned controllers often overfit to the training environment. Currently, however, there is a lack of established OODD benchmarks for the types of environments commonly used in RL research. Our first contribution is to design a set of OODD benchmarks derived from common RL environments with varying types and intensities of OODD. Our second contribution is to design a strong OODD baseline approach based on recurrent implicit quantile network (RIQN), which monitors autoregressive prediction errors for OODD detection. In addition to RIQN, we introduce and test three other simpler baselines. Our final contribution is to evaluate our baseline approaches on the benchmarks to provide results for future comparison.
    Towards Continual Knowledge Learning of Language Models. (arXiv:2110.03215v4 [cs.CL] UPDATED)
    Large Language Models (LMs) are known to encode world knowledge in their parameters as they pretrain on a vast amount of web corpus, which is often utilized for performing knowledge-dependent downstream tasks such as question answering, fact-checking, and open dialogue. In real-world scenarios, the world knowledge stored in the LMs can quickly become outdated as the world changes, but it is non-trivial to avoid catastrophic forgetting and reliably acquire new knowledge while preserving invariant knowledge. To push the community towards better maintenance of ever-changing LMs, we formulate a new continual learning (CL) problem called Continual Knowledge Learning (CKL). We construct a new benchmark and metric to quantify the retention of time-invariant world knowledge, the update of outdated knowledge, and the acquisition of new knowledge. We adopt applicable recent methods from literature to create several strong baselines. Through extensive experiments, we find that CKL exhibits unique challenges that are not addressed in previous CL setups, where parameter expansion is necessary to reliably retain and learn knowledge simultaneously. By highlighting the critical causes of knowledge forgetting, we show that CKL is a challenging and important problem that helps us better understand and train ever-changing LMs. The benchmark datasets, evaluation script, and baseline code to reproduce our results are available at https://github.com/joeljang/continual-knowledge-learning.
    Recent Advances on Neural Network Pruning at Initialization. (arXiv:2103.06460v3 [cs.LG] UPDATED)
    Neural network pruning typically removes connections or neurons from a pretrained converged model, while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning fashion. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to pruning after training (PaT) and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.
    Quasi-Equivalence of Width and Depth of Neural Networks. (arXiv:2002.02515v7 [cs.LG] UPDATED)
    While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network respectively for either regression or classification so that the essentially same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
    How much human-like visual experience do current self-supervised learning algorithms need in order to achieve human-level object recognition? (arXiv:2109.11523v3 [cs.CV] UPDATED)
    This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like" natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is several orders of magnitude longer than a human lifetime: typically on the order of a million to a billion years of natural visual experience (depending on the algorithm used). We obtain even larger estimates for achieving human-level performance in ImageNet-derived robustness benchmarks. The exact values of these estimates are sensitive to some underlying assumptions; however, even in the most optimistic scenarios they remain orders of magnitude larger than a human lifetime. We discuss the main caveats surrounding our estimates and the implications of these surprising results.
    Towards A Measure Of General Machine Intelligence. (arXiv:2109.12075v4 [cs.AI] UPDATED)
    To build general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark called the generalization index, or the g-index, to measure and compare the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
    3D Infomax improves GNNs for Molecular Property Prediction. (arXiv:2110.04126v3 [cs.LG] UPDATED)
    Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impacts. Including 3D molecular structure as input to learned models improves their performance for many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a Graph Neural Network (GNN) such that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still generates implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.
    Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey. (arXiv:2101.00734v2 [stat.ML] UPDATED)
    This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Because of their stochastic and generative behaviour, these models can also be used for generation of new data points in the data space. In this paper, we first start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, VAE is explained, where the encoder, decoder, and sampling from the latent space are introduced. Training VAE using both EM and backpropagation is explained.
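    For orientation, the standard ELBO decomposition that variational inference relies on can be written as follows. The notation is an assumption chosen for this sketch ($x$ for a data point, $z$ for the latent factor, $\theta$ for model parameters, $q(z)$ for the variational distribution) and may differ from the notation used in the tutorial itself.

```latex
\begin{align}
\log p_\theta(x)
  &= \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]
   + \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right) \\
  &\geq \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]
   =: \mathrm{ELBO}(q, \theta).
\end{align}
```

    The inequality holds because the KL divergence is non-negative. In this view, the E-step of EM tightens the bound by setting $q(z) = p_\theta(z \mid x)$, and the M-step maximizes the bound over $\theta$.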
    FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective. (arXiv:2110.03061v4 [cs.LG] UPDATED)
    Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners, since different applications have different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements of FL training. FedTune is lightweight and flexible, achieving 8.48%-26.75% improvement for different datasets compared to fixed FL hyper-parameters.
    Weakly-supervised Multi-output Regression via Correlated Gaussian Processes. (arXiv:2002.08412v2 [stat.ML] UPDATED)
    Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.
    Effects of Image Size on Deep Learning. (arXiv:2101.11508v3 [cs.CV] UPDATED)
    This paper presents the evaluation of the effects of image size on deep learning performance via semantic segmentation of magnetic resonance heart images with U-net for fully automated quantification of myocardial infarction. Both non-extra pixel and extra pixel interpolation algorithms are used to change the size of images in datasets of interest. Extra class labels, in interpolated ground truth segmentation images, are removed using thresholding, median filtering, and subtraction strategies. Common class metrics are used to evaluate the quality of semantic segmentation with U-net against the ground truth segmentation while arbitrary threshold, comparison of the sums, and sums of differences between medical experts and fully automated results are options used to estimate the relationship between medical experts-based quantification and fully automated quantification results.
    MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. (arXiv:2007.10731v2 [stat.CO] UPDATED)
    A novel multi-task Gaussian process (GP) framework is proposed, by using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective to improve multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is derived for handling both hyper-parameters optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performances, even far from observations, and may reduce significantly the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
    RevUp: Revise and Update Information Bottleneck for Event Representation. (arXiv:2205.12248v1 [cs.LG])
    In machine learning, latent variables play a key role in capturing the underlying structure of data, but they are often unsupervised. When we have side knowledge that already has high-level information about the input data, we can use that source to guide latent variables and capture the available background information in a process called "parameter injection." In that regard, we propose a semi-supervised information bottleneck-based model that enables the use of side knowledge, even if it is noisy and imperfect, to direct the learning of discrete latent variables. Fundamentally, we introduce an auxiliary continuous latent variable as a way to reparameterize the model's discrete variables with a light-weight hierarchical structure. With this reparameterization, the model's discrete latent variables are learned to minimize the mutual information between the observed data and optional side knowledge that is not already captured by the new, auxiliary variables. We theoretically show that our approach generalizes an existing method of parameter injection, and perform an empirical case study of our approach on language-based event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previously proposed approaches on multiple datasets.
    DeepKriging: Spatially Dependent Deep Neural Networks for Spatial Prediction. (arXiv:2007.11972v4 [stat.ML] UPDATED)
    In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions and is often associated with Gaussian processes. However, when considering non-linear prediction for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where the spatial dependence is captured by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and it has multiple advantages over Kriging for non-Gaussian and non-stationary data, i.e., it provides non-linear predictions and thus has smaller approximation errors, it does not require operations on covariance matrices and thus is scalable for large datasets, and with sufficiently many hidden neurons, it provides the optimal prediction in terms of model capacity. We further explore the possibility of quantifying prediction uncertainties based on density prediction without assuming any data distribution. Finally, we apply the method to predicting PM2.5 concentrations across the continental United States.
    Risk-Sensitive Reinforcement Learning via Policy Gradient Search. (arXiv:1810.09126v3 [cs.LG] UPDATED)
    The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, as well as outlining some potential future research directions.
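    Two of the risk measures listed above, value at risk and conditional value-at-risk, can be estimated from sampled episode costs. The sketch below uses a simple sample-based estimator (the mean of costs at or above the empirical quantile); the function names and the choice of NumPy's default quantile interpolation are assumptions for illustration and are not taken from the survey.

```python
import numpy as np

def value_at_risk(costs, alpha):
    """Empirical alpha-quantile of the cost samples (VaR at level alpha)."""
    return float(np.quantile(np.asarray(costs, dtype=float), alpha))

def conditional_value_at_risk(costs, alpha):
    """Mean of the cost samples at or above the empirical VaR (a simple CVaR estimate)."""
    costs = np.asarray(costs, dtype=float)
    var = value_at_risk(costs, alpha)
    tail = costs[costs >= var]  # never empty for alpha in [0, 1]
    return float(tail.mean())
```

    In a Lagrangian formulation such as the one the abstract mentions, an estimate like this would enter as the constraint term, for example by penalizing the amount by which the estimated CVaR exceeds a chosen threshold.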
    DoorGym: A Scalable Door Opening Environment And Baseline Agent. (arXiv:1908.01887v4 [cs.RO] UPDATED)
    In order to practically implement the door opening task, a policy ought to be robust to a wide distribution of door types and environment settings. Reinforcement Learning (RL) with Domain Randomization (DR) is a promising technique to enforce policy generalization; however, there are only a few accessible training environments that are inherently designed to train agents in domain randomized environments. We introduce DoorGym, an open-source door opening simulation framework designed to utilize domain randomization to train a stable policy. We intend for our environment to lie at the intersection of domain transfer, practical tasks, and realism. We also provide baseline Proximal Policy Optimization and Soft Actor-Critic implementations, which achieve success rates ranging from 0% up to 95% for opening various types of doors in this environment. Moreover, the real-world transfer experiment shows the trained policy is able to work in the real world. Environment kit available here: https://github.com/PSVL/DoorGym/
    Modular Meta-Learning for Power Control via Random Edge Graph Neural Networks. (arXiv:2108.13178v2 [cs.NI] CROSS LISTED)
    In this paper, we consider the problem of power control for a wireless network with an arbitrarily time-varying topology, including the possible addition or removal of nodes. A data-driven design methodology that leverages graph neural networks (GNNs) is adopted in order to efficiently parametrize the power control policy mapping the channel state information (CSI) to transmit powers. The specific GNN architecture, known as random edge GNN (REGNN), defines a non-linear graph convolutional filter whose spatial weights are tied to the channel coefficients. While prior work assumed a joint training approach whereby the REGNN-based policy is shared across all topologies, this paper targets adaptation of the power control policy based on limited CSI data regarding the current topology. To this end, we propose a novel modular meta-learning technique that enables the efficient optimization of module assignment. While black-box meta-learning optimizes a general-purpose adaptation procedure via (stochastic) gradient descent, modular meta-learning finds a set of reusable modules that can form components of a solution for any new network topology. Numerical results validate the benefits of meta-learning for power control problems over joint training schemes, and demonstrate the advantages of modular meta-learning when data availability is extremely limited.
    EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling. (arXiv:2205.12243v1 [stat.ML])
    This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.
    Taming the sign problem of explicitly antisymmetrized neural networks via rough activation functions. (arXiv:2205.12250v1 [cs.LG])
    Explicit antisymmetrization of a two-layer neural network is a potential candidate for a universal function approximator for generic antisymmetric functions, which are ubiquitous in quantum physics. However, this strategy suffers from a sign problem, namely, due to near-exact cancellation of positive and negative contributions, the magnitude of the antisymmetrized function may be significantly smaller than that before antisymmetrization. We prove that the severity of the sign problem is directly related to the smoothness of the activation function. For smooth activation functions (e.g., $\tanh$), the sign problem of the explicitly antisymmetrized two-layer neural network deteriorates super-polynomially with respect to the system size. On the other hand, for rough activation functions (e.g., ReLU), the deterioration rate of the sign problem can be tamed to be at most polynomial with respect to the system size. Finally, the cost of a direct implementation of an antisymmetrized two-layer neural network scales factorially with respect to the system size. We describe an efficient algorithm for approximate evaluation of such a network, whose cost scales polynomially with respect to the system size and inverse precision.
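    As a rough illustration of the direct (factorial-cost) evaluation mentioned in the abstract, the sketch below antisymmetrizes a function by summing signed evaluations over all permutations of its inputs. The function `two_layer`, its weights, and the sample inputs are hypothetical stand-ins, not the paper's network or its efficient algorithm.

```python
import itertools
import math


def permutation_sign(perm):
    """Sign of a permutation given as a tuple of indices (via inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def antisymmetrize(f, xs):
    """Sum sign(pi) * f(x_pi(1), ..., x_pi(n)) over all n! permutations."""
    total = 0.0
    for perm in itertools.permutations(range(len(xs))):
        total += permutation_sign(perm) * f([xs[i] for i in perm])
    return total


def two_layer(xs, weights=(0.7, -1.3, 0.4)):
    """Hypothetical single hidden unit: tanh of a weighted sum of the inputs."""
    return math.tanh(sum(w * x for w, x in zip(weights, xs)))


xs = [0.2, -0.5, 1.1]
value = antisymmetrize(two_layer, xs)
# Exchanging two inputs flips the sign of an antisymmetric function.
swapped = antisymmetrize(two_layer, [xs[1], xs[0], xs[2]])
```

    The loop visits all n! permutations, which is the factorial scaling the abstract refers to; comparing `abs(value)` with `abs(two_layer(xs))` gives a feel for how much cancellation occurs for a given activation.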
    Interpretation Quality Score for Measuring the Quality of interpretability methods. (arXiv:2205.12254v1 [cs.CL])
    Machine learning (ML) models have been applied to a wide range of natural language processing (NLP) tasks in recent years. In addition to making accurate decisions, the necessity of understanding how models make their decisions has become apparent in many applications. To that end, many interpretability methods that help explain the decision processes of ML models have been developed. Yet, there currently exists no widely accepted metric to evaluate the quality of explanations generated by these methods. As a result, there is currently no standard way of measuring to what degree an interpretability method achieves an intended objective. Moreover, there is no accepted standard of performance by which we can compare and rank existing interpretability methods. In this paper, we propose a novel metric for quantifying the quality of explanations generated by interpretability methods. We compute the metric on three NLP tasks using six interpretability methods and present our results.
    Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Query Attacks. (arXiv:2205.12134v1 [cs.LG])
    Score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, using only the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs could be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., no degradation in clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure $80.59\%$ accuracy under Square attack ($2500$ queries), while the best prior defense (i.e., adversarial training) only attains $67.44\%$. Since AAA attacks SQA's general greedy strategy, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets and bounds. Moreover, AAA calibrates better without hurting the accuracy. Our code will be released.
    Asynchronous Neural Networks for Learning in Graphs. (arXiv:2205.12245v1 [cs.LG])
    This paper studies asynchronous message passing (AMP), a new paradigm for applying neural network based learning to graphs. Existing graph neural networks use the synchronous distributed computing model and aggregate their neighbors in each round, which causes problems such as oversmoothing and limits their expressiveness. On the other hand, AMP is based on the asynchronous model, where nodes react to messages of their neighbors individually. We prove that (i) AMP can simulate synchronous GNNs and that (ii) AMP can theoretically distinguish any pair of graphs. We experimentally validate AMP's expressiveness. Further, we show that AMP might be better suited to propagate messages over large distances in graphs and performs well on several graph classification benchmarks.
    One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks. (arXiv:2205.12141v1 [cs.LG])
    Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Error-minimizing noise, which is injected to clean data, is one of the most successful methods for preventing DNNs from giving correct predictions on incoming new data. Nonetheless, under specific training strategies such as adversarial training, the unlearnability of error-minimizing noise will severely degrade. In addition, the transferability of error-minimizing noise is inherently limited by the mismatch between the generator model and the targeted learner model. In this paper, we investigate the mechanism of unlearnable examples and propose a novel model-free method, named \emph{One-Pixel Shortcut}, which only perturbs a single pixel of each image and makes the dataset unlearnable. Our method needs much less computational cost and obtains stronger transferability and thus can protect data from a wide range of different models. Based on this, we further introduce the first unlearnable dataset called CIFAR-10-S, which is indistinguishable from normal CIFAR-10 by human observers and can serve as a benchmark for different models or training strategies to evaluate their abilities to extract critical features from the disturbance of non-semantic representations. The original error-minimizing ULEs will lose efficiency under adversarial training, where the model can get over 83\% clean test accuracy. Meanwhile, even if adversarial training and strong data augmentation like RandAugment are applied together, the model trained on CIFAR-10-S cannot get over 50\% clean test accuracy.
    Gacs-Korner Common Information Variational Autoencoder. (arXiv:2205.12239v1 [cs.LG])
    We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is a variational relaxation of the G\'acs-K\"orner common information, which we recover as a special case, but is more amenable to optimization and can be approximated empirically using samples from the underlying distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables. Additionally, we show that the auto-encoder that we learn recovers semantically meaningful disentangled factors of variation, even though we do not explicitly optimize for it.
    Federated singular value decomposition for high dimensional data. (arXiv:2205.12109v1 [cs.LG])
    Federated learning (FL) is emerging as a privacy-aware alternative to classical cloud-based machine learning. In FL, the sensitive data remains in data silos and only aggregated parameters are exchanged. Hospitals and research institutions which are not willing to share their data can join a federated study without breaching confidentiality. In addition to the extreme sensitivity of biomedical data, the high dimensionality poses a challenge in the context of federated genome-wide association studies (GWAS). In this article, we present a federated singular value decomposition (SVD) algorithm, suitable for the privacy-related and computational requirements of GWAS. Notably, the algorithm has a transmission cost independent of the number of samples and is only weakly dependent on the number of features, because the singular vectors associated with the samples are never exchanged and the vectors associated with the features only for a fixed number of iterations. Although motivated by GWAS, the algorithm is generically applicable for both horizontally and vertically partitioned data.
    Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization. (arXiv:2205.12191v1 [cs.CL])
    Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, we observe that these models exhibit poor out-of-distribution (OOD) generalization on the task of VQA. To better understand the underlying causes of poor generalization, we comprehensively investigate performance of two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also argue that in most cases generative models are less susceptible to shifts in data distribution, while frequently performing better on our tested benchmarks. Moreover, we find that multimodal pretraining improves OOD performance in most settings. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
    Individual Topology Structure of Eye Movement Trajectories. (arXiv:2205.10667v2 [cs.CV] UPDATED)
    Traditionally, extracting patterns from eye movement data relies on statistics of different macro-events such as fixations and saccades. This requires an additional preprocessing step to separate the eye movement subtypes, often with a number of parameters on which the classification results depend. Besides that, definitions of such macro events are formulated in different ways by different researchers. We propose an application of a new class of features to the quantitative analysis of personal eye movement trajectories structure. This new class of features based on algebraic topology allows extracting patterns from different modalities of gaze such as time series of coordinates and amplitudes, heatmaps, and point clouds in a unified way at all scales from micro to macro. We experimentally demonstrate the competitiveness of the new class of features with the traditional ones and their significant synergy while being used together for the person authentication task on the recently published eye movement trajectories dataset.
    D$^\text{2}$UF: Deep Coded Aperture Design and Unrolling Algorithm for Compressive Spectral Image Fusion. (arXiv:2205.12158v1 [eess.IV])
    Compressive spectral imaging (CSI) has attracted significant attention since it employs synthetic apertures to codify spatial and spectral information, sensing only 2D projections of the 3D spectral image. However, these optical architectures suffer from a trade-off between the spatial and spectral resolution of the reconstructed image due to technology limitations. To overcome this issue, compressive spectral image fusion (CSIF) employs the projected measurements of two CSI architectures with different resolutions to estimate a high-spatial high-spectral resolution. This work presents the fusion of the compressive measurements of a low-spatial high-spectral resolution coded aperture snapshot spectral imager (CASSI) architecture and a high-spatial low-spectral resolution multispectral color filter array (MCFA) system. Unlike previous CSIF works, this paper proposes joint optimization of the sensing architectures and a reconstruction network in an end-to-end (E2E) manner. The trainable optical parameters are the coded aperture (CA) in the CASSI and the colored coded aperture in the MCFA system, employing a sigmoid activation function and regularization function to encourage binary values on the trainable variables for an implementation purpose. Additionally, an unrolling-based network inspired by the alternating direction method of multipliers (ADMM) optimization is formulated to address the reconstruction step and the acquisition systems design jointly. Finally, a spatial-spectral inspired loss function is employed at the end of each unrolling layer to increase the convergence of the unrolling network. The proposed method outperforms previous CSIF methods, and experimental results validate the method with real measurements.
    Deeper vs Wider: A Revisit of Transformer Configuration. (arXiv:2205.10505v2 [cs.LG] UPDATED)
    Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted. For example, we often set the hidden dimension (i.e., model width) of the base model to 768 and the number of transformer layers (i.e., model depth) to 12. In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, an idea of using deeper and narrower transformer configurations, for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models like MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default setting by 1.1 points on average on the GLUE datasets.
    Bias Discovery in Machine Learning Models for Mental Health. (arXiv:2205.12093v1 [cs.LG])
    Fairness and bias are crucial concepts in artificial intelligence, yet they are relatively ignored in machine learning applications in clinical psychiatry. We computed fairness metrics and present bias mitigation strategies using a model trained on clinical mental health data. We collected structured data related to the admission, diagnosis, and treatment of patients in the psychiatry department of the University Medical Center Utrecht. We trained a machine learning model to predict future administrations of benzodiazepines on the basis of past data. We found that gender plays an unexpected role in the predictions; this constitutes bias. Using the AI Fairness 360 package, we implemented reweighing and discrimination-aware regularization as bias mitigation strategies, and we explored their implications for model performance. This is the first application of bias exploration and mitigation in a machine learning model trained on real clinical psychiatry data.
    Inference of a Rumor's Source in the Independent Cascade Model. (arXiv:2205.12125v1 [cs.SI])
    We consider the so-called Independent Cascade Model for rumor spreading or epidemic processes popularized by Kempe et al.\ [2003]. In this model, a small subset of nodes in a network is the source of a rumor. In discrete time steps, each informed node "infects" each of its uninformed neighbors with probability $p$. While many facets of this process are studied in the literature, less is known about the inference problem: given a number of infected nodes in a network, can we learn the source of the rumor? In the context of epidemiology this problem is often referred to as the patient zero problem. It belongs to a broader class of problems where the goal is to infer parameters of the underlying spreading model, see, e.g., Lokhov [NeurIPS'16] or Mastakouri et al. [NeurIPS'20]. In this work we present a maximum likelihood estimator for the rumor's source, given a snapshot of the process in terms of a set of active nodes $X$ after $t$ steps. Our results show that, for cycle-free graphs, the likelihood estimator undergoes a non-trivial phase transition as a function of $t$. We provide a rigorous analysis for two prominent classes of acyclic networks, namely $d$-regular trees and Galton-Watson trees, and verify empirically that our heuristics work well in various general networks.
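    The spreading process described above can be sketched as a short simulation that produces the kind of snapshot (a set of active nodes after $t$ steps) from which a source would be inferred. This is a minimal sketch, assuming the common variant in which each newly informed node gets a single attempt per neighbor; the function name, the example path graph, and the parameter values are illustrative and not taken from the paper, and no estimator is implemented here.

```python
import random


def independent_cascade(adjacency, sources, p, steps, rng):
    """Run the cascade for `steps` rounds and return the set of active nodes.

    Assumption: each newly informed node gets exactly one chance to inform
    each uninformed neighbor, in the round after it became active.
    """
    active = set(sources)
    frontier = set(sources)
    for _ in range(steps):
        newly_active = set()
        for node in frontier:
            for neighbor in adjacency[node]:
                if neighbor not in active and rng.random() < p:
                    newly_active.add(neighbor)
        active |= newly_active
        frontier = newly_active
        if not frontier:
            break  # the rumor has died out
    return active


# Hypothetical example graph: a path 0 - 1 - 2 - 3 - 4 (cycle-free).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = independent_cascade(
    path, sources=[2], p=0.5, steps=2, rng=random.Random(0)
)
```

    Passing an explicit `random.Random` instance keeps runs reproducible, which is convenient when comparing candidate sources against the same snapshot.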
    Phased Progressive Learning with Coupling-Regulation-Imbalance Loss for Imbalanced Classification. (arXiv:2205.12117v1 [cs.LG])
    Deep neural networks generally perform poorly with datasets that suffer from quantity imbalance and classification difficulty imbalance between different classes. In order to alleviate the problem of dataset bias or domain shift in the existing two-stage approaches, a phased progressive learning schedule was proposed for smoothly transferring the training emphasis from representation learning to upper classifier training. This schedule is more effective on datasets that have more severe imbalances or smaller scales. A coupling-regulation-imbalance loss function was designed, coupling a correction term, Focal loss and LDAM loss. Coupling-regulation-imbalance loss can better deal with quantity imbalance and outliers, while regulating the focus of attention on samples with a variety of classification difficulties. Excellent results were achieved on multiple benchmark datasets using these approaches, and they can be easily generalized to other imbalanced classification models. Our code will be open-sourced soon.
    Not too little, not too much: a theoretical analysis of graph (over)smoothing. (arXiv:2205.12156v1 [stat.ML])
    We analyze graph smoothing with \emph{mean aggregation}, where each node successively receives the average of the features of its neighbors. Indeed, it has quickly been observed that Graph Neural Networks (GNNs), which generally follow some variant of Message-Passing (MP) with repeated aggregation, may be subject to the \emph{oversmoothing} phenomenon: by performing too many rounds of MP, the node features tend to converge to a non-informative limit. In the case of mean aggregation, for connected graphs, the node features become constant across the whole graph. At the other end of the spectrum, it is intuitively obvious that \emph{some} MP rounds are necessary, but existing analyses do not exhibit both phenomena at once: beneficial ``finite'' smoothing and oversmoothing in the limit. In this paper, we consider simplified linear GNNs, and rigorously analyze two examples for which a finite number of mean aggregation steps provably improves the learning performance, before oversmoothing kicks in. We consider a latent space random graph model, where node features are partial observations of the latent variables and the graph contains pairwise relationships between them. We show that graph smoothing restores some of the lost information, up to a certain point, through two phenomena: graph smoothing shrinks non-principal directions in the data faster than principal ones, which is useful for regression, and shrinks nodes within communities faster than they collapse together, which improves classification.
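    The oversmoothing limit for mean aggregation can be seen in a few lines. The sketch below repeatedly replaces each node's scalar feature by the average over its neighborhood and tracks the spread between the largest and smallest feature. The graph, the initial features, and the choice to include each node in its own neighborhood (self-loops) are assumptions made for illustration, not the paper's latent space model.

```python
def mean_aggregate(features, neighbors):
    """One round: every node takes the mean feature over its neighborhood."""
    return [
        sum(features[j] for j in neighbors[i]) / len(neighbors[i])
        for i in range(len(features))
    ]


# Hypothetical connected graph: a path 0 - 1 - 2 - 3, with each node also
# listed as its own neighbor (self-loop assumption).
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
features = [1.0, 0.0, 0.0, -1.0]

spreads = []  # max - min of the node features after each round
for _ in range(50):
    features = mean_aggregate(features, neighbors)
    spreads.append(max(features) - min(features))
```

    The spread never increases from one round to the next and shrinks towards zero, so after many rounds the features carry almost no information that distinguishes the nodes.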
    EXPANSE: A Deep Continual / Progressive Learning System for Deep Transfer Learning. (arXiv:2205.10356v2 [cs.LG] UPDATED)
    Deep transfer learning techniques try to tackle the limitations of deep learning, the dependency on extensive training data and the training costs, by reusing obtained knowledge. However, the current DTL techniques suffer from either catastrophic forgetting dilemma (losing the previously obtained knowledge) or overly biased pre-trained models (harder to adapt to target data) in finetuning pre-trained models or freezing a part of the pre-trained model, respectively. Progressive learning, a sub-category of DTL, reduces the effect of the overly biased model in the case of freezing earlier layers by adding a new layer to the end of a frozen pre-trained model. Even though it has been successful in many cases, it cannot yet handle distant source and target data. We propose a new continual/progressive learning approach for deep transfer learning to tackle these limitations. To avoid both catastrophic forgetting and overly biased-model problems, we expand the pre-trained model by expanding pre-trained layers (adding new nodes to each layer) in the model instead of only adding new layers. Hence the method is named EXPANSE. Our experimental results confirm that we can tackle distant source and target data using this technique. At the same time, the final model is still valid on the source data, achieving a promising deep continual learning approach. Moreover, we offer a new way of training deep learning models inspired by the human education system. We termed this two-step training: learning basics first, then adding complexities and uncertainties. The evaluation implies that the two-step training extracts more meaningful features and a finer basin on the error surface since it can achieve better accuracy in comparison to regular training. EXPANSE (model expansion and two-step training) is a systematic continual learning approach applicable to different problems and DL models.
    SepIt: Approaching a Single Channel Speech Separation Bound. (arXiv:2205.11801v1 [eess.AS])
    We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while the recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the different speakers' estimation. At test time, SepIt has a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v2 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that helps users compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and takes users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides can improve efficiency and reduce the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.
    Deep Reinforcement Learning for Multi-class Imbalanced Training. (arXiv:2205.12070v1 [cs.LG])
    With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training extremely imbalanced data sets, and extend it for use in multi-class settings. We combine dueling and double deep Q-learning architectures, and formulate a custom reward function and episode-training procedure, specifically with the added capability of handling multi-class imbalanced training. Using real-world clinical case studies, we demonstrate that our proposed framework outperforms current state-of-the-art imbalanced learning methods, achieving more fair and balanced classification, while also significantly improving the prediction of minority classes.
    RuMedBench: A Russian Medical Language Understanding Benchmark. (arXiv:2201.06499v2 [cs.CL] UPDATED)
    The paper describes an open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, named entity recognition) on a number of novel text sets. Given the sensitive nature of the data in healthcare, such a benchmark partially addresses the lack of Russian medical datasets. We prepare the unified format labeling, data split, and evaluation metrics for new tasks. The remaining tasks are from existing datasets with a few modifications. A single-number metric expresses a model's ability to cope with the benchmark. Moreover, we implement several baseline models, from simple ones to neural networks with transformer architecture, and release the code. As expected, the more advanced models yield better performance, but even a simple model is enough for a decent result in some tasks. Furthermore, for all tasks, we provide a human evaluation. Interestingly, the models outperform humans in the large-scale classification tasks. However, the advantage of natural intelligence remains in the tasks requiring more knowledge and reasoning.
    KQGC: Knowledge Graph Embedding with Smoothing Effects of Graph Convolutions for Recommendation. (arXiv:2205.12102v1 [cs.IR])
    Leveraging graphs in recommender systems has gained popularity with the development of graph representation learning (GRL). In particular, knowledge graph embedding (KGE) and graph neural networks (GNNs) are representative GRL approaches, which have achieved state-of-the-art performance on several recommendation tasks. Furthermore, the combination of KGE and GNNs (KG-GNNs) has been explored and found effective in many academic studies. One of the main characteristics of GNNs is their ability to retain structural properties among neighbors in the resulting dense representation, which is usually referred to as smoothing. Smoothing is especially desirable in the presence of homophilic graphs, such as the ones found in recommender systems. In this paper, we propose a new model for recommender systems named Knowledge Query-based Graph Convolution (KQGC). In contrast to existing KG-GNNs, KQGC focuses on the smoothing, and leverages a simple linear graph convolution for smoothing KGE. A pre-trained KGE is fed into KQGC, and it is smoothed by aggregating neighbor knowledge queries, which allow entity embeddings to be aligned on appropriate vector points for smoothing KGE effectively. We apply the proposed KQGC to a recommendation task that targets prospective users for specific products. Extensive experiments on a real e-commerce dataset demonstrate the effectiveness of KQGC.
    DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning. (arXiv:2204.08499v2 [cs.LG] UPDATED)
    Coreset selection, which aims to select a subset of the most informative training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, active learning, etc. However, many existing coreset selection methods are not designed for deep learning, which may have high complexity and poor generalization performance. In addition, the recently proposed methods are evaluated on models, datasets, and settings of different complexities. To advance the research of coreset selection in deep learning, we contribute a comprehensive code library, namely DeepCore, and provide an empirical study on popular coreset selection methods on CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet datasets verify that, although various methods have advantages in certain experiment settings, random selection is still a strong baseline.
    Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study). (arXiv:2205.00664v2 [cs.LG] UPDATED)
    Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labeling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the next versions of the system. Feng et al. propose DeepGini, a very fast and simple TIP, and show that it outperforms more elaborate techniques such as neuron- and surprise coverage. In a large-scale study (4 case studies, 8 test datasets, 32'200 trained models) we verify their findings. However, we also find that other comparable or even simpler baselines from the field of uncertainty quantification, such as the predicted softmax likelihood or the entropy of the predicted softmax likelihoods, perform as well as DeepGini.
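    The three scores named above are simple functions of a model's softmax output, as the following sketch suggests. It assumes the usual definition of the DeepGini score as the Gini impurity $1 - \sum_i p_i^2$; the function names, the ranking helper, and the sample outputs are hypothetical and do not reproduce the study's code.

```python
import math


def deepgini(probs):
    """Gini impurity of a softmax output: 1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in probs)


def max_softmax_uncertainty(probs):
    """One minus the predicted-class softmax likelihood."""
    return 1.0 - max(probs)


def softmax_entropy(probs):
    """Shannon entropy of the softmax output (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prioritize(outputs, score):
    """Return input indices sorted from most to least uncertain."""
    return sorted(
        range(len(outputs)), key=lambda i: score(outputs[i]), reverse=True
    )


# Hypothetical softmax outputs for three test inputs.
outputs = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25], [0.70, 0.20, 0.10]]
order = prioritize(outputs, deepgini)
```

    All three scores need only one forward pass per input, which is why they are cheap compared with coverage-based prioritizers; on this toy data they happen to produce the same ranking.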
    Selecting Continuous Life-Like Cellular Automata for Halting Unpredictability: Evolving for Abiogenesis. (arXiv:2204.07541v2 [cs.NE] UPDATED)
    Substantial efforts have been applied to engineer CA with desired emergent properties, such as supporting gliders. Recent work in continuous CA has generated a wide variety of compelling bioreminiscent patterns, and the expansion of CA research into continuously-valued domains, multiple channels, and higher dimensions complicates their study. In this work we devise a strategy for evolving CA and CA patterns in two steps, based on the simple idea that CA are likely to be complex and computationally capable if they support patterns that grow indefinitely as well as patterns that vanish completely, and if it is difficult to predict in advance which of the two will occur. The second part of our strategy evolves patterns by selecting for mobility and conservation of mean cell value. We validate our pattern evolution method by re-discovering gliders in 17 of 17 Lenia CA, and also report 4 new evolved CA and 1 randomly evolved CA that support novel evolved glider patterns. The CA reported here share neighborhood kernels with previously described Lenia CA, but exhibit a wider range of typical dynamics than their Lenia counterparts. Code for evolving continuous CA is made available under an MIT License (https://github.com/rivesunder/yuca).
    Learning Stabilizing Policies in Stochastic Control Systems. (arXiv:2205.11991v1 [cs.LG])
    In this work, we address the problem of learning provably stable neural network policies for stochastic control systems. While recent work has demonstrated the feasibility of certifying given policies using martingale theory, the problem of how to learn such policies is little explored. Here, we study the effectiveness of jointly learning a policy together with a martingale certificate that proves its stability using a single learning algorithm. We observe that the joint optimization problem becomes easily stuck in local minima when starting from a randomly initialized policy. Our results suggest that some form of pre-training of the policy is required for the joint optimization to repair and verify the policy successfully.
    Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations. (arXiv:2204.02683v2 [cs.LG] UPDATED)
    Contrastive learning is a highly effective method for learning representations from unlabeled data. Recent works show that contrastive representations can transfer across domains, leading to simple state-of-the-art algorithms for unsupervised domain adaptation. In particular, a linear classifier trained to separate the representations on the source domain can also predict classes on the target domain accurately, even though the representations of the two domains are far from each other. We refer to this phenomenon as linear transferability. This paper analyzes when and why contrastive representations exhibit linear transferability in a general unsupervised domain adaptation setting. We prove that linear transferability can occur when data from the same class in different domains (e.g., photo dogs and cartoon dogs) are more related with each other than data from different classes in different domains (e.g., photo dogs and cartoon cats) are. Our analyses are in a realistic regime where the source and target domains can have unbounded density ratios and be weakly related, and they have distant representations across domains.
    Highly Accurate FMRI ADHD Classification using time distributed multi modal 3D CNNs. (arXiv:2205.11993v1 [cs.LG])
    This work proposes an algorithm for fMRI data analysis for the classification of ADHD disorders. There have been several breakthroughs in the analysis of fMRI via 3D convolutional neural networks (CNNs). With these new techniques it is possible to preserve the 3D spatial data of fMRI data. Additionally there have been recent advances in the use of 3D generative adversarial neural networks (GANs) for the generation of normal MRI data. This work utilizes multi modal 3D CNNs with data augmentation from 3D GAN for ADHD prediction from fMRI. By leveraging a 3D-GAN it would be possible to use deepfake data to enhance the accuracy of 3D CNN classification of brain disorders. A comparison will be made between a time distributed single modal 3D CNN model for classification and the modified multi modal model with MRI data as well.
    PatchNR: Learning from Small Data by Patch Normalizing Flow Regularization. (arXiv:2205.12021v1 [cs.LG])
    Learning neural networks using only a small amount of data is an important ongoing research topic with tremendous potential for applications. In this paper, we introduce a regularizer for the variational modeling of inverse problems in imaging based on normalizing flows. Our regularizer, called patchNR, involves a normalizing flow learned on patches of very few images. The subsequent reconstruction method is completely unsupervised and the same regularizer can be used for different forward operators acting on the same class of images. By investigating the distribution of patches versus those of the whole image class, we prove that our variational model is indeed a MAP approach. Our model can be generalized to conditional patchNRs, if additional supervised information is available. Numerical examples for low-dose CT, limited-angle CT and superresolution of material images demonstrate that our method provides high quality results among unsupervised methods, but requires only few data.
    Improving Human Image Synthesis with Residual Fast Fourier Transformation and Wasserstein Distance. (arXiv:2205.12022v1 [cs.CV])
    With the rapid development of the Metaverse, virtual humans have emerged, and human image synthesis and editing techniques, such as pose transfer, have recently become popular. Most of the existing techniques rely on GANs, which can generate good human images even with large variations and occlusions. However, to the best of our knowledge, the existing state-of-the-art method still has two problems: first, the rendering of the synthetic image is not realistic, with some regions rendered poorly; second, the training of GANs is unstable and slow to converge, with failures such as model collapse. We propose several methods to address these two problems. To improve the rendering effect, we use the Residual Fast Fourier Transform Block to replace the traditional Residual Block. Then, spectral normalization and Wasserstein distance are used to improve the speed and stability of GAN training. Experiments demonstrate that the methods we offer are effective at solving the problems listed above, and we get state-of-the-art scores in LPIPS and PSNR.
    FedEntropy: Efficient Device Grouping for Federated Learning Using Maximum Entropy Judgment. (arXiv:2205.12038v1 [cs.LG])
    Along with the popularity of Artificial Intelligence (AI) and Internet-of-Things (IoT), Federated Learning (FL) has attracted steadily increasing attention as a promising distributed machine learning paradigm, which enables the training of a central model for numerous decentralized devices without exposing their privacy. However, due to the biased data distributions on involved devices, FL inherently suffers from low classification accuracy in non-IID scenarios. Although various device grouping methods have been proposed to address this problem, most of them neglect both i) distinct data distribution characteristics of heterogeneous devices, and ii) contributions and hazards of local models, which are extremely important in determining the quality of global model aggregation. In this paper, we present an effective FL method named FedEntropy with a novel dynamic device grouping scheme, which makes full use of the above two factors based on our proposed maximum entropy judgement heuristic. Unlike existing FL methods that directly aggregate local models returned from all the selected devices, in one FL round FedEntropy first makes a judgement based on the pre-collected soft labels of selected devices and then only aggregates the local models that can maximize the overall entropy of these soft labels. Without collecting local models that are harmful for aggregation, FedEntropy can effectively improve global model accuracy while reducing the overall communication overhead. Comprehensive experimental results on well-known benchmarks show that FedEntropy not only outperforms state-of-the-art FL methods in terms of model accuracy and communication overhead, but also can be integrated into them to enhance their classification performance.
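    The idea of keeping only the devices whose soft labels maximize an overall entropy can be illustrated with a small sketch. This is not the FedEntropy algorithm itself: the greedy drop-one loop, the averaging of soft labels, and the function names `entropy`, `mean_soft_label` and `select_devices` are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_soft_label(soft_labels):
    """Element-wise average of equal-length probability vectors."""
    n = len(soft_labels)
    return [sum(col) / n for col in zip(*soft_labels)]

def select_devices(soft_labels):
    """Hypothetical greedy rule: drop one device at a time for as long as
    doing so raises the entropy of the averaged soft label."""
    selected = list(range(len(soft_labels)))
    best = entropy(mean_soft_label([soft_labels[i] for i in selected]))
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for i in list(selected):
            rest = [j for j in selected if j != i]
            h = entropy(mean_soft_label([soft_labels[j] for j in rest]))
            if h > best + 1e-12:
                # Removing device i makes the aggregate more balanced.
                best, selected, improved = h, rest, True
                break
    return selected, best

# Two devices skewed towards class 0 and one skewed towards class 1.
labels = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
selected, best = select_devices(labels)
```

    In this toy case one of the two redundant class-0 devices is dropped, which leaves a balanced aggregate. The actual judgement criterion and grouping scheme should be taken from the paper.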
    Naive Few-Shot Learning: Sequence Consistency Evaluation. (arXiv:2205.12013v1 [cs.AI])
    Cognitive psychologists often use the term $\textit{fluid intelligence}$ to describe the ability of humans to solve novel tasks without any prior training. In contrast to humans, deep neural networks can perform cognitive tasks only after extensive (pre-)training with a large number of relevant examples. Motivated by fluid intelligence research in the cognitive sciences, we built a benchmark task which we call sequence consistency evaluation (SCE) that can be used to address this gap. Solving the SCE task requires the ability to extract simple rules from sequences, a basic computation that is required for solving various intelligence tests in humans. We tested $\textit{untrained}$ (naive) deep learning models in the SCE task. Specifically, we compared Relation Networks (RN) and Contrastive Predictive Coding (CPC), two models that can extract simple rules from sequences, and found that the latter, which imposes a structure on the predictable rule, does better. We further found that simple networks fare better in this task than complex ones. Finally, we show that this approach can be used for security camera anomaly detection without any prior training.
    Theoretical Analysis of Primal-Dual Algorithm for Non-Convex Stochastic Decentralized Optimization. (arXiv:2205.11979v1 [math.OC])
    In recent years, decentralized learning has emerged as a powerful tool not only for large-scale machine learning, but also for preserving privacy. One of the key challenges in decentralized learning is that the data distribution held by each node is statistically heterogeneous. To address this challenge, the primal-dual algorithm called the Edge-Consensus Learning (ECL) was proposed and was experimentally shown to be robust to the heterogeneity of data distributions. However, the convergence rate of the ECL is provided only when the objective function is convex, and has not been shown in a standard machine learning setting where the objective function is non-convex. Furthermore, the intuitive reason why the ECL is robust to the heterogeneity of data distributions has not been investigated. In this work, we first investigate the relationship between the ECL and Gossip algorithm and show that the update formulas of the ECL can be regarded as correcting the local stochastic gradient in the Gossip algorithm. Then, we propose the Generalized ECL (G-ECL), which contains the ECL as a special case, and provide the convergence rates of the G-ECL in both (strongly) convex and non-convex settings, which do not depend on the heterogeneity of data distributions. Through synthetic experiments, we demonstrate that the numerical results of both the G-ECL and ECL coincide with the convergence rate of the G-ECL.
    Can Adversarial Training Be Manipulated By Non-Robust Features?. (arXiv:2201.13329v2 [cs.LG] UPDATED)
    Adversarial training, originally designed to resist test-time adversarial examples, has been shown to be promising in mitigating training-time availability attacks. This defense ability, however, is challenged in this paper. We identify a novel threat model named stability attacks, which aims to hinder robust availability by slightly manipulating the training data. Under this threat, we show that adversarial training using a conventional defense budget $\epsilon$ provably fails to provide test robustness in a simple statistical setting, where the non-robust features of the training data can be reinforced by $\epsilon$-bounded perturbation. Further, we analyze the necessity of enlarging the defense budget to counter stability attacks. Finally, comprehensive experiments demonstrate that stability attacks are harmful on benchmark datasets, and thus the adaptive defense is necessary to maintain robustness.
    3D helical CT reconstruction with memory efficient invertible Learned Primal-Dual method. (arXiv:2205.11952v1 [eess.IV])
    Helical acquisition geometry is the most common geometry used in computed tomography (CT) scanners for medical imaging. We adapt the invertible Learned Primal-Dual (iLPD) deep neural network architecture so that it can be applied to helical 3D CT reconstruction. We achieve this by splitting the geometry and the data in parts that fit the memory and by splitting images into corresponding sub-volumes. The architecture can be applied to images different in size along the rotation axis. We perform the experiments on tomographic data simulated from realistic helical geometries.
    Realization Theory Of Recurrent Neural ODEs Using Polynomial System Embeddings. (arXiv:2205.11989v1 [math.OC])
    In this paper we show that neural ODE analogs of recurrent (ODE-RNN) and Long Short-Term Memory (ODE-LSTM) networks can be algorithmically embedded into the class of polynomial systems. This embedding preserves input-output behavior and can suitably be extended to other neural DE architectures. We then use realization theory of polynomial systems to provide necessary conditions for an input-output map to be realizable by an ODE-LSTM and sufficient conditions for minimality of such systems. These results represent the first steps towards realization theory of recurrent neural ODE architectures, which is expected to be useful for model reduction and learning algorithm analysis of recurrent neural ODEs.
    How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory. (arXiv:2205.11930v1 [cs.CL])
    Human ratings are treated as the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and then rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. In this work, we analyze this standard protocol through the lens of utility theory in economics. We first identify the implicit assumptions it makes about annotators and find that these assumptions are often violated in practice, in which case annotator ratings become an unfaithful reflection of their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). In our experiments, we find that according to SPA, annotators prefer larger GPT-3 variants to smaller ones -- as expected -- with all comparisons being statistically significant. In contrast, the standard protocol only yields significant results half the time.
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v1 [stat.ML])
    Most machine learning methods depend on the tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation, which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection method based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
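    The link between bandwidth and derivative size mentioned above can be seen on the kernel itself. The sketch below is not the paper's Jacobian expression or its selection rule. It assumes the common parameterization $k(x, y) = \exp(-(x - y)^2 / (2\sigma^2))$ in one dimension, and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel with length-scale `bandwidth` (assumed parameterization)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def gaussian_kernel_dx(x, y, bandwidth):
    """Derivative of the kernel with respect to its first argument."""
    return -(x - y) / bandwidth ** 2 * gaussian_kernel(x, y, bandwidth)

def max_abs_slope(bandwidth):
    """|dk/dx| peaks at |x - y| = bandwidth, where it equals exp(-1/2) / bandwidth."""
    return abs(gaussian_kernel_dx(bandwidth, 0.0, bandwidth))

# Smaller bandwidths give steeper kernels, so steeper fitted functions are possible.
slopes = {s: max_abs_slope(s) for s in (0.1, 0.5, 1.0)}
```

    Under this parameterization the steepest slope of the kernel scales as $1/\sigma$, which is one way to read the smoothness side of the trade-off described in the abstract.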
    Assessing the Quality of Computational Notebooks for a Frictionless Transition from Exploration to Production. (arXiv:2205.11941v1 [cs.SE])
    The massive trend of integrating data-driven AI capabilities into traditional software systems is raising new intriguing challenges. One such challenge is achieving a smooth transition from the explorative phase of Machine Learning projects - in which data scientists build prototypical models in the lab - to their production phase - in which software engineers translate prototypes into production-ready AI components. To narrow down the gap between these two phases, tools and practices adopted by data scientists might be improved by incorporating consolidated software engineering solutions. In particular, computational notebooks have a prominent role in determining the quality of data science prototypes. In my research project, I address this challenge by studying the best practices for collaboration with computational notebooks and proposing proof-of-concept tools to foster guideline compliance.
    Estimation of Convex Polytopes for Automatic Discovery of Charge State Transitions in Quantum Dot Arrays. (arXiv:2108.09133v2 [cs.LG] UPDATED)
    In spin based quantum dot arrays, material or fabrication imprecisions affect the behaviour of the device, which must be taken into account when controlling it. This requires measuring the shape of specific convex polytopes. In this work, we present an algorithm that automatically discovers count, shape and size of the facets of a convex polytope from measurements. Results on simulated devices as well as a real 2x2 spin qubit array show that we can reliably find the facets of the convex polytopes, including small facets with sizes on the order of the measurement precision.
    A Data-Centric Optimization Framework for Machine Learning. (arXiv:2110.10802v2 [cs.LG] UPDATED)
    Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize performance optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization. The pipeline begins with standard networks in PyTorch or ONNX and transforms computation through progressive lowering. We define four levels of general-purpose transformations, from local intra-operator optimizations to global data movement reduction. These operate on a data-centric graph intermediate representation that expresses computation and data movement at all levels of abstraction, including expanding basic operators such as convolutions to their underlying computations. Central to the design is the interactive and introspectable nature of the pipeline. Every part is extensible through a Python API, and can be tuned interactively using a GUI. We demonstrate competitive performance or speedups on ten different networks, with interactive optimizations discovering new opportunities in EfficientNet.
    MS-nowcasting: Operational Precipitation Nowcasting with Convolutional LSTMs at Microsoft Weather. (arXiv:2111.09954v2 [cs.LG] UPDATED)
    We present the encoder-forecaster convolutional long short-term memory (LSTM) deep-learning model that powers Microsoft Weather's operational precipitation nowcasting product. This model takes as input a sequence of weather radar mosaics and deterministically predicts future radar reflectivity at lead times up to 6 hours. By stacking a large input receptive field along the feature dimension and conditioning the model's forecaster with predictions from the physics-based High Resolution Rapid Refresh (HRRR) model, we are able to outperform optical flow and HRRR baselines by 20-25% on multiple metrics averaged over all lead times.
    Predicting Physics in Mesh-reduced Space with Temporal Attention. (arXiv:2201.09113v3 [cs.LG] UPDATED)
    Graph-based next-step prediction models have recently been very successful in modeling complex high-dimensional physical systems on irregular meshes. However, due to their short temporal attention span, these models suffer from error accumulation and drift. In this paper, we propose a new method that captures long-term dependencies through a transformer-style temporal attention model. We introduce an encoder-decoder structure to summarize features and create a compact mesh representation of the system state, to allow the temporal model to operate on low-dimensional mesh representations in a memory-efficient manner. Our method outperforms a competitive GNN baseline on several complex fluid dynamics prediction tasks, from sonic shocks to vascular flow. We demonstrate stable rollouts without the need for training noise and show perfectly phase-stable predictions even for very long sequences. More broadly, we believe our approach paves the way to bringing the benefits of attention-based sequence models to solving high-dimensional complex physics tasks.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v3 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
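    For readers unfamiliar with Thompson sampling, the standard Beta-Bernoulli version is sketched below. It is the textbook exact algorithm, not the approximate variant introduced in the paper. The function name `thompson_sampling` and the arm means used in the example are illustrative assumptions.

```python
import random

def thompson_sampling(true_means, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # Beta(1, 1) prior: one pseudo-success per arm
    beta = [1] * k   # and one pseudo-failure per arm
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one plausible mean per arm from its posterior and act greedily on the draw.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.8], rounds=2000)
```

    With a gap this large between the two arms, nearly all pulls should go to the second arm once the posteriors separate.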
    Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs. (arXiv:2205.11894v1 [cs.LG])
    We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. We introduce a new model that decomposes independent dynamics of single objects accurately from their interactions. By employing latent Gaussian process ordinary differential equations, our model infers both independent dynamics and their interactions with reliable uncertainty estimates. In our formulation, each object is represented as a graph node and interactions are modeled by accumulating the messages coming from neighboring objects. We show that efficient inference of such a complex network of variables is possible with modern variational sparse Gaussian process inference techniques. We empirically demonstrate that our model improves the reliability of long-term predictions over neural network based alternatives and it successfully handles missing dynamic or static information. Furthermore, we observe that only our model can successfully encapsulate independent dynamics and interaction information in distinct functions and show the benefit from this disentanglement in extrapolation scenarios.
    Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision. (arXiv:2205.11913v1 [cs.DC])
    Deep learning (DL) has flourished in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for such a GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads cannot support DL workloads in fully utilizing the GPU resources. Recently, a substantial number of schedulers tailored to DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features. Finally, we outline several promising future research directions. A more detailed summary with the surveyed papers and code links can be found at our project website: https://github.com/S-Lab-SystemGroup/Awesome-DL-Scheduling-Papers
    Large Language Models are Zero-Shot Reasoners. (arXiv:2205.11916v1 [cs.CL])
    Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with an off-the-shelf 175B parameter model. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted through simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
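    The prompting trick described above amounts to a one-line template. In the sketch below, only the trigger sentence comes from the abstract. The `Q:`/`A:` layout and the helper name `build_zero_shot_cot_prompt` are assumptions, and the call to an actual language model is deliberately left out.

```python
TRIGGER = "Let's think step by step."

def build_zero_shot_cot_prompt(question: str) -> str:
    """Place the trigger sentence where the model's answer would begin."""
    return f"Q: {question.strip()}\nA: {TRIGGER}"

prompt = build_zero_shot_cot_prompt(
    "A juggler has 16 balls and half of them are golf balls. How many golf balls are there?"
)
```

    The resulting string would be sent to the model as-is, so that its continuation starts with step-by-step reasoning. How the final answer is then extracted from that reasoning is not covered here.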
    Physics-Embedded Neural Networks: $\boldsymbol{\mathrm{E}(n)}$-Equivariant Graph Neural PDE Solvers. (arXiv:2205.11912v1 [cs.LG])
    Graph neural networks (GNNs) are a promising approach to learning and predicting physical phenomena described in boundary value problems, such as partial differential equations (PDEs) with boundary conditions. However, existing models inadequately treat boundary conditions essential for the reliable prediction of such problems. In addition, because of the locally connected nature of GNNs, it is difficult to accurately predict the state after a long time, where interaction between vertices tends to be global. We present our approach, termed physics-embedded neural networks, which considers boundary conditions and predicts the state after a long time using an implicit method. It is built based on an $\mathrm{E}(n)$-equivariant GNN, resulting in high generalization performance on various shapes. We demonstrate that our model learns flow phenomena in complex shapes and outperforms a well-optimized classical solver and a state-of-the-art machine learning model in speed-accuracy trade-off. Therefore, our model can be a useful standard for realizing reliable, fast, and accurate GNN-based PDE solvers.
    The Data-Production Dispositif. (arXiv:2205.11963v1 [cs.HC])
    Machine learning (ML) depends on data to train and verify models. Very often, organizations outsource processes related to data work (i.e., generating and annotating data and evaluating outputs) through business process outsourcing (BPO) companies and crowdsourcing platforms. This paper investigates outsourced ML data work in Latin America by studying three platforms in Venezuela and a BPO in Argentina. We lean on the Foucauldian notion of dispositif to define the data-production dispositif as an ensemble of discourses, actions, and objects strategically disposed to (re)produce power/knowledge relations in data and labor. Our dispositif analysis comprises the examination of 210 data work instruction documents, 55 interviews with data workers, managers, and requesters, and participant observation. Our findings show that discourses encoded in instructions reproduce and normalize the worldviews of requesters. Precarious working conditions and economic dependency alienate workers, making them obedient to instructions. Furthermore, discourses and social contexts materialize in artifacts, such as interfaces and performance metrics, limiting workers' agency and normalizing specific ways of interpreting data. We conclude by stressing the importance of counteracting the data-production dispositif by fighting alienation and precarization, and empowering data workers to become assets in the quest for high-quality data.
    Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning. (arXiv:2205.11854v1 [cs.LG])
    Recently, deploying deep neural network (DNN) models via collaborative inference, which splits a pre-trained model into two parts and executes them on user equipment (UE) and edge server respectively, has become attractive. However, the large intermediate feature of DNN impedes flexible decoupling, and existing approaches either focus on the single UE scenario or simply define tasks considering the required CPU cycles, but ignore the indivisibility of a single DNN layer. In this paper, we study the multi-agent collaborative inference scenario, where a single edge server coordinates the inference of multiple UEs. Our goal is to achieve fast and energy-efficient inference for all UEs. To achieve this goal, we first design a lightweight autoencoder-based method to compress the large intermediate feature. Then we define tasks according to the inference overhead of DNNs and formulate the problem as a Markov decision process (MDP). Finally, we propose a multi-agent hybrid proximal policy optimization (MAHPPO) algorithm to solve the optimization problem with a hybrid action space. We conduct extensive experiments with different types of networks, and the results show that our method can reduce inference latency by up to 56% and save up to 72% of energy consumption.
    Causal Influences Decouple From Their Underlying Network Structure In Echo State Networks. (arXiv:2205.11947v1 [cs.LG])
    Echo State Networks (ESN) are versatile recurrent neural network models in which the hidden layer remains unaltered during training. Interactions among nodes of this static backbone produce diverse representations of the given stimuli that are harnessed by a read-out mechanism to perform computations needed for solving a given task. ESNs are accessible models of neuronal circuits, since they are relatively inexpensive to train. Therefore, ESNs have become attractive for neuroscientists studying the relationship between neural structure, function, and behavior. For instance, it is not yet clear how distinctive connectivity patterns of brain networks support effective interactions among their nodes and how these patterns of interactions give rise to computation. To address this question, we employed an ESN with a biologically inspired structure and used a systematic multi-site lesioning framework to quantify the causal contribution of each node to the network's output, thus providing a causal link between network structure and behavior. We then focused on the structure-function relationship and decomposed the causal influence of each node on all other nodes, using the same lesioning framework. We found that nodes in a properly engineered ESN interact largely irrespective of the network's underlying structure. However, in a network with the same topology and a non-optimal parameter set, the underlying connectivity patterns determine the node interactions. Our results suggest that causal structure-function relations in ESNs can be decomposed into two components, direct and indirect interactions. The former are based on influences relying on structural connections. The latter describe the effective communication between any two nodes through other intermediate nodes. These widely distributed indirect interactions may crucially contribute to the efficient performance of ESNs.
    An Adaptive Contrastive Learning Model for Spike Sorting. (arXiv:2205.11914v1 [cs.LG])
    Brain-computer interfaces (BCIs) are ways for electronic devices to communicate directly with the brain. For most medical-type brain-computer interface tasks, the activity of multiple units of neurons or local field potentials is sufficient for decoding. But for BCIs used in neuroscience research, it is important to separate out the activity of individual neurons. With the development of large-scale silicon technology and the increasing number of probe channels, manually interpreting and labeling spikes is becoming increasingly impractical. In this paper, we propose a novel modeling framework, the Adaptive Contrastive Learning Model, which learns representations from spikes through contrastive learning, with a loss function based on maximizing mutual information as its theoretical basis. Building on the fact that data with similar features share the same labels whether they are multi-classified or binary-classified, we simplify the multi-classification problem into multiple binary-classification problems, improving both the accuracy and the runtime efficiency. Moreover, we introduce a series of enhancements for the spikes, while addressing the problem that classification performance degrades because of overlapping spikes.
    A Quadrature Rule combining Control Variates and Adaptive Importance Sampling. (arXiv:2205.11890v1 [stat.ML])
    Driven by several successful applications such as in stochastic gradient descent or in Bayesian computation, control variates have become a major tool for Monte Carlo integration. However, standard methods do not allow the distribution of the particles to evolve during the algorithm, as is the case in sequential simulation methods. Within the standard adaptive importance sampling framework, a simple weighted least squares approach is proposed to improve the procedure with control variates. The procedure takes the form of a quadrature rule with adapted quadrature weights to reflect the information brought in by the control variates. The quadrature points and weights do not depend on the integrand, a computational advantage in case of multiple integrands. Moreover, the target density needs to be known only up to a multiplicative constant. Our main result is a non-asymptotic bound on the probabilistic error of the procedure. The bound proves that for improving the estimate's accuracy, the benefits from adaptive importance sampling and control variates can be combined. The good behavior of the method is illustrated empirically on synthetic examples and real-world data for Bayesian linear regression.
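    The weighted least squares idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's exact procedure: the function name, the Gaussian target and proposal, and the choice of control variates are all assumptions made for illustration. The estimate is the intercept of a weighted regression of the integrand on control variates that have zero mean under the target, with importance weights known only up to a constant.

```python
import numpy as np

def wls_control_variate_estimate(f_vals, h_vals, weights):
    """Intercept of a weighted least squares fit of f on [1, h].

    f_vals:  integrand evaluated at the particles, shape (n,)
    h_vals:  control variates with zero mean under the target, shape (n, m)
    weights: importance weights, known up to a multiplicative constant
    """
    sw = np.sqrt(weights)
    design = np.column_stack([np.ones(len(f_vals)), h_vals])
    # Scaling rows by sqrt(w) turns ordinary least squares into weighted least squares.
    coef, *_ = np.linalg.lstsq(design * sw[:, None], f_vals * sw, rcond=None)
    return coef[0]

# Illustrative setup (assumed): target N(0, 1), proposal N(0, 2^2).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=500)
# log target minus log proposal, dropping constants (unnormalized weights suffice).
weights = np.exp(-0.5 * x**2 + 0.5 * (x / 2.0) ** 2)

# Control variates with zero mean under N(0, 1).
h = np.column_stack([x, x**2 - 1.0])
# An integrand lying in the span of the controls; its true mean under N(0, 1) is 1.
f = 1.0 + 2.0 * x + 3.0 * (x**2 - 1.0)

estimate = wls_control_variate_estimate(f, h, weights)
```

    Because this toy integrand is an exact linear combination of a constant and the control variates, the regression recovers its mean up to floating-point error regardless of the sampled particles. Rescaling `weights` by any positive constant leaves the estimate unchanged, which mirrors the remark that the target density only needs to be known up to a multiplicative constant.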
    Advanced Manufacturing Configuration by Sample-efficient Batch Bayesian Optimization. (arXiv:2205.11827v1 [cs.LG])
    We propose a framework for the configuration and operation of expensive-to-evaluate advanced manufacturing methods, based on Bayesian optimization. The framework unifies a tailored acquisition function, a parallel acquisition procedure, and the integration of process information providing context to the optimization procedure. The novel acquisition function is demonstrated and analyzed on benchmark illustrative problems. We apply the optimization approach to atmospheric plasma spraying in simulation and experiments. Our results demonstrate that the proposed framework can efficiently find input parameters that produce the desired outcome and minimize the process cost.
    Accelerating Frank-Wolfe via Averaging Step Directions. (arXiv:2205.11794v1 [math.OC])
    The Frank-Wolfe method is a popular method in sparse constrained optimization, due to its fast per-iteration complexity. However, the tradeoff is that its worst case global convergence is comparatively slow, and importantly, is fundamentally slower than its flow rate--that is to say, the convergence rate is throttled by discretization error. In this work, we consider a modified Frank-Wolfe where the step direction is a simple weighted average of past oracle calls. This method requires very little memory and computational overhead, and provably decays this discretization error term. Numerically, we show that this method improves the convergence rate over several problems, especially after the sparse manifold has been detected. Theoretically, we show the method has an overall global convergence rate of $O(1/k^p)$, where $0< p < 1$; after manifold identification, this rate speeds to $O(1/k^{3p/2})$. We also observe that the method achieves this accelerated rate from a very early stage, suggesting a promising mode of acceleration for this family of methods.
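    A minimal sketch of a Frank-Wolfe iteration whose step direction is a running weighted average of past oracle outputs, as described above. The averaging weights and step sizes used here are illustrative assumptions, not the schedule analyzed in the paper, and the function names are hypothetical. The constraint set is an L1 ball, a common example of a sparsity-inducing constraint.

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """Linear minimization oracle for the L1 ball: returns a signed vertex."""
    i = int(np.argmax(np.abs(grad)))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def averaged_frank_wolfe(grad_f, x0, radius=1.0, iters=200):
    """Frank-Wolfe where the iterate moves toward an average of past oracle calls.

    x0 must be feasible. Both the averaged direction d and the iterate x are
    convex combinations of feasible points, so x stays inside the L1 ball.
    """
    x = np.array(x0, dtype=float)
    d = x.copy()
    for k in range(iters):
        s = lmo_l1(grad_f(x), radius)
        w = 2.0 / (k + 2.0)        # averaging weight (assumed schedule)
        d = (1.0 - w) * d + w * s  # weighted average of past oracle outputs
        gamma = 2.0 / (k + 2.0)    # step size (assumed schedule)
        x = x + gamma * (d - x)
    return x

# Toy problem (assumed): minimize 0.5 * ||x - b||^2 over the unit L1 ball.
b = np.array([0.5, -0.2, 0.0])
grad = lambda x: x - b
x_one_step = averaged_frank_wolfe(grad, np.zeros(3), iters=1)
x_final = averaged_frank_wolfe(grad, np.zeros(3), iters=200)
```

    Only one extra vector (`d`) is stored compared with plain Frank-Wolfe, which matches the abstract's remark about low memory and computational overhead.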
    Why KDAC? A general activation function for knowledge discovery. (arXiv:2111.13858v4 [cs.LG] UPDATED)
    Deep learning oriented named entity recognition (DNER) has gradually become the paradigm of knowledge discovery, which greatly promotes domain intelligence. However, the activation functions currently used in DNER fail to address vanishing gradients, the lack of negative outputs, or non-differentiable points, which may impede knowledge exploration through the omission and incomplete representation of latent semantics. To resolve this dilemma, we present a novel activation function termed KDAC. In detail, KDAC is an aggregation function with multiple conversion modes. The backbone of the activation region is the interaction between exponent and linearity, and both ends extend through adaptive linear divergence, which overcomes the obstacles of vanishing gradients and the lack of negative outputs. Crucially, the non-differentiable points are detected and eliminated by an approximate smoothing algorithm. KDAC has a series of desirable properties, including nonlinearity, a stable near-linear transformation and derivative, and a dynamic style. We perform experiments based on the BERT-BiLSTM-CNN-CRF model on six benchmark datasets containing different domain knowledge, such as Weibo, Clinical, E-commerce, Resume, HAZOP and People's Daily. The evaluation results show that KDAC is advanced and effective, and can provide more generalized activation to improve the performance of DNER. We hope that KDAC can be exploited as a promising activation function for knowledge construction.
    Energy Forecasting in Smart Grid Systems: A Review of the State-of-the-art Techniques. (arXiv:2011.12598v3 [cs.LG] UPDATED)
    Energy forecasting has a vital role to play in smart grid (SG) systems involving various applications such as demand-side management, load shedding, and optimum dispatch. Managing efficient forecasting while ensuring the least possible prediction error is one of the main challenges posed in the grid today, considering the uncertainty and granularity in SG data. This paper presents a comprehensive and application-oriented review of state-of-the-art forecasting methods for SG systems along with recent developments in probabilistic deep learning (PDL) considering different models and architectures. Traditional point forecasting methods including statistical, machine learning (ML), and deep learning (DL) are extensively investigated in terms of their applicability to energy forecasting. In addition, the significance of hybrid and data pre-processing techniques to support forecasting performance is also studied. A comparative case study using the Victorian electricity consumption and American electric power (AEP) datasets is conducted to analyze the performance of point and probabilistic forecasting methods. The analysis demonstrates higher accuracy of the long short-term memory (LSTM) models with appropriate hyper-parameter tuning among point forecasting methods, especially when sample sizes are larger and involve nonlinear patterns with long sequences. Furthermore, Bayesian bidirectional LSTM (BLSTM) as a probabilistic method exhibits the highest accuracy in terms of the lowest pinball score and root mean square error (RMSE).
    Neural Distributed Source Coding. (arXiv:2106.02797v2 [cs.IT] UPDATED)
    Distributed source coding (DSC) is the task of encoding an input in the absence of correlated side information that is only available to the decoder. Remarkably, Slepian and Wolf showed in 1973 that an encoder without access to the side information can asymptotically achieve the same compression rate as when the side information is available to it. While there is vast prior work on this topic, practical DSC has been limited to synthetic datasets and specific correlation structures. Here we present a framework for lossy DSC that is agnostic to the correlation structure and can scale to high dimensions. Rather than relying on hand-crafted source-modeling, our method utilizes a conditional VQ-VAE to learn the distributed encoder and decoder. We evaluate our method on multiple datasets and show that our method can handle complex correlations -- significantly better than the current state-of-the-art method.
    Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits. (arXiv:2002.00287v3 [cs.LG] UPDATED)
    We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm is allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d.~at random from a known distribution, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best available bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results on this problem setting.
    Out-of-domain Detection for Natural Language Understanding in Dialog Systems. (arXiv:1909.03862v4 [cs.CL] UPDATED)
    Natural Language Understanding (NLU) is a vital component of dialogue systems, and its ability to detect Out-of-Domain (OOD) inputs is critical in practical applications, since the acceptance of an OOD input that is unsupported by the current system may lead to catastrophic failure. However, most existing OOD detection methods rely heavily on manually labeled OOD samples and cannot take full advantage of unlabeled data. This limits the feasibility of these models in practical applications. In this paper, we propose a novel model to generate high-quality pseudo OOD samples that are akin to IN-Domain (IND) input utterances, and thereby improve the performance of OOD detection. To this end, an autoencoder is trained to map an input utterance into a latent code, and the codes of IND and OOD samples are trained to be indistinguishable by utilizing a generative adversarial network. To provide more supervision signals, an auxiliary classifier is introduced to regularize the generated OOD samples to have indistinguishable intent labels. Experiments show that these pseudo OOD samples generated by our model can be used to effectively improve OOD detection in NLU. Besides, we also demonstrate that the effectiveness of these pseudo OOD data can be further improved by efficiently utilizing unlabeled data.
    CDFKD-MFS: Collaborative Data-free Knowledge Distillation via Multi-level Feature Sharing. (arXiv:2205.11845v1 [cs.CV])
    Recently, the compression and deployment of powerful deep neural networks (DNNs) on resource-limited edge devices to provide intelligent services have become attractive tasks. Although knowledge distillation (KD) is a feasible solution for compression, its requirement on the original dataset raises privacy concerns. In addition, it is common to integrate multiple pretrained models to achieve satisfactory performance. How to compress multiple models into a tiny model is challenging, especially when the original data are unavailable. To tackle this challenge, we propose a framework termed collaborative data-free knowledge distillation via multi-level feature sharing (CDFKD-MFS), which consists of a multi-header student module, an asymmetric adversarial data-free KD module, and an attention-based aggregation module. In this framework, the student model equipped with a multi-level feature-sharing structure learns from multiple teacher models and is trained together with a generator in an asymmetric adversarial manner. When some real samples are available, the attention module adaptively aggregates predictions of the student headers, which can further improve performance. We conduct extensive experiments on three popular computer vision datasets. In particular, compared with the most competitive alternative, the accuracy of the proposed framework is 1.18\% higher on the CIFAR-100 dataset, 1.67\% higher on the Caltech-101 dataset, and 2.99\% higher on the mini-ImageNet dataset.
    Learning to Assemble Geometric Shapes. (arXiv:2205.11809v1 [cs.CV])
    Assembling parts into an object is a combinatorial problem that arises in a variety of contexts in the real world and involves numerous applications in science and engineering. Previous related work tackles limited cases with identical unit parts or jigsaw-style parts of textured shapes, which greatly mitigate combinatorial challenges of the problem. In this work, we introduce the more challenging problem of shape assembly, which involves textureless fragments of arbitrary shapes with indistinctive junctions, and then propose a learning-based approach to solving it. We demonstrate the effectiveness on shape assembly tasks with various scenarios, including the ones with abnormal fragments (e.g., missing and distorted), the different number of fragments, and different rotation discretization.
    Attributing AUC-ROC to Analyze Binary Classifier Performance. (arXiv:2205.11781v1 [cs.LG])
    Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a popular evaluation metric for binary classifiers. In this paper, we discuss techniques to segment the AUC-ROC along human-interpretable dimensions. AUC-ROC is not an additive/linear function over the data samples; therefore, segmenting the overall AUC-ROC is different from tabulating the AUC-ROC of data segments. To segment the overall AUC-ROC, we must first solve an \emph{attribution} problem to identify credit for individual examples. We observe that AUC-ROC, though non-linear over examples, is linear over \emph{pairs} of examples. This observation leads to a simple, efficient attribution technique for examples (example attributions), and for pairs of examples (pair attributions). We automatically slice these attributions using decision trees by making the tree predict the attributions; we use the notion of honest estimates along with a t-test to mitigate false discovery. Our experiments with the method show that an inferior model can outperform a superior model (trained to optimize a different training objective) on the inferior model's own training objective, a manifestation of Goodhart's Law. In contrast, AUC attributions enable a reasonable comparison. Example attributions can be used to slice this comparison. Pair attributions are used to categorize pairs of items -- one positively labeled and one negatively -- that the model has trouble separating. These categories identify the decision boundary of the classifier and the headroom to improve AUC.
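    The observation that AUC-ROC is linear over pairs can be made concrete with a short sketch. AUC equals the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as one half, so each pair contributes an additive credit. Splitting each pair's credit equally between its two examples is an assumption made here for illustration; the abstract does not specify how the paper distributes credit, and the function names are hypothetical.

```python
import numpy as np

def pair_credits(scores, labels):
    """Credit of each (positive, negative) pair; the credits sum to AUC-ROC."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    # 1 for a correctly ranked pair, 0.5 for a tie, 0 otherwise.
    credit = (diff > 0).astype(float) + 0.5 * (diff == 0)
    return credit / (len(pos) * len(neg))

def example_attributions(scores, labels):
    """Per-example attributions, giving each example half the credit of its pairs."""
    labels = np.asarray(labels, dtype=bool)
    credit = pair_credits(scores, labels)
    attr = np.zeros(len(labels))
    attr[labels] = 0.5 * credit.sum(axis=1)   # positives: sum over their negatives
    attr[~labels] = 0.5 * credit.sum(axis=0)  # negatives: sum over their positives
    return attr

scores = [0.9, 0.8, 0.3, 0.6]
labels = [1, 0, 0, 1]
auc = pair_credits(scores, labels).sum()      # 3 of 4 pairs ranked correctly
attr = example_attributions(scores, labels)
```

    Because the attributions are additive, summing them over any slice of the data gives that slice's contribution to the overall AUC, which differs from computing a separate AUC within the slice.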
    Penalized Proximal Policy Optimization for Safe Reinforcement Learning. (arXiv:2205.11814v1 [cs.LG])
    Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle for efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis for approximate error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
    NFL: Robust Learned Index via Distribution Transformation. (arXiv:2205.11807v1 [cs.DB])
    Recent work on learned indexes opens a new direction for the indexing field. The key insight of the learned index is to approximate the mapping between keys and positions with piece-wise linear functions. Such methods require partitioning the key space for a better approximation. Although many heuristics have been proposed to improve the approximation quality, the bottleneck is that the segmentation overheads could hinder the overall performance. This paper tackles the approximation problem by applying a \textit{distribution transformation} to the keys before constructing the learned index. A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys. For effective distribution transformation, we propose a Numerical Normalizing Flow (Numerical NF). Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI). To validate the performance, comprehensive evaluations are conducted on both synthetic and real-world workloads, which show that the proposed NFL produces the highest throughput and the lowest tail latency compared to the state-of-the-art learned indexes.
    Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model. (arXiv:2205.11833v1 [cs.LG])
    Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has not been very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrated that winning-ticket subnetworks produced more diverse predictions than dense networks, and their ensemble outperformed the standard ensemble on some tasks.
    Alleviating Robust Overfitting of Adversarial Training With Consistency Regularization. (arXiv:2205.11744v1 [cs.LG])
    Adversarial training (AT) has proven to be one of the most effective ways to defend Deep Neural Networks (DNNs) against adversarial attacks. However, the phenomenon of robust overfitting, i.e., the robustness dropping sharply at a certain stage, always exists during AT. It is of great importance to decrease this robust generalization gap in order to obtain a robust model. In this paper, we present an in-depth study of robust overfitting from a new angle. We observe that consistency regularization, a popular technique in semi-supervised learning, has a similar goal to AT and can be used to alleviate robust overfitting. We empirically validate this observation, and find that a majority of prior solutions have implicit connections to consistency regularization. Motivated by this, we introduce a new AT solution, which integrates the consistency regularization and Mean Teacher (MT) strategy into AT. Specifically, we introduce a teacher model, coming from the average weights of the student models over the training steps. Then we design a consistency loss function to make the prediction distribution of the student models over adversarial examples consistent with that of the teacher model over clean samples. Experiments show that our proposed method can effectively alleviate robust overfitting and improve the robustness of DNN models against common adversarial attacks.
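    The two ingredients named above, a teacher built from averaged student weights and a consistency loss between student predictions on adversarial inputs and teacher predictions on clean inputs, can be sketched as follows. The exponential moving average and the KL divergence are common choices for Mean Teacher style training and are assumptions here; the abstract does not state which averaging rule or divergence the paper uses, and the function names are hypothetical.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(student_logits_adv, teacher_logits_clean):
    """Mean KL(teacher(clean) || student(adversarial)) over the batch."""
    p = softmax(teacher_logits_clean)
    q = softmax(student_logits_adv)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy values (assumed) to show the two pieces in isolation.
teacher = ema_update({"w": np.array([0.0])}, {"w": np.array([1.0])}, decay=0.9)
same = consistency_loss(np.array([[1.0, 2.0, 0.5]]), np.array([[1.0, 2.0, 0.5]]))
different = consistency_loss(np.array([[2.0, 1.0, 0.5]]), np.array([[1.0, 2.0, 0.5]]))
```

    In a training loop, the consistency term would be added to the usual adversarial training loss for the student, and `ema_update` would be called after each optimizer step; only the student receives gradients.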
    Wireless Ad Hoc Federated Learning: A Fully Distributed Cooperative Machine Learning. (arXiv:2205.11779v1 [cs.LG])
    Federated learning has allowed training of a global model by aggregating local models trained on local nodes. However, it still relies on a client-server model, which can be further distributed, fully decentralized, partially connected, or even totally opportunistic. In this paper, we propose wireless ad hoc federated learning (WAFL) -- a fully distributed cooperative machine learning scheme organized by physically nearby nodes. Here, each node has a wireless interface and can communicate with other nodes when they are within radio range. The nodes are expected to move with people, vehicles, or robots, producing opportunistic contacts with each other. In WAFL, each node trains a model individually with the local data it has. When a node encounters others, they exchange their trained models and generate new aggregated models, which are expected to be more general than the locally trained models on Non-IID data. For evaluation, we have prepared four static communication networks and two types of dynamic and opportunistic communication networks based on random waypoint mobility and community-structured environment, and then studied the training process of a fully connected neural network with 90% Non-IID MNIST dataset. The evaluation results indicate that WAFL allowed the convergence of model parameters among the nodes toward generalization, even with opportunistic node contact scenarios -- whereas in the self-training (or lonely training) case, they diverged. WAFL's model generalization contributed to achieving higher accuracy (94.7-96.2%) on the IID test dataset compared to the self-training case (84.7%).
    Constrained Monotonic Neural Networks. (arXiv:2205.11775v1 [cs.LG])
    Deep neural networks are becoming increasingly popular in approximating arbitrary functions from noisy data. But wider adoption is being hindered by the need to explain such models and to impose additional constraints on them. The monotonicity constraint is one of the most requested properties in real-world scenarios and is the focus of this paper. One of the oldest ways to construct a monotonic fully connected neural network is to constrain its weights to be non-negative while employing a monotonic activation function. Unfortunately, this construction does not work with popular non-saturated activation functions such as ReLU, ELU, SELU, etc., as it can only approximate convex functions. We show this shortcoming can be fixed by employing the original activation function for a part of the neurons in the layer, and employing its point reflection for the other part. Our experiments show that this approach of building monotonic deep neural networks has matching or better accuracy when compared to other state-of-the-art methods such as deep lattice networks or monotonic networks obtained by heuristic regularization. This method is the simplest one in the sense of having the least number of parameters and not requiring any modifications to the learning procedure or post-learning steps.
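    A minimal sketch of the construction described above: weights are kept non-negative, part of each layer uses ReLU (convex), and the rest uses its point reflection about the origin, -ReLU(-z) (concave). Enforcing non-negativity by taking the absolute value of the weights, the layer sizes, and the function names are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_reflected(z):
    """Point reflection of ReLU about the origin: -relu(-z) = min(z, 0)."""
    return -relu(-z)

def monotone_layer(x, W, b, n_convex):
    """Non-negative weights, then ReLU on the first n_convex units and its reflection on the rest."""
    z = x @ np.abs(W) + b  # |W| keeps every weight non-negative (assumed mechanism)
    return np.concatenate(
        [relu(z[:, :n_convex]), relu_reflected(z[:, n_convex:])], axis=1
    )

def monotone_net(x, params, n_convex):
    """Stack monotone layers; the final layer is linear with non-negative weights."""
    h = x
    for W, b in params[:-1]:
        h = monotone_layer(h, W, b, n_convex)
    W, b = params[-1]
    return h @ np.abs(W) + b

# Random, untrained weights (assumed sizes: 1 input, two hidden layers of 8, 1 output).
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(1, 8)), rng.normal(size=8)),
    (rng.normal(size=(8, 8)), rng.normal(size=8)),
    (rng.normal(size=(8, 1)), rng.normal(size=1)),
]
x = np.linspace(-3.0, 3.0, 50)[:, None]
y = monotone_net(x, params, n_convex=4)
```

    Every unit is a non-decreasing function of its inputs, so the output is non-decreasing in `x` for any weight values, while mixing convex and concave units lets the network represent shapes that are neither convex nor concave.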
    ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. (arXiv:2205.11728v1 [cs.IR])
    Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).
    On the Role of Bidirectionality in Language Model Pre-Training. (arXiv:2205.11726v1 [cs.CL])
    Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
    BabyBear: Cheap inference triage for expensive language models. (arXiv:2205.11747v1 [cs.CL])
    Transformer language models provide superior accuracy over previous models but they are computationally and environmentally expensive. Borrowing the concept of model cascading from computer vision, we introduce BabyBear, a framework for cascading models for natural language processing (NLP) tasks to minimize cost. The core strategy is inference triage, exiting early when the least expensive model in the cascade achieves a sufficiently high-confidence prediction. We test BabyBear on several open source data sets related to document classification and entity recognition. We find that for common NLP tasks a high proportion of the inference load can be accomplished with cheap, fast models that have learned by observing a deep learning model. This allows us to reduce the compute cost of large-scale classification jobs by more than 50% while retaining overall accuracy. For named entity recognition, we save 33% of the deep learning compute while maintaining an F1 score higher than 95% on the CoNLL benchmark.
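    The inference triage strategy described above can be sketched as a two-stage cascade: a cheap model answers when its confidence clears a threshold, and only the remaining inputs are sent to the expensive model. The function names, the threshold value, and the toy keyword models below are hypothetical and are not BabyBear's actual API.

```python
from typing import Callable, List, Tuple

def triage(
    texts: List[str],
    cheap: Callable[[str], Tuple[str, float]],
    expensive: Callable[[str], str],
    threshold: float = 0.9,
) -> Tuple[List[str], int]:
    """Return predicted labels and the number of calls made to the expensive model."""
    labels: List[str] = []
    expensive_calls = 0
    for text in texts:
        label, confidence = cheap(text)
        if confidence < threshold:
            # Low confidence: fall through to the expensive model.
            label = expensive(text)
            expensive_calls += 1
        labels.append(label)
    return labels, expensive_calls

# Toy stand-ins (assumed) for a cheap student model and an expensive language model.
def cheap_model(text: str) -> Tuple[str, float]:
    if "refund" in text:
        return ("billing", 0.97)
    return ("other", 0.55)

def expensive_model(text: str) -> str:
    return "technical" if "crash" in text else "other"

labels, n_expensive = triage(
    ["refund please", "app crash on start", "hello"], cheap_model, expensive_model
)
```

    The threshold trades compute for accuracy: raising it routes more inputs to the expensive model, and lowering it saves more compute at the risk of accepting wrong cheap predictions.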
    HiPAL: A Deep Framework for Physician Burnout Prediction Using Activity Logs in Electronic Health Records. (arXiv:2205.11680v1 [cs.LG])
    Burnout is a significant public health concern affecting nearly half of the healthcare workforce. This paper presents the first end-to-end deep learning framework for predicting physician burnout based on clinician activity logs, digital traces of their work activities, available in any electronic health record (EHR) system. In contrast to prior approaches that exclusively relied on surveys for burnout measurement, our framework directly learns deep workload representations from large-scale clinician activity logs to predict burnout. We propose the Hierarchical burnout Prediction based on Activity Logs (HiPAL), featuring a pre-trained time-dependent activity embedding mechanism tailored for activity logs and a hierarchical predictive model, which mirrors the natural hierarchical structure of clinician activity logs and captures physicians' evolving workload patterns at both short-term and long-term levels. To utilize the large amount of unlabeled activity logs, we propose a semi-supervised framework that learns to transfer knowledge extracted from unlabeled clinician activities to the HiPAL-based prediction model. The experiment on over 15 million clinician activity logs collected from the EHR at a large academic medical center demonstrates the advantages of our proposed framework in predictive performance of physician burnout and training efficiency over state-of-the-art approaches.
    Demand Response Method Considering Multiple Types of Flexible Loads in Industrial Parks. (arXiv:2205.11743v1 [eess.SY])
    With the rapid development of the energy internet, the proportion of flexible loads in smart grids is getting much higher than before. It is highly important to model flexible loads based on demand response. Therefore, a new demand response method considering multiple flexible loads is proposed in this paper to characterize the integrated demand response (IDR) resources. Firstly, a physical process analytical deduction (PPAD) model is proposed to improve the classification of flexible loads in industrial parks. Scenario generation, data point augmentation, and smooth curves under various operating conditions are considered to enhance the applicability of the model. Secondly, in view of the strong volatility and poor modeling effect of Wasserstein-generative adversarial networks (WGAN), an improved WGAN-gradient penalty (IWGAN-GP) model is developed to achieve faster convergence than traditional WGAN and generate higher-quality samples. Finally, the PPAD and IWGAN-GP models are jointly implemented to reveal the degree of correlation between flexible loads. Meanwhile, an intelligent offline database is built to deal with the impact of nonlinear factors in different response scenarios. Numerical examples show that the proposed method is significantly better than the existing technologies in reducing load modeling deviation and improving the responsiveness of park loads.
    Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy. (arXiv:2106.01100v3 [eess.IV] UPDATED)
    During lung radiotherapy, the position of infrared reflective objects on the chest can be recorded to estimate the tumor location. However, radiotherapy systems have a latency inherent to robot control limitations that impedes the radiation delivery precision. Prediction with online learning of recurrent neural networks (RNN) allows for adaptation to non-stationary respiratory signals, but classical methods such as RTRL and truncated BPTT are respectively slow and biased. This study investigates the capabilities of unbiased online recurrent optimization (UORO) to forecast respiratory motion and enhance safety in lung radiotherapy. We used 9 observation records of the 3D position of 3 external markers on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The sampling frequency was 10Hz, and the amplitudes of the recorded trajectories range from 6mm to 40mm in the superior-inferior direction. We forecast the 3D location of each marker simultaneously with a horizon value between 0.1s and 2.0s, using an RNN trained with UORO. We compare its performance with an RNN trained with RTRL, LMS, and offline linear regression. We provide closed-form expressions for quantities involved in the gradient loss calculation in UORO, thereby making its implementation efficient. Training and cross-validation were performed during the first minute of each sequence. On average over the horizon values considered and the 9 sequences, UORO achieves the lowest root-mean-square (RMS) error and maximum error among the compared algorithms. These errors are respectively equal to 1.3mm and 8.8mm, and the prediction time per time step was lower than 2.8ms (Dell Intel core i9-9900K 3.60 GHz). Linear regression has the lowest RMS error for the horizon values 0.1s and 0.2s, followed by LMS for horizon values between 0.3s and 0.5s, and UORO for horizon values greater than 0.6s.
    Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable. (arXiv:2205.11716v1 [cs.LG])
    Recently, neural networks have been shown to perform exceptionally well in transforming two arbitrary sets into two linearly separable sets. Doing this with a randomly initialized neural network is of immense interest because the associated computation is cheaper than using fully trained networks. In this paper, we show that, with sufficient width, a randomly initialized one-layer neural network transforms two sets into two linearly separable sets with high probability. Furthermore, we provide explicit bounds on the required width of the neural network for this to occur. Our first bound is exponential in the input dimension and polynomial in all other parameters, while our second bound is independent of the input dimension, thereby overcoming the curse of dimensionality. We also perform an experimental study comparing the separation capacity of randomly initialized one-layer and two-layer neural networks. With correctly chosen biases, our study shows for low-dimensional data, the two-layer neural network outperforms the one-layer network. However, the opposite is observed for higher-dimensional data.
    Embedding Neighborhoods Simultaneously t-SNE (ENS-t-SNE). (arXiv:2205.11720v1 [cs.LG])
    We propose an algorithm for visualizing a dataset by embedding it in 3-dimensional Euclidean space based on various given distances between the same pairs of datapoints. Its aim is to find an Embedding which preserves Neighborhoods Simultaneously for all given distances by generalizing the t-Stochastic Neighborhood Embedding approach (ENS-t-SNE). We illustrate the utility of ENS-t-SNE by demonstrating its use in three applications. First, to visualize different notions of clusters and groups within the same high-dimensional dataset with one 3-dimensional embedding, as opposed to providing different embeddings of the same data and trying to match the corresponding points. Second, to illustrate the effects of different hyper-parameters of the classical t-SNE. Third, by considering multiple different notions of clustering in data, ENS-t-SNE can generate an embedding that differs from that of classic t-SNE. We provide an extensive quantitative evaluation with real-world and synthetic datasets of different sizes and using different numbers of projections.
    High-Order Pooling for Graph Neural Networks with Tensor Decomposition. (arXiv:2205.11691v1 [cs.LG])
    Graph Neural Networks (GNNs) are attracting growing attention due to their effectiveness and flexibility in modeling a variety of graph-structured data. Existing GNN architectures usually adopt simple pooling operations (e.g., sum, average, max) when aggregating messages from a local neighborhood for updating node representation or pooling node representations from the entire graph to compute the graph representation. Though simple and effective, these linear operations do not model high-order non-linear interactions among nodes. We propose the Tensorized Graph Neural Network (tGNN), a highly expressive GNN architecture relying on tensor decomposition to model high-order non-linear node interactions. tGNN leverages the symmetric CP decomposition to efficiently parameterize permutation-invariant multilinear maps for modeling node interactions. Theoretical and empirical analyses on both node and graph classification tasks show the superiority of tGNN over competitive baselines. In particular, tGNN achieves state-of-the-art results on two OGB node classification datasets and one OGB graph classification dataset.
    Semi-Parametric Deep Neural Networks in Linear Time and Memory. (arXiv:2205.11718v1 [cs.LG])
    Recent advances in deep learning have been driven by large-scale parametric models, which can be computationally expensive and lack interpretability. Semi-parametric methods query the training set at inference time and can be more compact, although they typically have quadratic computational complexity. Here, we introduce SPIN, a general-purpose semi-parametric neural architecture whose computational cost is linear in the size and dimensionality of the data. Our architecture is inspired by inducing point methods and relies on a novel application of cross-attention between datapoints. At inference time, its computational cost is constant in the training set size as the data gets distilled into a fixed number of inducing points. We find that our method reduces the computational requirements of existing semi-parametric models by up to an order of magnitude across a range of datasets and improves state-of-the-art performance on an important practical problem, genotype imputation.
    Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. (arXiv:2205.12184v1 [cs.LG])
    Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for Itô diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
    Policy Compliance Detection via Expression Tree Inference. (arXiv:2205.12259v1 [cs.CL])
    Policy Compliance Detection (PCD) is a task we encounter when reasoning over texts, e.g. legal frameworks. Previous work to address PCD relies heavily on modeling the task as a special case of Recognizing Textual Entailment. Entailment is applicable to the problem of PCD; however, viewing the policy as a single proposition, as opposed to multiple interlinked propositions, yields poor performance and lacks explainability. To address this challenge, more recent proposals for PCD have argued for decomposing policies into expression trees consisting of questions connected with logic operators. Question answering is used to obtain answers to these questions with respect to a scenario. Finally, the expression tree is evaluated in order to arrive at an overall solution. However, this work assumes expression trees are provided by experts, thus limiting its applicability to new policies. In this work, we learn how to infer expression trees automatically from policy texts. We ensure the validity of the inferred trees by introducing constrained decoding using a finite state automaton. We determine through automatic evaluation that 63% of the expression trees generated by our constrained generation model are logically equivalent to gold trees. Human evaluation shows that 88% of trees generated by our model are correct.
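The evaluation step described in this abstract (questions at the leaves, logic operators at the internal nodes) can be illustrated with a minimal sketch. The tuple encoding, the operator names, and the question identifiers below are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding: a leaf is a question id (str); an internal node is a
# tuple (operator, child, ...) with operator in {"and", "or", "not"}.
Tree = Union[str, Tuple]


def evaluate(tree: Tree, answers: Dict[str, bool]) -> bool:
    """Evaluate an expression tree given yes/no answers to its questions."""
    if isinstance(tree, str):
        return answers[tree]  # leaf: look up the answer to this question
    op, *children = tree
    values = [evaluate(child, answers) for child in children]
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    if op == "not":
        if len(values) != 1:
            raise ValueError("'not' takes exactly one child")
        return not values[0]
    raise ValueError(f"unknown operator: {op}")


# Illustrative policy: resident AND (employed OR NOT minor).
policy = ("and", "is_resident", ("or", "is_employed", ("not", "is_minor")))
scenario = {"is_resident": True, "is_employed": False, "is_minor": False}
compliant = evaluate(policy, scenario)
```

In the pipeline the abstract describes, the answers would come from a question answering model applied to the scenario; here they are supplied by hand.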
    Psychotic Relapse Prediction in Schizophrenia Patients using A Mobile Sensing-based Supervised Deep Learning Model. (arXiv:2205.12225v1 [cs.LG])
    Mobile sensing-based modeling of behavioral changes could predict an oncoming psychotic relapse in schizophrenia patients for timely interventions. Deep learning models could complement existing non-deep learning models for relapse prediction by modeling latent behavioral features relevant to the prediction. However, given the inter-individual behavioral differences, model personalization might be required for a predictive model. In this work, we propose RelapsePredNet, a Long Short-Term Memory (LSTM) neural network-based model for relapse prediction. The model is personalized for a particular patient by training using data from patients most similar to the given patient. Several demographics and baseline mental health scores were considered as personalization metrics to define patient similarity. We investigated the effect of personalization on training dataset characteristics, learned embeddings, and relapse prediction performance. We compared RelapsePredNet with a deep learning-based anomaly detection model for relapse prediction. Further, we investigated if RelapsePredNet could complement ClusterRFModel (a random forest model leveraging clustering and template features proposed in prior work) in a fusion model, by identifying latent behavioral features relevant for relapse prediction. The CrossCheck dataset consisting of continuous mobile sensing data obtained from 63 schizophrenia patients, each monitored for up to a year, was used for our evaluations. The proposed RelapsePredNet outperformed the deep learning-based anomaly detection model for relapse prediction. The F2 scores for prediction were 0.21 and 0.52 in the full test set and the Relapse Test Set (consisting of data from patients who have had relapse only), respectively. These corresponded to a 29.4% and 38.8% improvement compared to the existing deep learning-based model for relapse prediction.
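For readers unfamiliar with the F2 score reported above: it is the standard F-beta measure with beta = 2, which weights recall more heavily than precision. A minimal sketch follows; the precision and recall values in the usage line are illustrative only and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta = 2 gives the F2 score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention: avoid division by zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: perfect precision, 50% recall.
example_f2 = f_beta(1.0, 0.5)
```

With beta = 2, a missed relapse (low recall) lowers the score more than a false alarm (low precision) does.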
    Forecasting Multilinear Data via Transform-Based Tensor Autoregression. (arXiv:2205.12201v1 [cs.LG])
    In the era of big data, there is an increasing demand for new methods for analyzing and forecasting 2-dimensional data. The current research aims to accomplish these goals through the combination of time-series modeling and multilinear algebraic systems. We extend previous autoregressive techniques to forecast multilinear data with a method named the L-Transform Tensor autoregressive (L-TAR for short). Tensor decompositions and multilinear tensor products make this approach a feasible method of forecasting. We achieve statistical independence between the columns of the observations through invertible discrete linear transforms, enabling a divide-and-conquer approach. We present an experimental validation of the proposed methods on datasets containing image collections, video sequences, sea surface temperature measurements, stock prices, and networks.
    Learning multi-scale functional representations of proteins from single-cell microscopy data. (arXiv:2205.11676v1 [q-bio.QM])
    Protein function is inherently linked to its localization within the cell, and fluorescent microscopy data is an indispensable resource for learning representations of proteins. Despite major developments in molecular representation learning, extracting functional information from biological images remains a non-trivial computational task. Current state-of-the-art approaches use autoencoder models to learn high-quality features by reconstructing images. However, such methods are prone to capturing noise and imaging artifacts. In this work, we revisit deep learning models used for classifying major subcellular localizations, and evaluate representations extracted from their final layers. We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information, and significantly outperform autoencoder-based models. We also propose a robust evaluation strategy to assess the quality of protein representations across different scales of biological function.
    Compressing Deep Graph Neural Networks via Adversarial Knowledge Distillation. (arXiv:2205.11678v1 [cs.LG])
    Deep graph neural networks (GNNs) have been shown to be expressive for modeling graph-structured data. Nevertheless, the over-stacked architecture of deep graph models makes them difficult to deploy and rapidly test on mobile or embedded systems. To compress over-stacked GNNs, knowledge distillation via a teacher-student architecture turns out to be an effective technique, where the key step is to measure the discrepancy between teacher and student networks with predefined distance functions. However, using the same distance for graphs of various structures may be unsuitable, and the optimal distance formulation is hard to determine. To tackle these problems, we propose a novel Adversarial Knowledge Distillation framework for graph models named GraphAKD, which adversarially trains a discriminator and a generator to adaptively detect and decrease the discrepancy. Specifically, noticing that the well-captured inter-node and inter-class correlations favor the success of deep GNNs, we propose to criticize the inherited knowledge from node-level and class-level views with a trainable discriminator. The discriminator distinguishes between teacher knowledge and what the student inherits, while the student GNN works as a generator and aims to fool the discriminator. To the best of our knowledge, GraphAKD is the first to introduce adversarial training to knowledge distillation in graph domains. Experiments on node-level and graph-level classification benchmarks demonstrate that GraphAKD improves the student performance by a large margin. The results imply that GraphAKD can precisely transfer knowledge from a complicated teacher GNN to a compact student GNN.
    FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?. (arXiv:2205.11656v1 [cs.LG])
    The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. Training such models and exploring their hyperparameter space, however, is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, analysis has been limited to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves 8.9% higher GLUE score. A FlexiBERT model whose performance matches that of the best homogeneous model is 2.6x smaller. FlexiBERT-Large, another proposed model, achieves state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.
    Improving Fairness for Data Valuation in Horizontal Federated Learning. (arXiv:2109.09046v3 [cs.LG] UPDATED)
    Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners' participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
    Intelligence Primer. (arXiv:2008.07324v3 [cs.AI] UPDATED)
    Intelligence is a fundamental part of all living things, as well as the foundation for Artificial Intelligence. In this primer we explore the ideas associated with intelligence and, by doing so, understand the implications and constraints and potentially outline the capabilities of future systems. Artificial Intelligence, in the form of Machine Learning, has already had a significant impact on our lives. As an exploration, we journey into different parts of intelligence that appear essential. We hope that people find this helpful in determining the future. Also, during the exploration, we hope to create new thought-provoking questions. Intelligence is not a single weighable quantity but a subject that spans Biology, Physics, Philosophy, Cognitive Science, Neuroscience, Psychology, and Computer Science. The historian Yuval Noah Harari pointed out that engineers and scientists in the future will have to broaden their understandings to include disciplines such as Psychology, Philosophy, and Ethics. Fiction writers have long portrayed engineers and scientists as deficient in these areas. Today, in modern society, the emergence of Artificial Intelligence and legal requirements act as forcing functions to push these broader subjects into the foreground. We start with an introduction to intelligence and move quickly to more profound thoughts and ideas. We call this a Life, the Universe, and Everything primer, after the famous science fiction book by Douglas Adams. Forty-two may be the correct answer, but what are the questions?
    Throwing Away Data Improves Worst-Class Error in Imbalanced Classification. (arXiv:2205.11672v1 [stat.ML])
    Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that \emph{more data is better}, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority group in the training data, and finds additional motivation in the robustness, fairness, and out-of-distribution literatures. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match the size of the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, \emph{throwing away most data from the majority class leads to better worst-class error}.
    Eigenvalue and Generalized Eigenvalue Problems: Tutorial. (arXiv:1903.11240v2 [stat.ML] UPDATED)
    This paper is a tutorial for eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we mention the optimization problems which yield the eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.
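As a small illustration of the eigenvalue problem A v = λ v mentioned in this abstract, the sketch below uses power iteration with a Rayleigh quotient to approximate the dominant eigenpair of a symmetric matrix. Power iteration is one standard numerical approach chosen here for brevity; it is not necessarily the solution method presented in the tutorial, and the matrix and function names are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(A: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix A by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def power_iteration(A: Matrix, iters: int = 100) -> Tuple[float, List[float]]:
    """Approximate the dominant eigenvalue and unit eigenvector of symmetric A."""
    # Assumed start vector; it must not be orthogonal to the dominant eigenvector.
    v = [1.0] + [0.0] * (len(A) - 1)
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v (v has unit norm) estimates the eigenvalue.
    eigenvalue = sum(x * y for x, y in zip(v, matvec(A, v)))
    return eigenvalue, v


# Symmetric example with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
lam, vec = power_iteration(A)
```

For this matrix the iteration converges to the eigenvalue 3 with an eigenvector proportional to (1, 1).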
    Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment. (arXiv:2205.11616v1 [cs.CL])
    Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models for enabling a more efficient and robust UWT. Specifically, we develop a novel UWT method dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we retrieve word pairs with high confidences of similarity, computed using our proposed image-based fingerprints, which define the initial pivot for the word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between two embedding spaces, which iteratively corrects and refines the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for a few language pairs across different word embeddings and displays great robustness to the dissimilarity of language pairs or training corpora for two word embeddings.
    Deep Representations for Time-varying Brain Datasets. (arXiv:2205.11648v1 [cs.LG])
    Finding an appropriate representation of dynamic activities in the brain is crucial for many downstream applications. Due to its highly dynamic nature, temporally averaged fMRI (functional magnetic resonance imaging) can only provide a narrow view of underlying brain activities. Previous works lack the ability to learn and interpret the latent dynamics in brain architectures. This paper builds an efficient graph neural network model that incorporates both region-mapped fMRI sequences and structural connectivities obtained from DWI (diffusion-weighted imaging) as inputs. We find good representations of the latent brain dynamics through learning sample-level adaptive adjacency matrices and performing a novel multi-resolution inner cluster smoothing. These modules can be easily adapted to and are potentially useful for other applications outside the neuroscience domain. We also attribute inputs with integrated gradients, which enables us to infer (1) highly involved brain connections and subnetworks for each task, (2) temporal keyframes of imaging sequences that characterize tasks, and (3) subnetworks that discriminate between individual subjects. This ability to identify critical subnetworks that characterize signal states across heterogeneous tasks and individuals is of great importance to neuroscience and other scientific domains. Extensive experiments and ablation studies demonstrate our proposed method's superiority and efficiency in spatial-temporal graph signal modeling with insightful interpretations of brain dynamics.
    A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature. (arXiv:2205.11651v1 [cs.DL])
    Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
    DOGE-Train: Discrete Optimization on GPU with End-to-end Training. (arXiv:2205.11638v1 [cs.LG])
    We present a fast, scalable, data-driven approach for solving linear relaxations of 0-1 integer linear programs using a graph neural network. Our solver is based on the Lagrange-decomposition-based algorithm FastDOG (Abbas et al., 2022). We make the algorithm differentiable and perform backpropagation through the dual update scheme for end-to-end training of its algorithmic parameters. This allows us to preserve the algorithm's theoretical properties including feasibility and guaranteed non-decrease in the lower bound. Since FastDOG can get stuck in suboptimal fixed points, we provide additional freedom to our graph neural network to predict non-parametric update steps for escaping such points while maintaining dual feasibility. For training of the graph neural network we use an unsupervised loss and perform experiments on large-scale real-world datasets. We train on smaller problems and test on larger ones showing strong generalization performance with a graph neural network comprising only around 10k parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version. In comparison to commercial solvers our learned solver achieves close to optimal objective values of LP relaxations and is faster by up to an order of magnitude on very large problems from structured prediction and on selected combinatorial optimization problems.
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v1 [stat.ML])
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models - the Variational Auto-Encoder (VAE). We point out the two generalization gaps that can affect the generalization ability of VAEs and show that the over-fitting phenomenon is usually dominated by the amortized inference network. Based on this observation we propose a new training objective, inspired by the classic wake-sleep algorithm, to improve the generalization properties of amortized inference. We also demonstrate how it can improve generalization performance in the context of image modeling and lossless compression.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v1 [cs.LG])
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v1 [cs.LG])
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are that (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms, and (2) it introduces a multi-task-learning-based 'consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data, non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.
    PrivFairFL: Privacy-Preserving Group Fairness in Federated Learning. (arXiv:2205.11584v1 [cs.LG])
    Group fairness ensures that the outcomes of machine learning (ML) based decision making systems are not biased towards a certain group of people defined by a sensitive attribute such as gender or ethnicity. Achieving group fairness in Federated Learning (FL) is challenging because mitigating bias inherently requires using the sensitive attribute values of all clients, while FL is aimed precisely at protecting privacy by not giving access to the clients' data. As we show in this paper, this conflict between fairness and privacy in FL can be resolved by combining FL with Secure Multiparty Computation (MPC) and Differential Privacy (DP). In doing so, we propose a method for training group-fair ML models in cross-device FL under complete and formal privacy guarantees, without requiring the clients to disclose their sensitive attribute values.
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v1 [stat.ML])
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we discuss the Gaussian case extensively, since its updates have a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior do not involve gradients with respect to the model parameters or the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details about its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
    Cardiomegaly Detection using Deep Convolutional Neural Network with U-Net. (arXiv:2205.11515v1 [eess.IV])
    Cardiomegaly is a medical condition in which the heart is enlarged. It is easier to manage if caught early, so early detection is critical. The chest X-ray, one of the most frequently used radiography examinations, has been used to detect and visualize abnormalities of human organs for decades, and it is also a significant diagnostic tool for cardiomegaly. Even for domain experts, distinguishing the many types of diseases from an X-ray is a difficult and time-consuming task. Deep learning models are most effective when used on huge datasets, yet due to privacy concerns, large datasets are rarely available in the medical industry. This research presents a customized, retrained deep learning U-Net model for detecting cardiomegaly. In the training phase, chest X-ray images from the open-source "ChestX-ray8" dataset are used. To reduce computing time, the model performs data preprocessing, image enhancement, image compression, and classification before moving on to the training step. In simulations on a chest X-ray image dataset, the model achieved a diagnostic accuracy of 94%, a sensitivity of 96.2%, and a specificity of 92.5%, which beats prior pre-trained model findings for identifying cardiomegaly.
    BolT: Fused Window Transformers for fMRI Time Series Analysis. (arXiv:2205.11578v1 [eess.SP])
    Functional magnetic resonance imaging (fMRI) enables examination of inter-regional interactions in the brain via functional connectivity (FC) analyses that measure the synchrony between the temporal activations of separate regions. Given their exceptional sensitivity, deep-learning methods have received growing interest for FC analyses of high-dimensional fMRI data. In this domain, models that operate directly on raw time series as opposed to pre-computed FC features have the potential benefit of leveraging the full scale of information present in fMRI data. However, previous models are based on architectures suboptimal for temporal integration of representations across multiple time scales. Here, we present BolT, blood-oxygen-level-dependent transformer, for analyzing multi-variate fMRI time series. BolT leverages a cascade of transformer encoders equipped with a novel fused window attention mechanism. Transformer encoding is performed on temporally-overlapped time windows within the fMRI time series to capture short time-scale representations. To integrate information across windows, cross-window attention is computed between base tokens in each time window and fringe tokens from neighboring time windows. To transition from local to global representations, the extent of window overlap and thereby number of fringe tokens is progressively increased across the cascade. Finally, a novel cross-window regularization is enforced to align the high-level representations of global $CLS$ features across time windows. Comprehensive experiments on public fMRI datasets clearly illustrate the superior performance of BolT against state-of-the-art methods. Posthoc explanatory analyses to identify landmark time points and regions that contribute most significantly to model decisions corroborate prominent neuroscientific findings from recent fMRI studies.
    Privacy-preserving Data Filtering in Federated Learning Using Influence Approximation. (arXiv:2205.11518v1 [cs.CR])
    Federated Learning is by nature susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied because the data is never revealed. We present a novel technique for filtering and scoring data based on a practical influence approximation that can be implemented in a privacy-preserving manner. Each agent uses its own data to evaluate the influence of another agent's batch and reports to the center a score obfuscated using differential privacy. Our technique allows for almost perfect ($>92\%$ recall) filtering of corrupted data in a variety of applications using real data. Importantly, the accuracy does not degrade significantly even under strong privacy guarantees ($\varepsilon \leq 1$), especially under realistic percentages of mislabeled data (for $15\%$ mislabeled data we lose only $10\%$ in accuracy).
    Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums. (arXiv:2205.12173v1 [cs.LG])
    Byzantine resilience has emerged as a prominent topic within the distributed machine learning community. Essentially, the goal is to enhance distributed optimization algorithms, such as distributed SGD, in a way that guarantees convergence despite the presence of some misbehaving (a.k.a. Byzantine) workers. Although a myriad of techniques addressing the problem have been proposed, the field arguably rests on fragile foundations. These techniques are hard to prove correct and rely on assumptions that are (a) quite unrealistic, i.e., often violated in practice, and (b) heterogeneous, i.e., making it difficult to compare approaches. We present RESAM (RESilient Averaging of Momentums), a unified framework that makes it simple to establish optimal Byzantine resilience, relying only on standard machine learning assumptions. Our framework is mainly composed of two operators: resilient averaging at the server and distributed momentum at the workers. We prove a general theorem stating the convergence of distributed SGD under RESAM. Interestingly, demonstrating and comparing the convergence of many existing techniques become direct corollaries of our theorem, without resorting to stringent assumptions. We also present an empirical evaluation of the practical relevance of RESAM.
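    The two operators named in the RESAM abstract can be illustrated with a minimal sketch. It assumes a coordinate-wise trimmed mean as the resilient averaging rule and the convention m_t = beta * m_{t-1} + (1 - beta) * g_t for worker momentum; neither choice is confirmed by the abstract, and the function names `update_momentum`, `trimmed_mean`, and `server_step` are hypothetical.

```python
from typing import List

Vector = List[float]


def update_momentum(prev: Vector, grad: Vector, beta: float = 0.9) -> Vector:
    """Worker side (assumed convention): m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    return [beta * m + (1.0 - beta) * g for m, g in zip(prev, grad)]


def trimmed_mean(vectors: List[Vector], f: int) -> Vector:
    """Server side (assumed rule): in each coordinate, drop the f smallest
    and f largest values, then average the remaining ones."""
    n = len(vectors)
    if n <= 2 * f:
        raise ValueError("need more than 2*f vectors to trim f from each side")
    result = []
    for coord in zip(*vectors):
        kept = sorted(coord)[f:n - f]
        result.append(sum(kept) / len(kept))
    return result


def server_step(params: Vector, momentums: List[Vector], f: int, lr: float = 0.1) -> Vector:
    """One descent step using the aggregated worker momentums."""
    aggregate = trimmed_mean(momentums, f)
    return [p - lr * a for p, a in zip(params, aggregate)]


# Three honest workers and one worker sending an arbitrary (Byzantine) vector.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
byzantine = [[100.0, -100.0]]
aggregate = trimmed_mean(honest + byzantine, f=1)
```

    With `f=1`, the extreme values contributed by the fourth worker are discarded in each coordinate, so `aggregate` stays within the range of the honest momentums.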
    Learning for Expressive Task-Related Sentence Representations. (arXiv:2205.12186v1 [cs.CL])
    NLP models learn sentence representations for downstream tasks by tuning a model which is pre-trained by masked language modeling. However, after tuning, the learned sentence representations may be skewed heavily toward label space and thus are not expressive enough to represent whole samples, which should contain task-related information of both sentence inputs and labels. In this work, we learn expressive sentence representations for supervised tasks which (1) contain task-related information in the sentence inputs, and (2) enable correct label predictions. To achieve this goal, we first propose a new objective which explicitly points out the label token space in the input, and predicts categories of labels via an added [MASK] token. This objective encourages fusing the semantic information of both the label and sentence. Then we develop a neighbor attention module, added on a frozen pre-trained model, to build connections between label/sentence tokens via their neighbors. The propagation can be further guided by the regularization on neighborhood representations to encourage expressiveness. Experimental results show that, despite tuning only 5% additional parameters over a frozen pre-trained model, our model can achieve classification results comparable to the SOTA while maintaining strong expressiveness as well.
    Compression-aware Training of Neural Networks using Frank-Wolfe. (arXiv:2205.11921v1 [cs.LG])
    Many existing Neural Network pruning approaches either rely on retraining to compensate for pruning-caused performance degradation or they induce strong biases to converge to a specific sparse solution throughout training. A third paradigm obtains a wide range of compression ratios from a single dense training run while also avoiding retraining. Recent work of Pokutta et al. (2020) and Miao et al. (2022) suggests that the Stochastic Frank-Wolfe (SFW) algorithm is particularly suited for training state-of-the-art models that are robust to compression. We propose leveraging $k$-support norm ball constraints and demonstrate significant improvements over the results of Miao et al. (2022) in the case of unstructured pruning. We also extend these ideas to the structured pruning domain and propose novel approaches to both ensure robustness to the pruning of convolutional filters as well as to low-rank tensor decompositions of convolutional layers. In the latter case, our approach performs on-par with nuclear-norm regularization baselines while requiring only half of the computational resources. Our findings also indicate that the robustness of SFW-trained models largely depends on the gradient rescaling of the learning rate and we establish a theoretical foundation for that practice.
    Neur2SP: Neural Two-Stage Stochastic Programming. (arXiv:2205.12006v1 [math.OC])
    Stochastic programming is a powerful modeling framework for decision-making under uncertainty. In this work, we tackle two-stage stochastic programs (2SPs), the most widely applied and studied class of stochastic programming models. Solving 2SPs exactly requires evaluation of an expected value function that is computationally intractable. Additionally, having a mixed-integer linear program (MIP) or a nonlinear program (NLP) in the second stage further aggravates the problem difficulty. In such cases, solving them can be prohibitively expensive even if specialized algorithms that exploit problem structure are employed. Finding high-quality (first-stage) solutions -- without leveraging problem structure -- can be crucial in such settings. We develop Neur2SP, a new method that approximates the expected value function via a neural network to obtain a surrogate model that can be solved more efficiently than the traditional extensive formulation approach. Moreover, Neur2SP makes no assumptions about the problem structure, in particular about the second-stage problem, and can be implemented using an off-the-shelf solver and open-source libraries. Our extensive computational experiments on benchmark 2SP datasets from four problem classes with different structures (containing MIP and NLP second-stage problems) show the efficiency (time) and efficacy (solution quality) of Neur2SP. Specifically, the proposed method takes less than 1.66 seconds across all problems, achieving high-quality solutions even as the number of scenarios increases, an ideal property that is difficult to have for traditional 2SP solution techniques. Namely, the most generic baseline method typically requires minutes to hours to find solutions of comparable quality.
    Deep Low-Density Separation for Semi-Supervised Classification. (arXiv:2205.11995v1 [cs.LG])
    Given a small set of labeled data and a large set of unlabeled data, semi-supervised learning (SSL) attempts to leverage the location of the unlabeled datapoints in order to create a better classifier than could be obtained from supervised methods applied to the labeled training set alone. Effective SSL imposes structural assumptions on the data, e.g. that neighbors are more likely to share a classification or that the decision boundary lies in an area of low density. For complex and high-dimensional data, neural networks can learn feature embeddings to which traditional SSL methods can then be applied in what we call hybrid methods. Previously-developed hybrid methods iterate between refining a latent representation and performing graph-based SSL on this representation. In this paper, we introduce a novel hybrid method that instead applies low-density separation to the embedded features. We describe it in detail and discuss why low-density separation may be better suited for SSL on neural network-based embeddings than graph-based algorithms. We validate our method using in-house customer survey data and compare it to other state-of-the-art learning methods. Our approach effectively classifies thousands of unlabeled users from a relatively small number of hand-classified examples.
    Ensemble Multi-Relational Graph Neural Networks. (arXiv:2205.12076v1 [cs.LG])
    It is well established that graph neural networks (GNNs) can be interpreted and designed from the perspective of an optimization objective. With a clear optimization objective, the derived GNN architecture has a sound theoretical foundation, which makes it possible to flexibly remedy the weaknesses of GNNs. However, this optimization objective has only been established for GNNs on single-relational graphs. Can we derive a new type of GNN for multi-relational graphs by extending this optimization objective, so as to simultaneously solve the issues in previous multi-relational GNNs, e.g., over-parameterization? In this paper, we propose a novel ensemble multi-relational GNN by designing an ensemble multi-relational (EMR) optimization objective. This EMR optimization objective yields an iterative updating rule, which can be formalized as an ensemble message passing (EnMP) layer with multi-relations. We further analyze the properties of the EnMP layer, e.g., its relationship with multi-relational personalized PageRank. Finally, we propose a new multi-relational GNN that alleviates the over-smoothing and over-parameterization issues. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed model.
    Faithful Explanations for Deep Graph Models. (arXiv:2205.11850v1 [cs.LG])
    This paper studies faithful explanations for Graph Neural Networks (GNNs). First, we provide a new and general method for formally characterizing the faithfulness of explanations for GNNs. It applies to existing explanation methods, including feature attributions and subgraph explanations. Second, our analytical and empirical results demonstrate that feature attribution methods cannot capture the nonlinear effect of edge features, while existing subgraph explanation methods are not faithful. Third, we introduce k-hop Explanation with a Convolutional Core (KEC), a new explanation method that provably maximizes faithfulness to the original GNN by leveraging information about the graph structure in its adjacency matrix and its k-th power. Lastly, our empirical results over both synthetic and real-world datasets for classification and anomaly detection tasks with GNNs demonstrate the effectiveness of our approach.
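    As background for the reference to the adjacency matrix and its k-th power, the sketch below illustrates the standard fact that entry (i, j) of A^k counts the walks of length k from node i to node j. It does not implement KEC; the helper names and the four-node path graph are illustrative assumptions.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two integer matrices."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matrix_power(a: Matrix, k: int) -> Matrix:
    """Compute a^k by repeated multiplication, starting from the identity."""
    n = len(a)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, a)
    return result


# Illustrative undirected path graph: 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
A2 = matrix_power(A, 2)
A3 = matrix_power(A, 3)
```

    Here `A2[0][3]` is zero because node 3 cannot be reached from node 0 in exactly two steps, while `A3[0][3]` is one, corresponding to the single walk 0, 1, 2, 3.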
    Quadratic models for understanding neural network dynamics. (arXiv:2205.11787v1 [cs.LG])
    In this work, we propose using a quadratic model as a tool for understanding properties of wide neural networks in both optimization and generalization. We show analytically that certain deep learning phenomena such as the "catapult phase" from [Lewkowycz et al. 2020], which cannot be captured by linear models, are manifested in the quadratic model for shallow ReLU networks. Furthermore, our empirical results indicate that the behaviour of quadratic models parallels that of neural networks in generalization, especially in the large learning rate regime. We expect that quadratic models will serve as a useful tool for analysis of neural networks.
    Concurrent Credit Assignment for Data-efficient Reinforcement Learning. (arXiv:2205.12020v1 [cs.LG])
    The capability to widely sample the state and action spaces is a key ingredient toward building effective reinforcement learning algorithms. The variational optimization principles presented in this paper emphasize the importance of an occupancy model that synthesizes the general distribution of the agent's environmental states over which it can act (defining a virtual "territory"). The occupancy model is updated frequently as the exploration progresses and new states are disclosed during the course of training. Under a uniform prior assumption, the resulting objective expresses a balance between two concurrent tendencies, namely the widening of the occupancy space and the maximization of the rewards, reminiscent of the classical exploration/exploitation trade-off. Implemented in an off-policy actor-critic algorithm on classic continuous-action benchmarks, it is shown to provide a significant increase in sampling efficacy, reflected in reduced training time and higher returns, in both the dense and the sparse reward cases.
    Quantum Kerr Learning. (arXiv:2205.12004v1 [quant-ph])
    Quantum machine learning is a rapidly evolving area that could facilitate important applications for quantum computing and significantly impact data science. In our work, we argue, on grounds drawn from complexity theory and physics, that a single Kerr mode might provide some extra quantum enhancements when using quantum kernel methods. Furthermore, we establish an experimental protocol based on circuit QED, which we call quantum Kerr learning. A detailed study using the kernel method, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations shows that quantum enhancements could occur in terms of the convergence time and the generalization error, while explicit protocols are also constructed for higher-dimensional input data.
    Pynblint: a Static Analyzer for Python Jupyter Notebooks. (arXiv:2205.11934v1 [cs.SE])
    Jupyter Notebook is the tool of choice of many data scientists in the early stages of ML workflows. The notebook format, however, has been criticized for inducing bad programming practices; indeed, researchers have already shown that open-source repositories are inundated by poor-quality notebooks. Low-quality output from the prototypical stages of ML workflows constitutes a clear bottleneck towards the productization of ML models. To foster the creation of better notebooks, we developed Pynblint, a static analyzer for Jupyter notebooks written in Python. The tool checks the compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices and provides targeted recommendations when violations are detected.
    Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture. (arXiv:2205.11786v1 [cs.LG])
    In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
    Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free. (arXiv:2205.11819v1 [cs.LG])
    Trojan attacks threaten deep neural networks (DNNs) by poisoning them to behave normally on most samples, yet to produce manipulated results for inputs attached with a particular trigger. Several works attempt to detect whether a given DNN has been injected with a specific trigger during training. In a parallel line of research, the lottery ticket hypothesis reveals the existence of sparse subnetworks which are capable of reaching performance competitive with the dense network after independent training. Connecting these two dots, we investigate the problem of Trojan DNN detection through the brand-new lens of sparsity, even when no clean training data is available. Our crucial observation is that the Trojan features are significantly more stable to network pruning than benign features. Leveraging that, we propose a novel Trojan network detection regime: first locating a "winning Trojan lottery ticket" which preserves nearly full Trojan information yet only chance-level performance on clean inputs; then recovering the trigger embedded in this already isolated subnetwork. Extensive experiments on various datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, with different network architectures, i.e., VGG-16, ResNet-18, ResNet-20s, and DenseNet-100, demonstrate the effectiveness of our proposal. Code is available at https://github.com/VITA-Group/Backdoor-LTH.
    Functional Network: A Novel Framework for Interpretability of Deep Neural Networks. (arXiv:2205.11702v1 [cs.LG])
    The layered structure of deep neural networks hinders the use of numerous analysis tools and thus the development of its interpretability. Inspired by the success of functional brain networks, we propose a novel framework for interpretability of deep neural networks, that is, the functional network. We construct the functional network of fully connected networks and explore its small-worldness. In our experiments, the mechanisms of regularization methods, namely, batch normalization and dropout, are revealed using graph theoretical analysis and topological data analysis. Our empirical analysis shows the following: (1) Batch normalization enhances model performance by increasing the global efficiency and the number of loops but reduces adversarial robustness by lowering the fault tolerance. (2) Dropout improves generalization and robustness of models by improving the functional specialization and fault tolerance. (3) The models with different regularizations can be clustered correctly according to their functional topological differences, reflecting the great potential of the functional network and topological data analysis in interpretability.
    Soft-SVM Regression For Binary Classification. (arXiv:2205.11735v1 [stat.ML])
    The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.
    RCC-GAN: Regularized Compound Conditional GAN for Large-Scale Tabular Data Synthesis. (arXiv:2205.11693v1 [cs.LG])
    This paper introduces a novel generative adversarial network (GAN) for synthesizing large-scale tabular databases which contain various feature types, such as continuous, discrete, and binary. Technically, our GAN belongs to the category of class-conditioned generative models with a predefined conditional vector. However, we propose a new formulation for deriving such a vector incorporating both binary and discrete features simultaneously. We refer to this novel definition as the compound conditional vector and employ it for training the generator network. The core architecture of this network is a three-layered deep residual neural network with skip connections. To improve the stability of such a complex architecture, we present a regularization scheme that limits unprecedented variations in its weight vectors during training. This regularization approach is quite compatible with the nature of adversarial training and is not computationally prohibitive at runtime. Furthermore, we constantly monitor the variation of the weight vectors to identify any potential instabilities or irregularities and thereby measure the strength of our proposed regularizer. Toward this end, we also develop a new metric for tracking sudden perturbations of the weight vectors using singular value decomposition theory. Finally, we evaluate the performance of our proposed synthesis approach on six benchmarking tabular databases, namely Adult, Census, HCDR, Cabs, News, and King. The achieved results corroborate that for the majority of the cases, our proposed RccGAN outperforms other conventional and modern generative models in terms of accuracy, stability, and reliability.
    Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees. (arXiv:2205.11765v1 [cs.LG])
    We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.
    Towards a Defense against Backdoor Attacks in Continual Federated Learning. (arXiv:2205.11736v1 [cs.LG])
    Backdoor attacks are a major concern in federated learning (FL) pipelines where training data is sourced from untrusted clients over long periods of time (i.e., continual learning). Preventing such attacks is difficult because defenders in FL do not have access to raw training data. Moreover, in a phenomenon we call backdoor leakage, models trained continuously eventually suffer from backdoors due to cumulative errors in backdoor defense mechanisms. We propose a novel framework for defending against backdoor attacks in the federated continual learning setting. Our framework trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines recent ideas from robust covariance estimation-based filters with early-stopping to control the attack success rate even as the data distribution changes. We provide theoretical motivation for this design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.
    MOSPAT: AutoML based Model Selection and Parameter Tuning for Time Series Anomaly Detection. (arXiv:2205.11755v1 [cs.LG])
    Organizations leverage anomaly and changepoint detection algorithms to detect changes in user behavior or service availability and performance. Many off-the-shelf detection algorithms, though effective, cannot readily be used in large organizations where thousands of users monitor millions of use cases and metrics with varied time series characteristics and anomaly patterns. The selection of algorithm and parameters needs to be precise for each use case: manual tuning does not scale, and automated tuning requires ground truth, which is rarely available. In this paper, we explore MOSPAT, an end-to-end automated machine learning based approach for model and parameter selection, combined with a generative model to produce labeled data. Our scalable end-to-end system allows individual users in large organizations to tailor time-series monitoring to their specific use case and data characteristics, without expert knowledge of anomaly detection algorithms or laborious manual labeling. Our extensive experiments on real and synthetic data demonstrate that this method consistently outperforms using any single algorithm.
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold. (arXiv:2205.11677v1 [stat.ML])
    The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation disappears completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to research on stochastic models of networks and semidefinite programming.
    PCA-Boosted Autoencoders for Nonlinear Dimensionality Reduction in Low Data Regimes. (arXiv:2205.11673v1 [cs.LG])
    Autoencoders (AE) provide a useful method for nonlinear dimensionality reduction but are ill-suited for low data regimes. Conversely, Principal Component Analysis (PCA) is data-efficient but is limited to linear dimensionality reduction, posing a problem when data exhibits inherent nonlinearity. This presents a challenge in various scientific and engineering domains such as nanophotonic component design, where data exhibits nonlinear features while being expensive to obtain due to costly real measurements or resource-consuming solutions of partial differential equations. To address this difficulty, we propose a technique that harnesses the best of both worlds: an autoencoder that leverages PCA to perform well on scarce nonlinear data. Specifically, we outline a numerically robust PCA-based initialization of AE, which, together with the parameterized ReLU activation function, allows the training process to start from an exact PCA solution and improve upon it. A synthetic example is presented first to study the effects of data nonlinearity and size on the performance of the proposed method. We then evaluate our method on several nanophotonic component design problems where obtaining useful data is expensive. To demonstrate universality, we also apply it to tasks in other scientific domains: a benchmark breast cancer dataset and a gene expression dataset. We show that our proposed approach is substantially better than both PCA and randomly initialized AE in the majority of low-data regime cases we consider, or is at least comparable to the better of the other two methods.
    Machine Learning for Electricity Market Clearing. (arXiv:2205.11641v1 [eess.SY])
    This paper seeks to design a machine learning twin of the optimal power flow (OPF) optimization, which is used in market-clearing procedures by wholesale electricity markets. The motivation for the proposed approach stems from the need to obtain a digital twin that is much faster than the original, while also being sufficiently accurate and producing consistent generation dispatches and locational marginal prices (LMPs), which are primal and dual solutions of the OPF optimization, respectively. Availability of market-clearing tools based on this approach will enable computationally tractable evaluation of multiple dispatch scenarios under a given unit commitment. Rather than solving the OPF directly, the Karush-Kuhn-Tucker (KKT) conditions for the OPF problem in question may be written, and in parallel the LMPs of generators and loads may be expressed in terms of the OPF Lagrangian multipliers. Also, taking advantage of the practical fact that many of the Lagrangian multipliers associated with lines will be zero (thermal limits are not binding), we build and train an ML scheme which maps flexible resources (loads and renewables) to the binding lines, and supplement it with an efficient power-grid aware linear map to optimal dispatch and LMPs. The scheme is validated and illustrated on IEEE models. We also report a trade-off analysis between the quality of the reconstruction and the number of samples needed to train the model.
    FedSA: Accelerating Intrusion Detection in Collaborative Environments with Federated Simulated Annealing. (arXiv:2205.11519v1 [cs.CR])
    Fast identification of new network attack patterns is crucial for improving network security. Nevertheless, identifying an ongoing attack in a heterogeneous network is a non-trivial task. Federated learning emerges as a solution to collaborative training for an Intrusion Detection System (IDS). The federated learning-based IDS trains a global model using local machine learning models provided by federated participants without sharing local data. However, optimization challenges are intrinsic to federated learning. This paper proposes the Federated Simulated Annealing (FedSA) metaheuristic to select the hyperparameters and a subset of participants for each aggregation round in federated learning. FedSA optimizes hyperparameters linked to the global model convergence. The proposal reduces aggregation rounds and speeds up convergence. Thus, FedSA accelerates learning extraction from local models, requiring fewer IDS updates. The evaluation shows that the FedSA global model converges in fewer than ten communication rounds. The proposal requires up to 50% fewer aggregation rounds than the conventional aggregation approach to achieve approximately 97% accuracy in attack detection.
    Forecasting of Non-Stationary Sales Time Series Using Deep Learning. (arXiv:2205.11636v1 [cs.LG])
    The paper describes a deep learning approach for forecasting non-stationary time series using time trend correction in a neural network model. Along with the layers for predicting sales values, the neural network model includes a subnetwork block that predicts the weight of a time trend term, which is added to the predicted sales value. The time trend term is the product of the predicted weight value and the normalized time value. The results show that the forecasting accuracy can be substantially improved for non-stationary sales with time trends using the trend correction block in the deep learning model.
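    The trend correction described above can be sketched in a few lines. This is only an illustration of the arithmetic, assuming min-max normalization of time; the function names are hypothetical, and the base prediction and trend weight, which the paper obtains from network outputs, are passed in here as plain numbers.

```python
def normalize_time(t, t_min, t_max):
    """Map a time index to [0, 1] (min-max normalization is an assumption)."""
    return (t - t_min) / (t_max - t_min)


def trend_corrected_forecast(base_prediction, trend_weight, t, t_min, t_max):
    """Add the time trend term (weight * normalized time) to the base prediction.

    base_prediction and trend_weight stand in for the outputs of the sales
    layers and the trend subnetwork block, respectively.
    """
    trend_term = trend_weight * normalize_time(t, t_min, t_max)
    return base_prediction + trend_term


# Halfway through the time range, half of the trend weight is added.
forecast = trend_corrected_forecast(100.0, 20.0, t=5, t_min=0, t_max=10)
```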
    An interpretation of the final fully connected layer. (arXiv:2205.11908v1 [cs.LG])
    In recent years neural networks have achieved state-of-the-art accuracy for various tasks, but the interpretation of the generated outputs still remains difficult. In this work we attempt to provide a method to understand the learnt weights in the final fully connected layer in image classification models. We motivate our method by drawing a connection between the policy gradient objective in RL and the supervised learning objective. We suggest that the commonly used cross-entropy-based supervised learning objective can be regarded as a special case of the policy gradient objective. Using this insight we propose a method to find the most discriminative and confusing parts of an image. Our method does not make any prior assumption about neural network architecture and has low computational cost. We apply our method on publicly available pre-trained models and report the generated results.
    Identifying (anti-)skyrmions while they form. (arXiv:2205.11535v1 [cond-mat.str-el])
    We use a Convolutional Neural Network (CNN) to identify the relevant features in the thermodynamical phases of a simulated three-dimensional spin-lattice system with ferromagnetic and Dzyaloshinskii-Moriya (DM) interactions. Such features include (anti-)skyrmions, merons, and helical and ferromagnetic states. We use a multi-label classification framework, which is flexible enough to accommodate states that mix different features and phases. We then train the CNN to predict the features of the final state from snapshots of intermediate states of the simulation. The trained model allows identifying the different phases reliably and early in the formation process. Thus, the CNN can significantly speed up the phase diagram calculations by predicting the final phase before the spin-lattice Monte Carlo sampling has converged. We show the prowess of this approach by generating phase diagrams with significantly shorter simulation times.
    Learning Context-Aware Service Representation for Service Recommendation in Workflow Composition. (arXiv:2205.11771v1 [cs.SE])
    As increasingly more software services have been published onto the Internet, it remains a significant challenge to recommend suitable services to facilitate scientific workflow composition. This paper proposes a novel NLP-inspired approach to recommending services throughout a workflow development process, based on incrementally learning latent service representation from workflow provenance. A workflow composition process is formalized as a step-wise, context-aware service generation procedure, which is mapped to next-word prediction in a natural language sentence. Historical service dependencies are extracted from workflow provenance to build and enrich a knowledge graph. Each path in the knowledge graph reflects a scenario in a data analytics experiment, which is analogous to a sentence in a conversation. All paths are thus formalized as composable service sequences and are mined, using various patterns, from the established knowledge graph to construct a corpus. Service embeddings are then learned by applying deep learning model from the NLP field. Extensive experiments on the real-world dataset demonstrate the effectiveness and efficiency of the approach.
    G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. (arXiv:2205.11796v1 [cs.CV])
    Arbitrary-oriented object representations contain the oriented bounding box (OBB), quadrilateral bounding box (QBB), and point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as the boundary discontinuity, square-like problem, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of breaking this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep to construct Gaussian distributions for OBB, QBB, and PointSet, which achieves a unified solution to various representations and problems. Specifically, PointSet or QBB-based objects are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results on several publicly available datasets (DOTA, HRSC2016, UCAS-AOD, and ICDAR2015) show the excellent performance of the proposed method for arbitrary-oriented object detection. The code has been open sourced at https://github.com/open-mmlab/mmrotate.
    Identifying Patient-Specific Root Causes of Disease. (arXiv:2205.11627v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.
    Forecasting Multilinear Data via Transform-Based Tensor Autoregression. (arXiv:2205.12201v1 [cs.LG])
    In the era of big data, there is an increasing demand for new methods for analyzing and forecasting 2-dimensional data. The current research aims to accomplish these goals through the combination of time-series modeling and multilinear algebraic systems. We expand previous autoregressive techniques to forecast multilinear data, aptly named the L-Transform Tensor autoregressive (L-TAR for short). Tensor decompositions and multilinear tensor products have allowed for this approach to be a feasible method of forecasting. We achieve statistical independence between the columns of the observations through invertible discrete linear transforms, enabling a divide and conquer approach. We present an experimental validation of the proposed methods on datasets containing image collections, video sequences, sea surface temperature measurements, stock prices, and networks.
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v2 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. (arXiv:2202.11559v2 [physics.comp-ph] UPDATED)
    Parameter reconstructions are indispensable in metrology. Here, the objective is to explain $K$ experimental measurements by fitting to them a parameterized model of the measurement process. The model parameters are regularly determined by least-squares methods, i.e., by minimizing the sum of the squared residuals between the $K$ model predictions and the $K$ experimental observations, $\chi^2$. The model functions often involve computationally demanding numerical simulations. Bayesian optimization methods are specifically suited for minimizing expensive model functions. However, in contrast to least-squares methods such as the Levenberg-Marquardt algorithm, they only take the value of $\chi^2$ into account, and neglect the $K$ individual model outputs. We present a Bayesian target-vector optimization scheme with improved performance over previous developments that considers all $K$ contributions of the model function and that is specifically suited for parameter reconstruction problems, which are often based on hundreds of observations. Its performance is compared to established methods for an optical metrology reconstruction problem and two synthetic least-squares problems. The proposed method outperforms established optimization methods. It also makes it possible to determine accurate uncertainty estimates with very few observations of the actual model function by using Markov chain Monte Carlo sampling on a trained surrogate model.
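    The distinction between the scalar $\chi^2$ and the $K$ individual residuals can be made concrete with a small sketch. The toy linear model, the data, and all function names below are illustrative assumptions, not the paper's setup; the sketch uses the unweighted sum of squares stated above.

```python
def residuals(model, params, inputs, observations):
    """Return the K individual residuals (model prediction minus observation)."""
    return [model(params, x) - y for x, y in zip(inputs, observations)]


def chi_squared(res):
    """Collapse the K residuals into the single scalar sum of squares."""
    return sum(r * r for r in res)


def toy_model(params, x):
    """Hypothetical one-parameter model standing in for an expensive simulation."""
    return params * x


inputs = [1.0, 2.0, 3.0]
observations = [2.0, 4.0, 7.0]

res = residuals(toy_model, 2.0, inputs, observations)
chi2 = chi_squared(res)
```

    A scalar optimizer only ever sees `chi2`, whereas a method working on the target vector has access to every entry of `res`, including which observation is responsible for the misfit.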
    SepIt Approaching a Single Channel Speech Separation Bound. (arXiv:2205.11801v1 [eess.AS])
    We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while the recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the different speakers' estimation. At test time, SepIt has a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
    One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks. (arXiv:2205.12141v1 [cs.LG])
    Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Error-minimizing noise, which is injected to clean data, is one of the most successful methods for preventing DNNs from giving correct predictions on incoming new data. Nonetheless, under specific training strategies such as adversarial training, the unlearnability of error-minimizing noise will severely degrade. In addition, the transferability of error-minimizing noise is inherently limited by the mismatch between the generator model and the targeted learner model. In this paper, we investigate the mechanism of unlearnable examples and propose a novel model-free method, named \emph{One-Pixel Shortcut}, which only perturbs a single pixel of each image and makes the dataset unlearnable. Our method needs much less computational cost and obtains stronger transferability and thus can protect data from a wide range of different models. Based on this, we further introduce the first unlearnable dataset called CIFAR-10-S, which is indistinguishable from normal CIFAR-10 by human observers and can serve as a benchmark for different models or training strategies to evaluate their abilities to extract critical features from the disturbance of non-semantic representations. The original error-minimizing ULEs will lose efficiency under adversarial training, where the model can get over 83\% clean test accuracy. Meanwhile, even if adversarial training and strong data augmentation like RandAugment are applied together, the model trained on CIFAR-10-S cannot get over 50\% clean test accuracy.
    Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey. (arXiv:2101.00734v2 [stat.ML] UPDATED)
    This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Because of their stochastic and generative behaviour, these models can also be used for generation of new data points in the data space. In this paper, we first start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, VAE is explained, where the encoder, decoder and sampling from the latent space are introduced. Training VAE using both EM and backpropagation is explained.
    On the identifiability of mixtures of ranking models. (arXiv:2201.13132v2 [cs.LG] UPDATED)
    Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (BTL, multinomial logistic models with slates of size 3, or Plackett-Luce) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models to establish generic identifiability. The framework can be applied more broadly to other learning models and may be of independent interest.
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v2 [cs.LG] UPDATED)
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
    Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable. (arXiv:2205.11716v1 [cs.LG])
    Recently, neural networks have been shown to perform exceptionally well in transforming two arbitrary sets into two linearly separable sets. Doing this with a randomly initialized neural network is of immense interest because the associated computation is cheaper than using fully trained networks. In this paper, we show that, with sufficient width, a randomly initialized one-layer neural network transforms two sets into two linearly separable sets with high probability. Furthermore, we provide explicit bounds on the required width of the neural network for this to occur. Our first bound is exponential in the input dimension and polynomial in all other parameters, while our second bound is independent of the input dimension, thereby overcoming the curse of dimensionality. We also perform an experimental study comparing the separation capacity of randomly initialized one-layer and two-layer neural networks. With correctly chosen biases, our study shows for low-dimensional data, the two-layer neural network outperforms the one-layer network. However, the opposite is observed for higher-dimensional data.
    Weak Convergence of Approximate reflection coupling and its Application to Non-convex Optimization. (arXiv:2205.11970v1 [math.PR])
    In this paper, we propose a weak approximation of the reflection coupling (RC) for stochastic differential equations (SDEs), and prove it converges weakly to the desired coupling. In contrast to the RC, the proposed approximate reflection coupling (ARC) need not take the hitting time of processes to the diagonal set into consideration and can be defined as the solution of some SDEs on the whole time interval. Therefore, ARC can work effectively against SDEs with different drift terms. As an application of ARC, an evaluation on the effectiveness of the stochastic gradient descent in a non-convex setting is also described. For the sample size $n$, the step size $\eta$, and the batch size $B$, we derive uniform evaluations on the time with orders $n^{-1}$, $\eta^{1/2}$, and $\sqrt{(n - B) / B (n - 1)}$, respectively.
    Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs. (arXiv:2205.11894v1 [cs.LG])
    We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. We introduce a new model that decomposes independent dynamics of single objects accurately from their interactions. By employing latent Gaussian process ordinary differential equations, our model infers both independent dynamics and their interactions with reliable uncertainty estimates. In our formulation, each object is represented as a graph node and interactions are modeled by accumulating the messages coming from neighboring objects. We show that efficient inference of such a complex network of variables is possible with modern variational sparse Gaussian process inference techniques. We empirically demonstrate that our model improves the reliability of long-term predictions over neural network based alternatives and it successfully handles missing dynamic or static information. Furthermore, we observe that only our model can successfully encapsulate independent dynamics and interaction information in distinct functions and show the benefit from this disentanglement in extrapolation scenarios.
    History Compression via Language Models in Reinforcement Learning. (arXiv:2205.12258v1 [cs.LG])
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training of the Transformer, we introduce FrozenHopfield, which automatically associates observations with original token embeddings. To form these associations, a modern Hopfield network stores the original token embeddings, which are retrieved by queries that are obtained by a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v5 [cs.LG] UPDATED)
    The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. (arXiv:2007.10731v2 [stat.CO] UPDATED)
    A novel multi-task Gaussian process (GP) framework is proposed, by using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective to improve multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is derived for handling both hyper-parameters optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performances, even far from observations, and may reduce significantly the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
    Quasi-Equivalence of Width and Depth of Neural Networks. (arXiv:2002.02515v7 [cs.LG] UPDATED)
    While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks in two aspects. First, we formulate two transforms for mapping an arbitrary ReLU network to a wide network and a deep network respectively for either regression or classification so that the essentially same capability of the original network can be implemented. Then, we replace the mainstream artificial neuron type with a quadratic counterpart, and utilize the factorization and continued fraction representations of the same polynomial function to construct a wide network and a deep network, respectively. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
    Nonnegative Tensor Completion via Integer Optimization. (arXiv:2111.04580v2 [cs.LG] UPDATED)
    Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one-hundred million entries.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v2 [cs.LG] UPDATED)
    This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v3 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
    Logarithmic regret bounds for continuous-time average-reward Markov decision processes. (arXiv:2205.11168v2 [cs.LG] UPDATED)
    We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
    Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits. (arXiv:2002.00287v3 [cs.LG] UPDATED)
    We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm is allowed to change without restriction over time. Under the assumption that the $d$-dimensional contexts are generated i.i.d.~at random from a known distribution, we develop computationally efficient algorithms based on the classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches the best available bound for this problem. Our second algorithm, RobustLinExp3, is shown to be robust to misspecification, in that it achieves a regret bound of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true reward function is linear up to an additive nonlinear error uniformly bounded in absolute value by $\varepsilon$. To our knowledge, our performance guarantees constitute the very first results on this problem setting.
    Not too little, not too much: a theoretical analysis of graph (over)smoothing. (arXiv:2205.12156v1 [stat.ML])
    We analyze graph smoothing with \emph{mean aggregation}, where each node successively receives the average of the features of its neighbors. Indeed, it has quickly been observed that Graph Neural Networks (GNNs), which generally follow some variant of Message-Passing (MP) with repeated aggregation, may be subject to the \emph{oversmoothing} phenomenon: by performing too many rounds of MP, the node features tend to converge to a non-informative limit. In the case of mean aggregation, for connected graphs, the node features become constant across the whole graph. At the other end of the spectrum, it is intuitively obvious that \emph{some} MP rounds are necessary, but existing analyses do not exhibit both phenomena at once: beneficial ``finite'' smoothing and oversmoothing in the limit. In this paper, we consider simplified linear GNNs, and rigorously analyze two examples for which a finite number of mean aggregation steps provably improves the learning performance, before oversmoothing kicks in. We consider a latent space random graph model, where node features are partial observations of the latent variables and the graph contains pairwise relationships between them. We show that graph smoothing restores some of the lost information, up to a certain point, by two phenomena: graph smoothing shrinks non-principal directions in the data faster than principal ones, which is useful for regression, and shrinks nodes within communities faster than they collapse together, which improves classification.
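    The oversmoothing limit for mean aggregation is easy to observe numerically. The sketch below is a toy illustration, not the paper's latent space model: it assumes a 4-node path graph with scalar features, and it adds self-loops to each neighborhood (an assumption made here so that the iteration converges instead of oscillating). All names are hypothetical.

```python
# Neighborhoods of a 4-node path graph 0-1-2-3, each including the node itself.
NEIGHBORS = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]


def mean_aggregate(features, neighbors):
    """One message-passing round: replace each feature by its neighborhood mean."""
    return [sum(features[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]


def smooth(features, neighbors, rounds):
    """Apply mean aggregation repeatedly."""
    for _ in range(rounds):
        features = mean_aggregate(features, neighbors)
    return features


def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


initial = [0.0, 0.0, 1.0, 1.0]
after_two = smooth(initial, NEIGHBORS, 2)
after_many = smooth(initial, NEIGHBORS, 100)
```

    After a couple of rounds the features are smoother but still distinguish the two ends of the path; after many rounds `spread(after_many)` is numerically zero, so the features no longer carry any information about the nodes.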
    Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning. (arXiv:2205.12184v1 [cs.LG])
    Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for It\^o diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
    Stereographic Markov Chain Monte Carlo. (arXiv:2205.12112v1 [stat.CO])
    High dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves results in empirically observed "stickiness" and poor theoretical mixing properties -- lack of geometric ergodicity. In this paper, we introduce a new class of MCMC samplers that map the original high dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis type algorithms as well as versions of Bouncy Particle Sampler that are uniformly ergodic for a large class of light and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers can enjoy the ``blessings of dimensionality'', in that the mixing time decreases with dimension.
    Dimension-agnostic inference using cross U-statistics. (arXiv:2011.05068v4 [math.ST] UPDATED)
    Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference: developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for a handful of classical problems including one-sample mean and covariance testing. Our tests are shown to have minimax rate-optimal power against appropriate local alternatives, and their power is optimal up to a $\sqrt 2$ factor. We end by suggesting some next steps for extending dimension-agnostic inference to other problems.
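    To make "sample splitting and self-normalization" concrete for the one-sample mean problem, here is a minimal sketch. The construction is an assumption based on the abstract's description, not a transcription of the paper: the second half of the sample supplies a direction, the first half is projected onto it, and the projections are studentized. The function name and the simulated data are hypothetical.

```python
import math
import random
import statistics


def split_mean_statistic(samples):
    """Studentized statistic for H0: mean = 0, built from two halves of the sample.

    Assumed construction: direction from the second half, projections of the
    first half, then sqrt(n1) * mean / standard deviation of the projections.
    """
    half = len(samples) // 2
    first, second = samples[:half], samples[half:]
    dim = len(samples[0])
    # Direction estimated from the second half only (its coordinate-wise mean).
    direction = [sum(x[k] for x in second) / len(second) for k in range(dim)]
    # One scalar per first-half point: its inner product with that direction.
    projections = [sum(a * b for a, b in zip(x, direction)) for x in first]
    return math.sqrt(half) * statistics.mean(projections) / statistics.stdev(projections)


rng = random.Random(0)
n, d = 100, 20  # the sample size and dimension used as an example in the abstract
null_data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
shifted_data = [[rng.gauss(1.0, 1.0) for _ in range(d)] for _ in range(n)]

t_null = split_mean_statistic(null_data)
t_alt = split_mean_statistic(shifted_data)
print("statistic under zero mean:", t_null)
print("statistic under shifted mean:", t_alt)
```

    Because the direction and the projected points come from disjoint halves, the statistic is a studentized mean of scalars that are independent given the direction, so it can be compared against standard normal quantiles without choosing between an $n \gg d$ regime and a proportional one.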
    Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold. (arXiv:2205.11677v1 [stat.ML])
    The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation will disappear completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to stochastic models of networks and semidefinite programming research.  ( 2 min )
    Eigenvalue and Generalized Eigenvalue Problems: Tutorial. (arXiv:1903.11240v2 [stat.ML] UPDATED)
    This paper is a tutorial for eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we mention the optimization problems which lead to the eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.  ( 2 min )
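    As a small worked instance of the generalized eigenvalue problem $A v = \lambda B v$, the sketch below solves the 2x2 symmetric case by expanding $\det(A - \lambda B) = 0$ into a quadratic. It is not taken from the tutorial; the function name and the matrices are hypothetical, and a real application would use a linear algebra library rather than this closed form.

```python
import math


def generalized_eig_2x2(A, B):
    """Solve A v = lam * B v for symmetric 2x2 A and symmetric positive-definite 2x2 B."""
    (a11, a12), (_, a22) = A
    (b11, b12), (_, b22) = B
    # det(A - lam * B) = 0 expands to qa * lam^2 + qb * lam + qc = 0.
    qa = b11 * b22 - b12 ** 2
    qb = -(a11 * b22 + a22 * b11 - 2 * a12 * b12)
    qc = a11 * a22 - a12 ** 2
    root = math.sqrt(qb ** 2 - 4 * qa * qc)
    pairs = []
    for lam in ((-qb - root) / (2 * qa), (-qb + root) / (2 * qa)):
        # Read an eigenvector off the first row of (A - lam * B) v = 0.
        # This assumes that row is not identically zero.
        v = (-(a12 - lam * b12), a11 - lam * b11)
        pairs.append((lam, v))
    return pairs


A = [[2.0, 1.0], [1.0, 2.0]]
B = [[2.0, 0.0], [0.0, 1.0]]
pairs = generalized_eig_2x2(A, B)
for lam, v in pairs:
    print("eigenvalue", lam, "eigenvector", v)
```

    Setting B to the identity recovers the ordinary eigenvalue problem, for which the same code returns the eigenvalues 1 and 3 of A.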
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v2 [stat.ML] UPDATED)
    We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions, we use Hamiltonian Monte Carlo sampling. We demonstrate our approach in a simulation study with hemodynamical models, which are time-dependent differential equations. Data are simulated from a more complex model than our modelling choice, and the aim is to learn physical parameters according to known mathematical connections. To demonstrate the flexibility of our approach, we also include an example using the heat equation, a space-time dependent differential equation, in which we consider a biased data-acquisition process. Finally, we fit the hemodynamic model using real data obtained in a medical trial.  ( 2 min )
    Risk-Sensitive Reinforcement Learning via Policy Gradient Search. (arXiv:1810.09126v3 [cs.LG] UPDATED)
    The objective in a traditional reinforcement learning (RL) problem is to find a policy that optimizes the expected value of a performance metric such as the infinite-horizon cumulative discounted or long-run average cost/reward. In practice, optimizing the expected value alone may not be satisfactory, in that it may be desirable to incorporate the notion of risk into the optimization problem formulation, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., exponential utility, variance, percentile performance, chance constraints, value at risk (quantile), conditional value-at-risk, prospect theory and its later enhancement, cumulative prospect theory. In this book, we consider risk-sensitive RL in two settings: one where the goal is to find a policy that optimizes the usual expected value objective while ensuring that a risk constraint is satisfied, and the other where the risk measure is the objective. We survey some of the recent work in this area specifically where policy gradient search is the solution approach. In the first risk-sensitive RL setting, we cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for policy gradient-based risk-sensitive RL algorithms using a Lagrangian formulation. For the setting where risk is incorporated directly into the objective function, we consider an exponential utility formulation, cumulative prospect theory, and coherent risk measures. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems using policy gradient methods, as well as outlining some potential future research directions.  ( 2 min )
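    Of the risk measures listed above, conditional value-at-risk is easy to illustrate on sampled costs. The sketch below uses one common empirical convention, the average of the worst $(1-\alpha)$ fraction of samples, which is an assumption here and not necessarily the estimator used in the survey. The function name and cost values are hypothetical.

```python
import math


def empirical_cvar(costs, alpha):
    """Average of the worst (1 - alpha) fraction of sampled costs (higher cost = worse)."""
    ordered = sorted(costs)
    n = len(ordered)
    # Number of samples in the tail; rounding first guards against float noise
    # such as (1 - 0.8) * 10 evaluating to 1.9999999999999996.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    tail = ordered[n - tail_size:]
    return sum(tail) / len(tail)


episode_costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
expected_cost = sum(episode_costs) / len(episode_costs)
cvar_80 = empirical_cvar(episode_costs, alpha=0.8)
print("expected cost:", expected_cost)
print("CVaR at level 0.8:", cvar_80)
```

    Two policies with the same expected cost can differ sharply in this tail average, which is why it can serve either as the constraint or as the objective in the two settings the abstract distinguishes.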
    Optimality Conditions and Algorithms for Top-K Arm Identification. (arXiv:2205.12086v1 [stat.ML])
    We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling efforts. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence and fixed-budget settings, and the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality. We provide the first provably optimal algorithm in the fixed-confidence setting for k>1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared to existing methods in all three settings.  ( 2 min )
    DeepKriging: Spatially Dependent Deep Neural Networks for Spatial Prediction. (arXiv:2007.11972v4 [stat.ML] UPDATED)
    In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions and is often associated with Gaussian processes. However, when considering non-linear prediction for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where the spatial dependence is captured by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and it has multiple advantages over Kriging for non-Gaussian and non-stationary data, i.e., it provides non-linear predictions and thus has smaller approximation errors, it does not require operations on covariance matrices and thus is scalable for large datasets, and with sufficiently many hidden neurons, it provides the optimal prediction in terms of model capacity. We further explore the possibility of quantifying prediction uncertainties based on density prediction without assuming any data distribution. Finally, we apply the method to predicting PM2.5 concentrations across the continental United States.  ( 2 min )
    Quantum Kerr Learning. (arXiv:2205.12004v1 [quant-ph])
    Quantum machine learning is a rapidly evolving area that could facilitate important applications for quantum computing and significantly impact data science. In our work, we argue that a single Kerr mode might provide some extra quantum enhancements when using quantum kernel methods based on various reasons from complexity theory and physics. Furthermore, we establish an experimental protocol, which we call \emph{quantum Kerr learning} based on circuit QED. A detailed study using the kernel method, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations, shows quantum enhancements could happen in terms of the convergence time and the generalization error, while explicit protocols are also constructed for higher-dimensional input data.  ( 2 min )
    Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees. (arXiv:2205.11765v1 [cs.LG])
    We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.  ( 2 min )
    Throwing Away Data Improves Worst-Class Error in Imbalanced Classification. (arXiv:2205.11672v1 [stat.ML])
    Class imbalances pervade classification problems, yet their treatment differs in theory and practice. On the one hand, learning theory instructs us that \emph{more data is better}, as sample size relates inversely to the average test error over the entire data distribution. On the other hand, practitioners have long developed a plethora of tricks to improve the performance of learning machines over imbalanced data. These include data reweighting and subsampling, synthetic construction of additional samples from minority classes, ensembling expensive one-versus-all architectures, and tweaking classification losses and thresholds. All of these are efforts to minimize the worst-class error, which is often associated with the minority group in the training data, and finds additional motivation in the robustness, fairness, and out-of-distribution literatures. Here we take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data when fitted either on (i) the full training set, or (ii) a subset where the majority class is subsampled to match in size the minority class. We borrow tools from extreme value theory to show that, under distributions with certain tail properties, \emph{throwing away most data from the majority class leads to better worst-class error}.  ( 2 min )
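    Option (ii) above, subsampling the majority class to the size of the minority class, can be sketched in a few lines. The helper name, the seed, and the toy data are hypothetical; the paper's contribution is the theory about when this helps, which the code does not demonstrate.

```python
import random
from collections import Counter


def subsample_to_smallest_class(examples, labels, seed=0):
    """Keep a random subset of every class so each class has as many examples as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    smallest = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws without replacement, so no example is duplicated.
        balanced.extend((x, y) for x in rng.sample(xs, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced data: 95 majority examples and 5 minority examples.
examples = list(range(100))
labels = ["majority"] * 95 + ["minority"] * 5
balanced = subsample_to_smallest_class(examples, labels)
print(Counter(y for _, y in balanced))
```

    The balanced set discards 90 of the 95 majority examples, which is exactly the "throwing away data" the title refers to.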
    EBM Life Cycle: MCMC Strategies for Synthesis, Defense, and Density Modeling. (arXiv:2205.12243v1 [stat.ML])
    This work presents strategies to learn an Energy-Based Model (EBM) according to the desired length of its MCMC sampling trajectories. MCMC trajectories of different lengths correspond to models with different purposes. Our experiments cover three different trajectory magnitudes and learning outcomes: 1) shortrun sampling for image generation; 2) midrun sampling for classifier-agnostic adversarial defense; and 3) longrun sampling for principled modeling of image probability densities. To achieve these outcomes, we introduce three novel methods of MCMC initialization for negative samples used in Maximum Likelihood (ML) learning. With standard network architectures and an unaltered ML objective, our MCMC initialization methods alone enable significant performance gains across the three applications that we investigate. Our results include state-of-the-art FID scores for unnormalized image densities on the CIFAR-10 and ImageNet datasets; state-of-the-art adversarial defense on CIFAR-10 among purification methods and the first EBM defense on ImageNet; and scalable techniques for learning valid probability densities. Code for this project can be found at https://github.com/point0bar1/ebm-life-cycle.  ( 2 min )
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v1 [stat.ML])
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models - the Variational Auto-Encoder (VAE). We point out the two generalization gaps that can affect the generalization ability of VAEs and show that the over-fitting phenomenon is usually dominated by the amortized inference network. Based on this observation we propose a new training objective, inspired by the classic wake-sleep algorithm, to improve the generalization properties of amortized inference. We also demonstrate how it can improve generalization performance in the context of image modeling and lossless compression.  ( 2 min )
    Identifying Patient-Specific Root Causes of Disease. (arXiv:2205.11627v1 [stat.ML])
    Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.  ( 2 min )
    Soft-SVM Regression For Binary Classification. (arXiv:2205.11735v1 [stat.ML])
    The binomial deviance and the SVM hinge loss functions are two of the most widely used loss functions in machine learning. While there are many similarities between them, they also have their own strengths when dealing with different types of data. In this work, we introduce a new exponential family based on a convex relaxation of the hinge loss function using softness and class-separation parameters. This new family, denoted Soft-SVM, allows us to prescribe a generalized linear model that effectively bridges between logistic regression and SVM classification. This new model is interpretable and avoids data separability issues, attaining good fitting and predictive performance by automatically adjusting for data label separability via the softness parameter. These results are confirmed empirically through simulations and case studies as we compare regularized logistic, SVM, and Soft-SVM regressions and conclude that the proposed model performs well in terms of both classification and prediction errors.  ( 2 min )
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v6 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures and topologies, and even outperforms supervised methods.  ( 2 min )
    A Quadrature Rule combining Control Variates and Adaptive Importance Sampling. (arXiv:2205.11890v1 [stat.ML])
    Driven by several successful applications such as in stochastic gradient descent or in Bayesian computation, control variates have become a major tool for Monte Carlo integration. However, standard methods do not allow the distribution of the particles to evolve during the algorithm, as is the case in sequential simulation methods. Within the standard adaptive importance sampling framework, a simple weighted least squares approach is proposed to improve the procedure with control variates. The procedure takes the form of a quadrature rule with adapted quadrature weights to reflect the information brought in by the control variates. The quadrature points and weights do not depend on the integrand, a computational advantage in case of multiple integrands. Moreover, the target density needs to be known only up to a multiplicative constant. Our main result is a non-asymptotic bound on the probabilistic error of the procedure. The bound proves that for improving the estimate's accuracy, the benefits from adaptive importance sampling and control variates can be combined. The good behavior of the method is illustrated empirically on synthetic examples and real-world data for Bayesian linear regression.  ( 2 min )
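    For readers unfamiliar with control variates, the sketch below shows the textbook version with plain Monte Carlo sampling, not the adaptive importance sampling quadrature rule proposed in the paper. It estimates $\int_0^1 e^u\,du = e - 1$ using $h(u) = u$, whose mean $1/2$ is known, with the coefficient fitted by least squares. The function name and sample size are hypothetical.

```python
import math
import random


def control_variate_estimate(f_values, h_values, h_mean):
    """Monte Carlo mean of f, corrected by a control variate h whose true mean is known."""
    n = len(f_values)
    f_bar = sum(f_values) / n
    h_bar = sum(h_values) / n
    cov = sum((f - f_bar) * (h - h_bar) for f, h in zip(f_values, h_values))
    var = sum((h - h_bar) ** 2 for h in h_values)
    beta = cov / var  # least-squares slope of f on h
    return f_bar - beta * (h_bar - h_mean)


rng = random.Random(0)
draws = [rng.random() for _ in range(2000)]
f_values = [math.exp(u) for u in draws]

plain = sum(f_values) / len(f_values)
corrected = control_variate_estimate(f_values, draws, h_mean=0.5)
print("target:", math.e - 1)
print("plain Monte Carlo:", plain)
print("with control variate:", corrected)
```

    The correction term only uses how far the sample mean of h is from its known mean, so the same fitted weights could be reused for another integrand, which is the kind of advantage the abstract points to for multiple integrands.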
    Weakly-supervised Multi-output Regression via Correlated Gaussian Processes. (arXiv:2002.08412v2 [stat.ML] UPDATED)
    Multi-output regression seeks to borrow strength and leverage commonalities across different but related outputs in order to enhance learning and prediction accuracy. A fundamental assumption is that the output/group membership labels for all observations are known. This assumption is often violated in real applications. For instance, in healthcare datasets, sensitive attributes such as ethnicity are often missing or unreported. To this end, we introduce a weakly-supervised multi-output model based on dependent Gaussian processes. Our approach is able to leverage data without complete group labels or possibly only prior belief on group memberships to enhance accuracy across all outputs. Through intensive simulations and case studies on an Insulin, Testosterone and Bodyfat dataset, we show that our model excels in multi-output settings with missing labels, while being competitive in traditional fully labeled settings. We end by highlighting the possible use of our approach in fair inference and sequential decision-making.  ( 2 min )
    Advanced Manufacturing Configuration by Sample-efficient Batch Bayesian Optimization. (arXiv:2205.11827v1 [cs.LG])
    We propose a framework for the configuration and operation of expensive-to-evaluate advanced manufacturing methods, based on Bayesian optimization. The framework unifies a tailored acquisition function, a parallel acquisition procedure, and the integration of process information providing context to the optimization procedure. The novel acquisition function is demonstrated and analyzed on benchmark illustrative problems. We apply the optimization approach to atmospheric plasma spraying in simulation and experiments. Our results demonstrate that the proposed framework can efficiently find input parameters that produce the desired outcome and minimize the process cost.  ( 2 min )
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v1 [stat.ML])
    Most machine learning methods depend on the tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation, which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection method based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.  ( 2 min )
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v1 [cs.LG])
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are: (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms; (2) we introduce a multi-task learning based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data, non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.  ( 2 min )
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v1 [stat.ML])
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we discuss the Gaussian case extensively, as its updates have a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior do not involve gradients with respect to the model parameters, nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details about its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.  ( 2 min )

  • Open

    [D] What kind of motherboard do I need to run 2x 3090?
    Hey guys im a designer wanting to dip my toes into machine learning/Ai. 1- Is a mATX motherboard with 1x pciex16 and 1x pciex4 good enough for dual 3090? 2- do I absolutely need to use nvlink between these 3090s? what are the pros and cons? submitted by /u/Aeonbreak [link] [comments]  ( 1 min )
    [D] Google Speech to Text vs Building similar capability in house
    Hey All. I hope you're well. TL;DR I'm trying to compare the costs of building vs using existing technology. This is for those of you who have experience building speech or signal processing ML / AI. I'm looking into building a SaaS feature that requires speech to text. As an MVP, we've been using Google's speech to text API, which is great, with a few problems: its cost and accuracy. While its accuracy is quite high, its cost, if I'm calculating correctly, would be quite high relative to our pricing. There are also benefits in terms of building accurate models for the specific industry we'd be using it in. Does anyone have any examples of what it would take to build something like that (costs / number of engineers) and, more importantly, to operate it (let's say, cost per minute in computing power / storage)? submitted by /u/Mobile_Jacket_894 [link] [comments]  ( 2 min )
    [D] Seeking advice for MLDS beginners
    I have an advanced degree in math, and am familiar with Python (from my college days). When I try tutorials for Sagemaker or Colab, it feels like I make progress, but spend 90% of my time trying to research why something is broken or not working. So, my question - is this the nature of this industry and software platforms at the moment? Are there any other tools/platforms that don't require constant troubleshooting for beginners? Thanks! submitted by /u/evergrowbro [link] [comments]  ( 1 min )
    [D] Training denoising diffusion model when forward process is unknown but we have multiple examples for each step?
    I hope the title at least kinda makes sense? More details below. The way I understand denoising diffusion models on a high level is we simulate the forward process with a known distribution like Gaussian noise and then learn the reverse process. I am wondering if it is reasonable to try training a diffusion model when we have a dataset from which we can sample any step of the forward process, but the actual forward process itself is unknown. Basically we have a training target for each "latent" in the diffusion process. For the sake of simplicity we can think of our dataset as "images" and we can sample any image from the forward process. E.g., we can sample the lowest quality images which are just random noise, medium quality images which contain the information we are interested in but obscured by noise or missing signal, or we can sample the highest quality images which are the final target of the model. We want a model which can map low/medium quality data to denoised high quality data; basically we are doing a super-resolution or image denoising task. We can generate any level of lower quality data we want, but the actual pixel-level difference between steps in the forward process is much more complex than just additive Gaussian noise. I already know it is *technically* feasible to train a diffusion model on this dataset since I am already testing it with varying levels of success. More what I am asking is if this makes any theoretical sense and where this might fit in with current literature on these models. Is it sufficient to just train an image-to-image model like a U-Net to map each sample to the next in the Markov chain since it might not be possible to evaluate KL divergence wrt the forward process? Or is there some way to approximate this as well? How is this any different from just training a stacked denoising autoencoder? Thanks for any insight and/or discussion! submitted by /u/dimsycamore [link] [comments]  ( 2 min )
    [P]Made a community for freelancing in ML field
    Currently, freelancing in AI is still a niche and has a cold start problem, whether you are freelancing on your own or on a platform such as Upwork, Fiverr, etc. Join the community and let's explore the alternative career path. https://www.reddit.com/r/ai_freelancing/ submitted by /u/meame2010 [link] [comments]  ( 1 min )
    [D] How does the accuracy of large language models compare to average humans? How does GPT-NeoX compare?
    I saw a graph comparing GPT-3, UnifiedQA and Gopher to Human Expert in level of accuracy on various fields. https://piped.kavin.rocks/watch?v=aPiHhJjN3hI&t=3m55s I was curious to know how does GPT-NeoX compare. And how do average humans perform? submitted by /u/ConsistentSense4760 [link] [comments]  ( 1 min )
    [P] Official Imagen Website by Google Brain
    https://imagen.research.google/ submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [Discussion] Is missing data still a problem?
    I'm curious to hear peoples' perspectives on this. As I understand it, the easiest and one of the most popular ways of "handling" missing data is to just throw away incomplete observations. Would spending time developing methods that can learn from incomplete observations still be useful? Or would you argue that, because there are so many settings in which data is so plentiful, it's perfectly fine at this point to just throw away incomplete observations? submitted by /u/vandelay_inds [link] [comments]  ( 1 min )
    [D] Visualizing loss surface in input space
    I am trying to recreate the surfaces from this paper "Interpreting Adversarial Robustness: A View from Decision Surface in Input Space" https://arxiv.org/abs/1810.00144 They briefly mention the reasoning for using input space instead of parameter space in section 2.2 but don't describe how they produce such surfaces in detail. I understand the conventional approach (http://arxiv.org/abs/1712.09913) of projecting into a 2D hyperplane with 2 random (normalized) vectors in parameter space, but struggle to think of a way to do this with input images (e.g. CIFAR10 dataset). I have also failed to find any implementations of this approach. Any advice? submitted by /u/sarfins [link] [comments]  ( 1 min )
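    One way to approach this, as a minimal sketch: reuse the random-direction recipe from the parameter-space plots, but perturb a single input image instead of the weights. The directions used in the paper itself may differ (that is not stated in the post), and `toy_loss` below is a hypothetical stand-in for "loss of a trained classifier on one CIFAR10-sized input"; all function names are made up for illustration.

```python
import math
import random


def unit_vector(dim, rng):
    """Draw a random Gaussian direction and scale it to unit L2 norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def input_space_surface(loss_fn, x, d1, d2, radius, steps):
    """Evaluate loss_fn on the 2D slice x + a*d1 + b*d2, a and b in [-radius, radius]."""
    coords = [-radius + 2.0 * radius * i / (steps - 1) for i in range(steps)]
    surface = []
    for a in coords:
        row = []
        for b in coords:
            point = [xi + a * u + b * v for xi, u, v in zip(x, d1, d2)]
            row.append(loss_fn(point))
        surface.append(row)
    return coords, surface


def toy_loss(pixels):
    """Hypothetical placeholder: swap in model loss for a flattened image."""
    return sum((p - 0.5) ** 2 for p in pixels)


rng = random.Random(0)
dim = 3 * 32 * 32                 # one flattened CIFAR10-sized image
x = [0.5] * dim                   # the image around which the surface is drawn
d1 = unit_vector(dim, rng)
d2 = unit_vector(dim, rng)
coords, surface = input_space_surface(toy_loss, x, d1, d2, radius=1.0, steps=5)
```

    `surface[i][j]` can then be handed to any 3D or contour plotting tool with `coords` on both axes. Replacing `d1` with the normalized loss gradient with respect to the input is a common variation when the goal is to see the adversarial direction.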
    [P] Introducing BlindAI, an Open-source, fast and privacy-friendly AI deployment solution. Benefit from state-of-the-art AI without ever revealing your data!
    Hello everyone, We are pleased to introduce BlindAI to the AI community. BlindAI is an AI deployment solution, leveraging secure enclaves, to make remotely hosted AI models privacy friendly. Please have a look at our GitHub (https://github.com/mithril-security/blindai) to find out more! Motivation Today, most AI tools offer no privacy by design mechanisms, so when data is sent to be analysed by third parties, the data is exposed to malicious usage or potential leakage. We illustrate it below with the use of AI for voice assistants. Audio recordings are often sent to the Cloud to be analysed, leaving conversations exposed to leaks and uncontrolled usage without users’ knowledge or consent. ​ Before and after BlindAI By using BlindAI, data remains always protected as it is only decrypted inside a Trusted Execution Environment, called an enclave, whose contents are protected by hardware. While data is in clear inside the enclave, it is inaccessible to the outside thanks to isolation and memory encryption. This way, data can be processed, enriched, and analysed by AI, without exposing it to external parties. What you can do We have been able to run several state of the art models with privacy guarantees, enabling us to tackle complex scenarios, from privacy-friendly voice assistant with Wav2vec2, to confidential chest X-Ray analysis with ResNet, through document analysis with BERT. All of these models have been tested and can run with end-to-end protection under a second on an Intel(R) Xeon(R) Platinum 8370C. Model name Example use case Inference time (ms) DistilBERT Sentiment analysis 28.435 Wav2vec2 Speech to text 617.04 Facenet Facial recognition 47.135 A more detailed list of models we can deploy with privacy, with their run time, can be found here. If you like it drop a ⭐on our GitHub (https://github.com/mithril-security/blindai)! submitted by /u/Separate-Still3770 [link] [comments]  ( 1 min )
    [D] Best GP book
    Hi! As part of my research I’m using Gaussian Process (regression) but want to understand GPs more fully. Does anyone have any book recommendations that explain them well, either whole books on them or sections of wider machine learning textbooks? Thanks submitted by /u/Ikeyt [link] [comments]  ( 1 min )
    [D] Deep Sets and Attention, what's the difference?
    I have stumbled across the relevant literature on neural networks for sets, e.g. Deep Sets, PointerNet, and subsequent works. These architectures look similar to attention mechanisms and related models (e.g. Transformers), but they are treated as a very separate thing in the literature (e.g. in the Deep Sets paper the word "attention" never appears). So, I wonder, what is the main conceptual difference between Deep Sets and attention? I'm not very experienced with attention mechanisms, so I might be missing something in the big picture here. submitted by /u/fedetask [link] [comments]  ( 1 min )
    [D] Recent research and methods for time series forecasting
    Recent advances in Vision and NLP are dominating the AI community at the moment. I have been trying to find out if something exciting has been done for time series forecasting recently (last five years or so). Looking for some good starting points to keep up with the latest research. Any pointers would be highly appreciated. submitted by /u/ndalal01 [link] [comments]  ( 1 min )
    [D] What are some of the better one shot learning techniques for time series data?
    I am working on a project involving one-shot learning for time series data from dynamical systems. Which ML models should I look at for this purpose? submitted by /u/_hereforthecomments [link] [comments]  ( 1 min )
    [P] How to quantify the similarity/dissimilarity between two time series datasets?
    Hey guys, I was working with time series dataset for dynamical systems. I need to quantify how similar the datasets from two different dynamical systems are. So what would be the optimal parameter to do so? submitted by /u/_hereforthecomments [link] [comments]  ( 1 min )
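    One candidate measure, sketched here under assumptions: dynamic time warping (DTW) compares two trajectories while tolerating shifts and local stretching in time. Whether it is the "optimal" choice for a given pair of dynamical systems is not something the post settles; the function name and sample series below are made up for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] matched again
                                    cost[i][j - 1],      # b[j-1] matched again
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


series_a = [0.0, 1.0, 2.0, 1.0, 0.0]
series_b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, delayed by one step
series_c = [2.0, 2.0, 2.0, 2.0, 2.0]       # constant signal

shifted = dtw_distance(series_a, series_b)
different = dtw_distance(series_a, series_c)
```

    Here the delayed copy gets distance 0 and the constant signal gets a clearly larger value. For whole datasets rather than single series, one option is to average pairwise DTW distances between samples drawn from each system.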
    [D] finding the paper I want in the CVPR 2022.
    Hello guys, when I was looking for object detection papers at the conference, I was surprised to find that there were more than 2000 papers. I usually screen papers by reading the abstract, but this time, is my only option to open each paper and check whether my topics of interest are covered? Please let me know. If I have to do that, I will. Thanks. :) submitted by /u/Mundane_Definition_8 [link] [comments]  ( 1 min )
    [P] What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
    TL;DR We made autoregressive transformer-based models like T5-large 2X faster than 🤗 Hugging Face Pytorch with 3 simple tricks: storing 2 computation graphs in a single Onnx file 👯: this lets us have both cache and no-cache support without duplicating any weights. When the cache is used, attention switches from quadratic to linear complexity (less GPU computation) and Onnx Runtime brings us kernel fusion (fewer memory-bound ops); zero copy 💥 to retrieve output from Onnx Runtime: we leverage the Cupy API to access Onnx Runtime's internal CUDA arrays and expose them through Dlpack to Pytorch. It may sound a bit complex, but it lets us avoid copying output tensors, which limits our memory footprint and makes us much faster (check the notebook for other benefits of this approach); a generic tool to conv…  ( 4 min )
    [D] Calculating Shannon Information of Data Augmentation Strategies
    I recently caught Andrew Ng's 2021 talk on MLOps (MLOps: From Model-centric to Data-centric AI). At 26:40, he talks about calculating the effectiveness of cleaning your data (training examples) vs. collecting new examples. Apparently this is possible to calculate using Shannon Information. As an example, Ng notes that "[if you have] 500 pictures of iguanas and if 12% of the labels are noisy (incorrectly labelled), then from a Shannon Information calculation, you can show that [cleaning up the noisy data and collecting another 500 new examples] are about equally effective.". Does anyone know how exactly this might be calculated? I've tried to find more information on this but without much luck. Intuitively, it seems related to Shannon's Limit but the exact application is unclear to me. Thanks. Edit: Cross-posted to Cross Validated submitted by /u/Academy- [link] [comments]  ( 3 min )
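    One possible reading of the claim, offered as a guess rather than as the calculation from the talk: treat each binary label as one use of a binary symmetric channel whose flip probability equals the label-noise rate. The channel capacity, 1 - H(p) bits per label, then gives a rough "information per example" figure. The binary-label and symmetric-noise assumptions are mine, as are the function names.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bits_per_label(noise_rate):
    """Capacity of a binary symmetric channel with the given flip probability."""
    return 1.0 - binary_entropy(noise_rate)


n_examples = 500
noise_rate = 0.12

noisy_bits = n_examples * bits_per_label(noise_rate)              # current dataset
cleaned_bits = n_examples * bits_per_label(0.0)                   # after fixing labels
doubled_noisy_bits = 2 * n_examples * bits_per_label(noise_rate)  # 500 more noisy examples

print(round(noisy_bits, 1), round(cleaned_bits, 1), round(doubled_noisy_bits, 1))
```

    Under these assumptions, 500 labels at 12% noise carry about 235 bits, cleaning them yields 500 bits, and doubling the noisy dataset yields about 471 bits, so the two interventions land in the same ballpark. That is consistent with the "about equally effective" statement, but it does not confirm that this is the calculation Ng had in mind.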
    [D] What are the environmental effects of using a pre-trained ML model powered by a GPU?
    The Internet has a lot of information concerning the environmental effects of training an ML model with GPUs. But less is available in terms of the emissions generated by running a trained model on a GPU. For example, Google has a language model called BERT. If I was to deploy it on a web server powered by a GPU, would the emissions be close to nil or would it actually harm the environment? submitted by /u/Alternative-Pause-14 [link] [comments]  ( 1 min )
    [D] ANN architecture for decoupling Signals
    I have read multiple papers, and they never attach their source code, like here: https://www.science.org/doi/10.1126/scirobotics.abc6878 My problem is that I have no idea how to apply an ANN to decouple different measured signals. I don't think I am really able to digest what these papers did, so I would appreciate it if someone could help me understand how to implement these algorithms from scratch. Thanks submitted by /u/meldiwin [link] [comments]  ( 1 min )
  • Open

    DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner?
    It is hard, looking at the current technological landscape, to believe that artificial intelligence may actually be facing a reckoning that could cause investment in the field to dry up. If you go by the press releases and even the many products that supposedly incorporate artificial-intelligence-oriented products, it would seem that computers capable of thinking… Read More »DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner? The post DSC Weekly 24 May 2022: Is an AI Autumn Around the Corner? appeared first on Data Science Central.  ( 6 min )
    Top Ways in Which Data Science Improves E-Commerce Sales
    Data science refers to the use of algorithms, systems, technology, etc. to glean insights from data of all kinds. And, Machine learning (ML) and artificial intelligence (AI) make it possible for the shoppers with predictions based on what they like even before they decide to look for a specific product offering. Data science has been… Read More »Top Ways in Which Data Science Improves E-Commerce Sales The post Top Ways in Which Data Science Improves E-Commerce Sales appeared first on Data Science Central.  ( 3 min )
    Why is the Gig Economy a New Future?
    The expression “Gig Economy” alludes to an unrestricted economy framework where project work, term agreements, and brief positions are pervasive. The associations will enlist autonomous consultants or experts for momentary responsibilities. The expression “gig” is a shoptalk word for a task that endures a predefined period that was first made well known by artists or… Read More »Why is the Gig Economy a New Future? The post Why is the Gig Economy a New Future? appeared first on Data Science Central.  ( 4 min )
    Lessons Learned from Writing My First Python Script
    After 25 Years of Coding in C And Perl. As an independent author/researcher, there is of course nothing in my “job description” that says I should code in Python (or any other language). Yet for a long time, I thought coding in Python would help me a lot. It would mean more readers, and thus… Read More »Lessons Learned from Writing My First Python Script The post Lessons Learned from Writing My First Python Script appeared first on Data Science Central.  ( 6 min )
    Russian Troll Detection By Their Tweets
    This project for me was personal. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. “Everybody always lies” had been the norm. I came to the… Read More »Russian Troll Detection By Their Tweets The post Russian Troll Detection By Their Tweets appeared first on Data Science Central.  ( 4 min )
    Importance of Fearlessness to Exploit the Potential of AI
    “Courage is not the absence of fear, but rather the judgment that something else is more important than fear.  The brave may not live forever, but the cautious do not live at all”– Philippe Renaldi, The Princess Diaries (2001) Optimizing versus innovating…the simple difference between surviving versus thriving. Yes, it is always easier to apply… Read More »Importance of Fearlessness to Exploit the Potential of AI The post Importance of Fearlessness to Exploit the Potential of AI appeared first on Data Science Central.  ( 7 min )
  • Open

    What is the current SOTA for Online RL?
    Hi everyone! Been reading a lot about RL and was wondering what the current SOTA models are for online RL. I have noticed that more and more models are starting to use transformers. Is this also where SOTA online RL models are going? Any ideas / opinions about the future direction of this field would also be highly appreciated. Cheers submitted by /u/Idonai [link] [comments]  ( 1 min )
    Do you know any environments in which both the direction and the position of the agent are tracked?
    In the simple spread env, only position is tracked (https://github.com/openai/multiagent-particle-envs/blob/47e9ee38e605f8a563370b3c7e52a349eca3f6b1/multiagent/scenarios/simple_spread.py#L40) submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    How to make game based RL research paper worthy?
    I am completely new to RL, but I am experienced with NNs. I decided to do my thesis on RL to take it as a learning opportunity, but I'm beginning to find it quite intimidating to create a sim environment and apply RL, given that I'm inexperienced with RL and have little time for the thesis. Alternatively, games have been suggested to me, which are also fun to work with and learn from, but I'm struggling to come up with a thesis statement that I can also use for a publication. Some games I have considered are Go, Bomberland (a Bomberman knock-off from coderone), and Kore (https://www.kaggle.com/competitions/kore-2022). I am also open to other suggestions. submitted by /u/npc1111 [link] [comments]  ( 1 min )
    Exploration strategy for aerial robotics.
    In continuous-action tasks where slight exploration can impact the environment drastically (as with aerial robots), what exploration strategy can be used to safely explore the action space? I am currently using the DDPG algorithm for position tracking in drones, with Gaussian noise for exploration. The rewards for this tracking task are sparse, as large negative rewards can make training unstable. So what can I use as the exploration strategy in this scenario? submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    Is DQN capable of 'solving' random dungeon traversal of unknown length and start/end positions?
    I'm interested in implementing DQN for a dungeon crawler I play. You are given a 2D map with your position as the central point, and you need to traverse to the next zone; the map is limited in scope and is slowly revealed as you move along. There is a map-based marker for the entrance to the next zone. Since it is a dungeon of random size with random start/end positions, and there is no way to generate a reward until the agent gets to the next zone (i.e., the max overall reward is 1), is it possible for the agent to learn a policy in this scenario? submitted by /u/IFartedAndMyDickHurt [link] [comments]  ( 2 min )
    More threads, training crashes
    On one computer, I am running PPO on Swimmer. With 16 threads, training crashes after 1 hour. With fewer than 16 threads, the agent trains smoothly and well. https://preview.redd.it/fivxnkg11c191.png?width=727&format=png&auto=webp&s=4047c1b516d514619f0b73e38efc38d04bab5135 submitted by /u/OkSkirt5714 [link] [comments]  ( 1 min )
  • Open

    What is Real in a World of Social Constructs
    In today’s episode of Future Tech. I ask the question, what is real? Because in this world of social constructs, everything is created by… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
    Drupal Commerce: Why is it Ideal for Your E-Commerce Business
    The world of e-commerce is undeniably among the most lucrative sectors in the world at the moment and is expected to remain so for the…  ( 2 min )
  • Open

    The hype around DeepMind’s new AI model misses what’s actually cool about it
    submitted by /u/KelliaMcclure [link] [comments]  ( 1 min )
    In this tutorial, we show how to automate job search using a fine-tuned NER model. Check out the article to learn more!
    submitted by /u/UBIAI [link] [comments]
    Questionnaire for my dissertation on the subject of A.I
    Hey guys! I'm a student and I'm currently working on my dissertation for University. I'm using this as a way of collecting data on the representation of AI in movies and pop culture and I'd appreciate the responses! Here's the link: https://forms.gle/1jrzrfuSd3rFD6A17 submitted by /u/ZakUllah [link] [comments]
    Working on cognitive control in an artificial cognitive entity - when a machine can think about anything, how do you control what it thinks about and why?
    submitted by /u/DavidKShapiro [link] [comments]  ( 1 min )
    Saiyan transformation
    submitted by /u/Due-Ad9795 [link] [comments]
    MELODIES POSITIVE: Flowers from coconut cap.
    submitted by /u/cookingandcraft [link] [comments]
    6 Best Artificial Intelligence courses for Healthcare You should learn 2022
    submitted by /u/maneesh123456 [link] [comments]
    Unsupervised Tokenization Learning
    submitted by /u/akolonin [link] [comments]  ( 1 min )
    I can't wait for Dall E 2, but this will do for now!
    submitted by /u/BeginningRealistic49 [link] [comments]
    Google Brain's new model Imagen is incredible!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Fast.ai vs statistics.com?
    I’m currently doing a math and cs bachelors. I want to move into AI research (Something that is applicable currently but also contributes to the AGI field ideally) in the near future. Which site/resources should I use to learn relevant knowledge? Are there better ones? submitted by /u/BackgroundSense351 [link] [comments]  ( 1 min )
  • Open

    Image-Text Pre-training with Contrastive Captioners
    Posted by Zirui Wang and Jiahui Yu, Research Scientists, Google Research, Brain Team Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large s…  ( 7 min )
  • Open

    Powering Next Generation Applications with OpenAI Codex
    Codex is now powering 70 different applications across a variety of use cases through the OpenAI API.  ( 3 min )
  • Open

    Is there any open public API like thispersondoesnotexist.com where I can specify gender and age?
    submitted by /u/AxelTheRabbit [link] [comments]
    How to fix cardinality error in my CNN model
    I am currently working on a CNN model where my inputs are CSVs with 1080 rows and 4 attributes (columns). I have all the CSVs under their own category directories, for example /Category A/..., /Category B/... etc. At the start I create two arrays: X = [] Y = [] and then, in a for loop going through all the directories, I read the contents of each CSV in a `1080x4` shape and put it in my X array like this (I have confirmed the values are correctly read and in the correct shape): X.append(pandas.read_csv(itemPath).values) and I add the category of the item to my Y array right after, so the order of the `X` items is aligned with the order of the `Y` items (categories of X values). Y.append(cat) Then here is my model; I got it mostly from a Kaggle example and just wanted to see if it works …  ( 4 min )
  • Open

    NVIDIA Brings Data Center, Robotics, Gaming, Content Creation Innovations to COMPUTEX
    Digital twins that revolutionize the way the most complex products are produced. Silicon and software that transforms data centers into AI factories. Gaming advances that bring the world’s most popular games to life. Taiwan has become the engine that brings the latest innovations to the world. So it only makes sense that NVIDIA leaders brought Read article > The post NVIDIA Brings Data Center, Robotics, Gaming, Content Creation Innovations to COMPUTEX appeared first on NVIDIA Blog.  ( 7 min )
    NVIDIA Adds Liquid-Cooled GPUs for Sustainable, Efficient Computing
    In the worldwide effort to halt climate change, Zac Smith is part of a growing movement to build data centers that deliver both high performance and energy efficiency. He’s head of edge infrastructure at Equinix, a global service provider that manages more than 240 data centers and is committed to becoming the first in its Read article > The post NVIDIA Adds Liquid-Cooled GPUs for Sustainable, Efficient Computing appeared first on NVIDIA Blog.  ( 4 min )
    NVIDIA Partners Announce Wave of New Jetson AGX Orin Servers and Appliances at COMPUTEX
    More than 30 leading technology partners worldwide announced this week the first wave of NVIDIA Jetson AGX Orin-powered production systems at COMPUTEX in Taipei. New products are coming from a dozen Taiwan-based camera, sensor and hardware providers for use in edge AI, AIoT, robotics and embedded applications. Available worldwide since GTC in March, the NVIDIA Read article > The post NVIDIA Partners Announce Wave of New Jetson AGX Orin Servers and Appliances at COMPUTEX appeared first on NVIDIA Blog.  ( 3 min )
    Master of Arts: NVIDIA RTX GPUs Accelerate Creative Ecosystems, Delivering Unmatched AI and Ray-Tracing Performance
    The future of content creation was on full display during the virtual NVIDIA keynote at COMPUTEX 2022, as the NVIDIA Studio platform expands with new Studio laptops and RTX-powered AI apps — all backed by the May Studio Driver released today. The post Master of Arts: NVIDIA RTX GPUs Accelerate Creative Ecosystems, Delivering Unmatched AI and Ray-Tracing Performance appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    President Guðni Thorlacius Jóhannesson of Iceland visits MIT
    Delegation meets campus leaders, with an eye toward AI applications and the Icelandic language.  ( 6 min )

  • Open

    Our own stupidity might be an advantage for AI
    We are just smart enough to deal with problems logically, but too stupid to understand certain information hazards. Our "just good enough" understanding could be used where AI can't go. submitted by /u/burner557799 [link] [comments]
    What are the operating systems with the most AI technology built in right now?
    submitted by /u/Scared_Assistance_28 [link] [comments]
    New Qualcomm RB6 AI Robot Cloud Accelerator Has Quad Core 2.85Ghz @ 200 Trillion Operations Per Second | Breakthrough Neural Network TPU Architecture
    submitted by /u/SlightSituation [link] [comments]  ( 1 min )
    When AI Meets the Transgender Community
    submitted by /u/punkthesystem [link] [comments]  ( 1 min )
    Meta's MyoSuite — An embodied AI platform that unifies neural and motor intelligence
    submitted by /u/SpatialComputing [link] [comments]  ( 1 min )
    Stanford And Oxford Researchers Propose An Approach To Relate Transformers To Models And Neural Representations Of The Hippocampal Formation
    In recent years, a significant part of neuroscience research has focused on relating deep learning architectures to the human brain, and many deep learning (DL) techniques have recently been shown to replicate neural firing patterns observed in the brain. For example, representations of convolutional neural networks have been shown to predict neurons in the visual cortex and inferior temporal cortex, while recurrent neural networks have been shown to recapitulate grid cells in the medial entorhinal cortex. The ability to use machine learning models to predict brain representations allows for a deeper understanding of the mechanistic computations of the respective brain areas and a deeper understanding of the nature of the models. However, one of the most exciting and promising new architec…  ( 2 min )
    Spring (A. I. animation + sound design)
    submitted by /u/nenomancer [link] [comments]  ( 1 min )
    Frame interpolation that allows for multiple sources?
    I'm working with material that was originally mastered in 29.97fps, then converted to 25fps for PAL territories. The PAL source is the best looking copy, in terms of detail, fewer digital artifacts, resolution, and colour - but it's obviously missing frames. Rather than just using AI to create new frames, is there a tool that will allow me to introduce multiple sources so the AI knows what those missing frames should best look like? submitted by /u/Plebsolute [link] [comments]  ( 1 min )
    Interesting AI
    Hello, I plan on giving a TEDx talk on AI because I love AI and I have spent a lot of time learning AI, Machine Learning, and Deep Learning (with all the math). I thought of talking about interesting AI systems that currently exist, but there are so many that I can't decide. I would love to hear your ideas on your favourite and the most interesting AI you have seen that I could talk about. All responses are appreciated. Thanks submitted by /u/No-Conversation8169 [link] [comments]  ( 1 min )
    Superhuman AI - Artificial intelligence predicts patients’ race from their medical images
    submitted by /u/qptbook [link] [comments]
    Machine learning has a backdoor problem
    submitted by /u/bendee983 [link] [comments]
    The StatQuest Illustrated Guide To Machine Learning eBook
    The StatQuest Illustrated Guide To Machine Learning submitted by /u/Futureisnotsecure [link] [comments]
    In The Latest AI Research, CMU And Adobe Researchers Propose An Elegant Ensembling Mechanism For GAN Training That Improves FID by 1.5x to 2x On The Given Dataset
    Image generation necessitates the ability to capture and model complicated statistics in real-world visual events. When trained on large-scale data, computer vision models have proven adept at capturing valuable representations, thanks to the effectiveness of supervised and self-supervised learning techniques. Surprisingly, despite the aforementioned link between synthesis and analysis, state-of-the-art generative adversarial networks (GANs) are trained in an unsupervised way, without the use of such pre-trained networks. This is a squandered opportunity, given the abundance of relevant models readily available in the research ecosystem. In a recent publication, Adobe researchers investigated the use of a collection of pretrained deep feature extractors to aid in generative model training. GANs are trained with a discriminator and a generator, both of which aim to continuously learn the relevant statistics that distinguish real from generated data. Paper: https://arxiv.org/pdf/2112.09130.pdf Github: https://github.com/nupurkmr9/vision-aided-gan Project: https://www.cs.cmu.edu/~vision-aided-gan/ https://i.redd.it/y33lpsbpg6191.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Flower pot.
    submitted by /u/cookingandcraft [link] [comments]
    15 Machine Learning Project (End to End)
    Hi Guys, Free tutorial on Machine Learning Project (End to End) in Apache Spark and Scala with Code and Explanation: Machine Learning Pipeline Application on Power Plant; Build Movies Recommendation Engine; Sales Prediction or Sale Forecast; Mushroom Classification whether it’s edible or poisonous; Predict Forest Cover; Predict Will it Rain Tomorrow in Australia; Customer Segmentation using Machine Learning; Predict Ads Click (93% Accuracy); Prediction task is to determine whether a person makes over 50K a year; Classifying gender based on personal preferences; Mobile Price Classification; Predicting the Cellular Localization Sites of Proteins in Yeast; YouTube Spam Comment Prediction; Identify the Type of animal (7 Types) based on the available attributes; Glass Identification; Predicting the age of abalone from physical measurements. I hope you'll enjoy these tutorials. submitted by /u/bigdataengineer4life [link] [comments]  ( 1 min )
    You are water-based, not silicon-based, life. Be like water.
    The late Terence McKenna, in one of his many talks, pointed out the two choices on the menu when it comes to why we are here. One is the Christian belief that God created the world in six days. The other one that a Big Bang happened some billions years ago and the universe sprang out of nothingness - as a random happening. His point was this: Both of these explanations are utterly improbable. Still if you hold the Christian belief, then you do not question it and hold it as "the truth". Likewise if you believe in the Big Bang explanation you hold that as truth. From other cultures we learn that the world actually resides on the back of a giant turtle. Sounds crazy right? .. except if you are brought up with that belief. Could the important thing be WHO YOU BECOME by believing a given…  ( 2 min )
  • Open

    [P] Imagen: Latest text-to-image generation model from Google Brain!
    Imagen - unprecedented photorealism × deep level of language understanding Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Human raters prefer Imagen over other models (such as DALL-E 2) in side-by-side comparisons, both in terms of sample quality and image-text alignment. https://gweb-research-imagen.appspot.com/ https://gweb-research-imagen.appspot.com/paper.pdf submitted by /u/aifordummies [link] [comments]  ( 1 min )
    [D] ECCV 2022 Reviews
    Now ECCV 2022 reviews are out, what is your general feeling about the quality of reviews? submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [Discussion] Natural Language as an intermediate link for solving not-NLP machine learning problems
    Hello. Recently I stumbled upon the following RL paper: https://arxiv.org/abs/2202.08938, and it made me super curious. In their experiments, an agent acts in a simple video game setup, and they use textual descriptions of the environment as additional information for this agent, to help it perform novelty search more efficiently, because textual descriptions express the real novelty of a new environment more meaningfully than raw data. As they write: "we explore natural language as a general medium for highlighting relevant abstractions in an environment". After reading it, I started to wonder whether there are any other papers where natural language descriptions are used as an intermediate tool to help a machine learning model perform some non-NLP task. Maybe someone has even tried to teach an NLP model to generate the most useful hints for a non-NLP model to solve non-NLP tasks? I would be grateful for any info about this research direction. Thanks! submitted by /u/KushnarevaL [link] [comments]  ( 1 min )
    [D] I want to compare training a CNN on dataset A vs. pretraining on a bigger dataset B and then finetuning on dataset A. Should I use the same learning rate for both experiments, or a reduced learning rate for finetuning?
    I have read that it is common to use a reduced learning rate when finetuning on a new dataset for transfer learning. On the other hand, it does not seem like good practice to do that and then compare against the plain training run, which was performed with a larger learning rate. submitted by /u/TheManveru [link] [comments]  ( 1 min )
    [D] Scaling sentence embeddings for PCA
    I’m doing PCA for different sentence embeddings (word2vec, BERTweet, InferSent…) of my data. My question is, should I scale these embeddings before putting them into PCA? I know it’s standard practice in ML when using PCA, but idk if it still holds for sentence embeddings. Also, if I should, will a standard scaler be ok? submitted by /u/No_Technology1455 [link] [comments]  ( 1 min )
    [D] What's the intuition behind certain CNN architectures?
    My knowledge about CNNs is relatively basic. I know what they do, but I have not enough experience to understand the different choices in architectures (besides the obvious improvements, like a residual block). For what I do I never needed to rely on CNNs, so whenever I worked with them I just used some existing CNN structures, but now I would like to learn more about it, so that I can optimize my current models. So for example looking at some CNNs used in reinforcement learning, here's one in particular (the CNN from the Atari self-learning Nature article): self.base = nn.Sequential( init_(nn.Conv2d(4, 32, kernel_size=8, stride=4, padding=0)), nn.ReLU(), init_(nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0)), nn.ReLU(), init_(nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=0)), nn.ReLU(), Flatten(), init_(nn.Linear(32 * 7 * 7, outputs)), nn.ReLU() ) The input is a stack of 4 frames with 84x84 grayscale pixels. I understand that you'd want to divide this image into many different smaller fields, but what's the intuition of doing it in such a way, e.g. 3 layers of CNNs? Why not 5? Why not 2 or 1, but with more outputs? In fact, it seems to me that a parallel approach with different parameters, instead of a sequential approach would be superior, since it seems to me that information would get lost after the first layer that uses a kernel size of 8. Are there any resources that you could recommend and that delve deeply into the nuances when creating CNN architectures? Thank you submitted by /u/NikEy [link] [comments]  ( 7 min )
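    Part of the intuition can be made concrete with a bit of arithmetic. The sketch below traces how the 84x84 input shrinks through the three convolutions quoted above and how large a patch of the original frame each output unit "sees" (its receptive field). It uses the standard output-size formula for unpadded convolutions; the helper name is made up for illustration.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel_size, stride) of the three Conv2d layers in the quoted snippet
layers = [(8, 4), (4, 2), (3, 1)]

size = 84             # input frames are 84x84
receptive_field = 1   # input pixels seen by one unit, along one dimension
jump = 1              # input-pixel distance between neighbouring units
trace = []
for kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    receptive_field += (kernel - 1) * jump
    jump *= stride
    trace.append((size, receptive_field))

flattened = 32 * size * size  # 32 channels after the last conv in the snippet
print(trace, flattened)
```

    The spatial size goes 84 -> 20 -> 9 -> 7, which is where the `32 * 7 * 7` in the `nn.Linear` layer comes from, and the receptive field grows 8 -> 20 -> 36 pixels. Stacking layers is therefore a cheap way to let late units respond to large regions of the frame while early units stay local; a single layer would need a 36x36 kernel to cover the same area.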
    AI book reading platform using machine learning [P]
    Greetings folks, In the past year, we released a book-reading AI tool to search for content within files using natural search, and we received constructive feedback from the machine learning community. We are now releasing an updated version with a fresh UI overhaul (desktop support): https://rastero.io/books/explore This project uses the sentence transformers library from UKP labs found here. Glad to share it with you all. Here's an account to try it out! username: reddit password: reddit2022 A demo of the interface for semantic-similarity-based search. submitted by /u/deep_ak [link] [comments]  ( 1 min )
    [Discussion] What are frameworks used for Human-In-The-Loop (Active) learning ?
    We have a new request from our product team : "Develop a human in the loop binary classifier for a fashion application. The classifier should learn the themes of photos and classify the future data with high degree of confidence. " The human in the loop will be an annotator (or) expert whom we can query to label the examples (of mini-batches). The number of queries is limited by confidence score and budget. Are there any tools/frameworks for building such (active learning) applications ? submitted by /u/UncertainLangur [link] [comments]  ( 1 min )
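    One common way to choose which examples go to the annotator is uncertainty sampling: query the examples the classifier is least confident about, subject to a confidence threshold and a labeling budget. The sketch below is a hedged illustration in plain Python and is not tied to any framework. The names select_queries, confidence_threshold, and budget are hypothetical.

```python
def select_queries(probabilities, confidence_threshold=0.8, budget=2):
    """Pick indices of the least confident binary predictions to label.

    probabilities: predicted P(class = 1) for each unlabeled example.
    An example is a query candidate when max(p, 1 - p) is below the
    threshold; at most `budget` candidates are returned, least confident first.
    """
    confidence = [(max(p, 1.0 - p), i) for i, p in enumerate(probabilities)]
    candidates = sorted(c for c in confidence if c[0] < confidence_threshold)
    return [i for _, i in candidates[:budget]]

probs = [0.95, 0.55, 0.10, 0.40, 0.70]
queries = select_queries(probs)
print(queries)  # [1, 3]
```

    After the annotator labels the returned examples, they would be added to the training set and the classifier retrained before the next round of queries.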
    [D] Anyone waiting for ECCV reviews?
    They should come out today, but I guess they are delayed. submitted by /u/SeucheAchat9115 [link] [comments]
    [P] CTranslate2: an efficient inference engine for Transformer models
    Hi! I'd like to share this project I've been working on for almost 4 years: https://github.com/OpenNMT/CTranslate2 CTranslate2 is a C++ and Python library for efficient inference with Transformer models. While the project initially focused on translation models (hence the name), it also supports autoregressive language models such as GPT-2 and the recent OPT models from Meta. The library comes with a highly optimized runtime that implements various performance optimization techniques such as weight quantization, layer fusion, batch reordering, padding removal, etc. (Check out these benchmark tables for a comparison with other frameworks on a translation task.) We provide model converters for multiple frameworks: OpenNMT, Fairseq, Marian, and Hugging Face's Transformers. You can also add your own converter if the model architecture is supported. We currently support selected variants of encoder-decoder and decoder-only Transformer models (including pre-norm and post-norm). If you'd like to learn more, please visit the GitHub repository and feel free to post questions or suggestions about the project! Thanks! submitted by /u/guillaumekln [link] [comments]  ( 1 min )
    [R] Self-Net: Lifelong Learning Via Continual Self-Modeling
    submitted by /u/EducationalCicada [link] [comments]
    [N] ShiftHappens Workshop @ICML 2022 welcoming submissions & AMA
    I'm one of the organizers of the ShiftHappens Workshop hosted at ICML 2022. The goal of the workshop is to create a community-built robustness benchmark, incorporating new challenging datasets and tasks. Submissions can be ImageNet-scale datasets or evaluation tasks that help find potentially unforeseen behaviours of tested models. All accepted contributions will be part of the benchmark and authors can become co-authors of a summary paper. To make it easy to contribute, we now also accept submissions in the form of extended abstracts, where an interesting idea for a dataset or task is explained in a one-page paper format; see also our latest tweet! We welcome datasets and tasks published at previous conferences as well as novel and work-in-progress papers. So if you, for example, evaluated your training method on a new task highlighting its advantages, we would be happy to receive that task as a submission :) If you didn't, share this with friends and colleagues who might have ;) Also check out our strong line-up of invited speakers, which you might not want to miss if you are at the ICML conference this year! Some of the other organizers and I are around to answer any questions about submitting, data shift and the workshop in general. submitted by /u/JBitterwolf [link] [comments]  ( 1 min )
    [R] Nothing makes sense in deep learning, except in the light of evolution
    submitted by /u/DevFRus [link] [comments]  ( 1 min )
    [D] - Is it safe to force stop, save weights and later resume a Tensorflow model during training?
    So I am working with a Unet (CNN) architecture with an Adam optimizer. Occasionally I have force-stopped the model, saved the weights, then resumed the training later on. When I resume, the loss plot looks less smooth than it did earlier, but the loss also seems to have decreased significantly. Is it generally safe to stop and resume model training in this way? Also, I'm not sure how to interpret the decreased but less smooth loss function. submitted by /u/suoarski [link] [comments]  ( 2 min )
    ICML 2022 papers with affiliations [D]
    https://icml.cc/Conferences/2022/AcceptedPapersInitial On a quick ctrl-F : Tsinghua (157) overtakes Stanford (139). In 2021, these values were Tsinghua (71) and Stanford (140). http://web.archive.org/web/20210618094315/https://icml.cc/Conferences/2021/AcceptedPapersInitial ^ 2021 for comparison submitted by /u/Simping4Kaiming [link] [comments]
    [D] ML library (ideally for R) with a good software design?
    Hi! As a Machine Learning Engineer, I was studying the design patterns behind scikit-learn's API (you can see here and here) and I was wondering if any of you know of something similar but for R that I can check. Note: I am asking about R because that's what I am using and it is difficult to find something functional programming oriented for other languages, but any other library you find interesting is welcome! submitted by /u/Silver_Book_938 [link] [comments]  ( 2 min )
    [D] How to obtain frequencies of English word phrases?
    I know one way to calculate frequencies of English word phrases such as "it's raining today" is to simply search for that phrase in a large text corpus, but would anyone happen to know of any simpler/newer ways to do that? I'm wondering if it's somehow possible to calculate the frequencies of word phrases using a large language model such as GPT-3. submitted by /u/Specialist_Art_8502 [link] [comments]  ( 1 min )
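    The corpus-search approach mentioned in the post can be sketched with the standard library alone. This is a minimal illustration on a toy string, not a recommendation of a particular corpus or tokenizer. The function name phrase_counts is made up, and the simple regular-expression tokenizer ignores sentence boundaries.

```python
import re
from collections import Counter

def phrase_counts(text, n):
    """Count every n-word phrase in `text` (lowercased, apostrophes kept)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

corpus = "It's raining today. Yesterday it was sunny, but it's raining today too."
counts = phrase_counts(corpus, 3)
total = sum(counts.values())

# Absolute count and relative frequency among all 3-word phrases.
print(counts["it's raining today"], counts["it's raining today"] / total)  # 2 0.2
```

    On a real corpus the same counting would be run over many documents, and the relative frequency would depend heavily on which corpus is used.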
  • Open

    Breakthrough Stashing Algorithm & Neural Network TPU Architecture
    submitted by /u/getrich_or_diemining [link] [comments]
    How I (Kinda) Created an A.I. Generated Cartoon
    submitted by /u/BasicallyJustASpider [link] [comments]  ( 1 min )
    The StatQuest Illustrated Guide To Machine Learning eBook
    submitted by /u/Futureisnotsecure [link] [comments]  ( 1 min )
  • Open

    Voiced and unvoiced consonants and digits
    The latest episode of The History of English Podcast discusses the history of pronunciation changes in the Elizabethan period. The episode has a lot to say about the connections between voiced and unvoiced pairs of consonants, and the circumstances under which a consonant might change from voiced to unvoiced and vice versa. The major mnemonic […] Voiced and unvoiced consonants and digits first appeared on John D. Cook.  ( 2 min )
    Year share
    This post will be about psychology as much as math, looking at a number of algorithms for mentally calculating the same function. The most difficult part of mentally computing days of the week is computing ⌊5y/4⌋ % 7 where y is the last two digits of a year. This quantity is called the year share […] Year share first appeared on John D. Cook.  ( 3 min )
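    As a quick illustration of the quantity described above, the year share can be computed directly, and it can also be computed as (y + ⌊y/4⌋) % 7, because ⌊5y/4⌋ = y + ⌊y/4⌋ for any integer y. The function names below are chosen here for illustration and do not come from the post.

```python
def year_share(y):
    """Year share: floor(5y/4) mod 7, for y the last two digits of a year."""
    return (5 * y // 4) % 7

def year_share_alt(y):
    """Same value, since floor(5y/4) = y + floor(y/4) for integer y."""
    return (y + y // 4) % 7

# The two forms agree for every two-digit year.
agree = all(year_share(y) == year_share_alt(y) for y in range(100))
print(agree)           # True
print(year_share(22))  # 6, because floor(110 / 4) = 27 and 27 % 7 = 6
```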
  • Open

    What are some recent or timeless interesting papers on approaches to dynamic pricing in RL? I'm thinking of Uber surcharging or limited-supply, queue-length-dependent pricing policies to maximize revenue.
    submitted by /u/n3ver_summer [link] [comments]  ( 1 min )
    Is it possible to use the MuJoCo Gym environments with the new Python binding?
    It appears Gym uses the old Python integration "mujoco-py" rather than the new official one (https://mujoco.readthedocs.io/en/latest/python.html). Is it possible to use the Gym environments with the new Python binding? If not, will Gym be updated to support the new official Python binding? submitted by /u/arckonte [link] [comments]  ( 1 min )
    7 Best Keras Online Courses for Deep Learning
    submitted by /u/MlTut [link] [comments]
    Does semi-gradient TD(lambda) + experience replay make sense?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    How HRs Can Use Recruitment Data Analytics for Better Hiring
    Nearly every forward-thinking organization uses analytics in recruitment to bring efficiency to its hiring process. A significant…  ( 3 min )
  • Open

    (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools
    It’s a well-known challenge that large language models (LLMs), growing in popularity thanks to their adaptability across a variety of applications, carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training. Content moderation tools can be deployed to […] The post (De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools appeared first on Microsoft Research.  ( 10 min )
    Partnering people with large language models to find and fix bugs in NLP systems
    Advances in platform models—large-scale models that can serve as foundations across applications—have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating “Eu não recomendo este prato” (I don’t recommend this dish) in Portuguese to “I highly […] The post Partnering people with large language models to find and fix bugs in NLP systems appeared first on Microsoft Research.  ( 10 min )
  • Open

    Energy Grids Plug into AI for a Brighter, Cleaner Future
    Electric utilities are taking a course in machine learning to create smarter grids for tough challenges ahead. The winter 2021 megastorm in Texas left millions without power. Grid failures the past two summers sparked devastating wildfires amid California’s record drought. “Extreme weather events of 2021 highlighted the risks climate change is introducing, and the importance […] The post Energy Grids Plug into AI for a Brighter, Cleaner Future appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Why Should the Retail Sector Move to Cloud Computing?
    What is cloud computing? It is one of the foremost technological innovations of the 21st century and has been adopted into mainstream use faster than any other technology. Simply put, it’s the ability to store and run software programs on cloud platforms rather than on local servers or computers.… The post Why Should the Retail Sector Move to Cloud Computing? appeared first on Data Science Central.  ( 4 min )
  • Open

    Deconfounding Actor-Critic Network with Policy Adaptation for Dynamic Treatment Regimes. (arXiv:2205.09852v1 [cs.LG])
    Despite intense efforts in basic and clinical research, an individualized ventilation strategy for critically ill patients remains a major challenge. Recently, dynamic treatment regime (DTR) with reinforcement learning (RL) on electronic health records (EHR) has attracted interest from both the healthcare industry and machine learning research community. However, most learned DTR policies might be biased due to the existence of confounders. Although some treatment actions non-survivors received may be helpful, if confounders cause the mortality, the training of RL models guided by long-term outcomes (e.g., 90-day mortality) would punish those treatment actions causing the learned DTR policies to be suboptimal. In this study, we develop a new deconfounding actor-critic network (DAC) to learn optimal DTR policies for patients. To alleviate confounding issues, we incorporate a patient resampling module and a confounding balance module into our actor-critic framework. To avoid punishing the effective treatment actions non-survivors received, we design a short-term reward to capture patients' immediate health state changes. Combining short-term with long-term rewards could further improve the model performance. Moreover, we introduce a policy adaptation method to successfully transfer the learned model to new-source small-scale datasets. The experimental results on one semi-synthetic and two different real-world datasets show the proposed model outperforms the state-of-the-art models. The proposed model provides individualized treatment decisions for mechanical ventilation that could improve patient outcomes.  ( 2 min )
    Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. (arXiv:2205.08078v2 [cs.LG] UPDATED)
    Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image-related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For the non-linear dot-product self-attention, and alternative mechanisms such as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality. The convex programs lead to block nuclear-norm regularization that promotes low rank in the latent feature and token dimensions. In particular, we show how self-attention networks implicitly cluster the tokens based on their latent similarity. We conduct experiments for transferring a pre-trained transformer backbone for CIFAR-100 classification by fine-tuning a variety of convex attention heads. The results indicate the merits of the bias induced by attention compared with the existing MLP or linear heads.  ( 2 min )
    Let the Model Decide its Curriculum for Multitask Learning. (arXiv:2205.09898v1 [cs.LG])
    Curriculum learning strategies in prior multi-task learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching for the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., dataset-level and instance-level, differ in granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.  ( 2 min )
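    The difference in granularity between the two classes can be illustrated with a toy ordering. This is only a sketch under stated assumptions: the difficulty scores are hypothetical numbers (the abstract does not say how they are computed), and the easiest-first ordering is the classic curriculum convention rather than something the abstract confirms.

```python
from statistics import mean

# Hypothetical difficulty scores per instance, grouped by dataset.
scores = {
    "dataset_a": {"a1": 0.9, "a2": 0.2},
    "dataset_b": {"b1": 0.4, "b2": 0.1},
}

# Dataset-level: order whole datasets by mean difficulty, easiest first.
dataset_order = sorted(scores, key=lambda name: mean(scores[name].values()))

# Instance-level: pool all instances and order them individually.
pooled = {inst: s for group in scores.values() for inst, s in group.items()}
instance_order = sorted(pooled, key=pooled.get)

print(dataset_order)   # ['dataset_b', 'dataset_a']
print(instance_order)  # ['b2', 'a2', 'b1', 'a1']
```

    In the instance-level ordering, instances from different datasets interleave, which the dataset-level ordering cannot express.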
    Incremental Learning with Differentiable Architecture and Forgetting Search. (arXiv:2205.09875v1 [cs.LG])
    As progress is made on training machine learning models on incrementally expanding classification tasks (i.e., incremental learning), a next step is to translate this progress to industry expectations. One technique missing from incremental learning is automatic architecture design via Neural Architecture Search (NAS). In this paper, we show that leveraging NAS for incremental learning results in strong performance gains for classification tasks. Specifically, we contribute the following: first, we create a strong baseline approach for incremental learning based on Differentiable Architecture Search (DARTS) and state-of-the-art incremental learning strategies, outperforming many existing strategies trained with similar-sized popular architectures; second, we extend the idea of architecture search to regularize architecture forgetting, boosting performance past our proposed baseline. We evaluate our method on both RF signal and image classification tasks, and demonstrate we can achieve up to a 10% performance increase over state-of-the-art methods. Most importantly, our contribution enables learning from continuous distributions on real-world application data for which the complexity of the data distribution is unknown, or the modality less explored (such as RF signal classification).  ( 2 min )
    Predicting electrode array impedance after one month from cochlear implantation surgery. (arXiv:2205.10021v1 [cs.LG])
    Sensorineural hearing loss can be treated with cochlear implantation. After surgery, electrode array impedance measurements can be used to check the stability of the impedance value and the dynamic range. Increased impedance values can cause speech recognition scores to deteriorate. Clinicians typically repeat these measurements many times during the year after surgery. Predicting the electrode impedance could support decisions that help the patient hear better. In this research, we used a dataset of 80 pediatric patients who underwent cochlear implantation with the MED-EL FLEX28 electrode array of 12 channels. We predicted the electrode impedance on each channel one month after the surgery date. We used different machine learning algorithms, such as neural networks and decision trees. Our results indicate that the electrode impedance can be predicted and that the best algorithm differs by electrode channel. The accuracy also varies between 66% and 100% depending on the electrode channel when accepting an error range between 0 and 3 kΩ. Further research is required to predict the electrode impedance after three months, six months, and one year.  ( 2 min )
    Neural-Symbolic Models for Logical Queries on Knowledge Graphs. (arXiv:2205.10128v1 [cs.AI])
    Answering complex first-order logic (FOL) queries on knowledge graphs is a fundamental task for multi-hop reasoning. Traditional symbolic methods traverse a complete knowledge graph to extract the answers, which provides good interpretation for each step. Recent neural methods learn geometric embeddings for complex queries. These methods can generalize to incomplete knowledge graphs, but their reasoning process is hard to interpret. In this paper, we propose Graph Neural Network Query Executor (GNN-QE), a neural-symbolic model that enjoys the advantages of both worlds. GNN-QE decomposes a complex FOL query into relation projections and logical operations over fuzzy sets, which provides interpretability for intermediate variables. To reason about the missing links, GNN-QE adapts a graph neural network from knowledge graph completion to execute the relation projections, and models the logical operations with product fuzzy logic. Extensive experiments on 3 datasets show that GNN-QE significantly improves over previous state-of-the-art models in answering FOL queries. Meanwhile, GNN-QE can predict the number of answers without explicit supervision, and provide visualizations for intermediate variables.  ( 2 min )
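    For readers unfamiliar with product fuzzy logic, the sketch below shows the standard operators on fuzzy sets represented as lists of membership degrees in [0, 1], one per entity. It illustrates the general technique only; the function names are chosen here, and how GNN-QE itself implements these operations is not stated in the abstract.

```python
def fuzzy_and(a, b):
    """Conjunction (product t-norm): elementwise a * b."""
    return [x * y for x, y in zip(a, b)]

def fuzzy_or(a, b):
    """Disjunction (probabilistic sum): elementwise a + b - a * b."""
    return [x + y - x * y for x, y in zip(a, b)]

def fuzzy_not(a):
    """Negation: elementwise 1 - a."""
    return [1.0 - x for x in a]

# Membership of three entities in two intermediate answer sets.
a = [1.0, 0.5, 0.0]
b = [0.5, 0.5, 1.0]

print(fuzzy_and(a, b))  # [0.5, 0.25, 0.0]
print(fuzzy_or(a, b))   # [1.0, 0.75, 1.0]
print(fuzzy_not(a))     # [0.0, 0.5, 1.0]
```

    Because every operator returns membership degrees in [0, 1], the intermediate sets of a multi-hop query stay interpretable as "how likely is each entity an answer so far".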
    Robust Expected Information Gain for Optimal Bayesian Experimental Design Using Ambiguity Sets. (arXiv:2205.09914v1 [stat.ML])
    The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze robust expected information gain (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a "log-sum-exp" stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.  ( 2 min )
    When, where, and how to add new neurons to ANNs. (arXiv:2202.08539v2 [cs.LG] UPDATED)
    Neurogenesis in ANNs is an understudied and difficult problem, even compared to other forms of structural learning like pruning. By decomposing it into triggers and initializations, we introduce a framework for studying the various facets of neurogenesis: when, where, and how to add neurons during the learning process. We present the Neural Orthogonality (NORTH*) suite of neurogenesis strategies, combining layer-wise triggers and initializations based on the orthogonality of activations or weights to dynamically grow performant networks that converge to an efficient size. We evaluate our contributions against other recent neurogenesis works across a variety of supervised learning tasks.  ( 2 min )
    Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization. (arXiv:2205.10217v1 [stat.ML])
    The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(N)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.  ( 2 min )
    KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. (arXiv:2205.09921v1 [cs.CL])
    Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets.  ( 2 min )
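    The claim that a constant offset is absorbed by the Softmax normalization rests on a general property: softmax(x + c) = softmax(x) for any constant c. The sketch below checks this numerically in plain Python. The logits and the offset are arbitrary illustrative values, not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.3, -1.2, 2.0, 0.5]  # arbitrary attention logits for one query
offset = 7.5                    # arbitrary constant added to every logit

plain = softmax(logits)
shifted = softmax([s + offset for s in logits])

# The two distributions agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(plain, shifted))
print(max_diff < 1e-12)  # True
```

    This is why adding the same constant to every kernel value, to turn a CPD kernel into a PD one, leaves the attention weights unchanged.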
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v1 [cs.LG])
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime which is only restricted locally in the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{mei2018mean, chizat2018global}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated nonlinearity of the training dynamics. This work establishes a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel logarithmic Sobolev inequality for two-layer neural networks, and uniform upper bounds on the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.  ( 2 min )
    An Artificial Neural Network Functionalized by Evolution. (arXiv:2205.10118v1 [cs.NE])
    The topology of artificial neural networks has a significant effect on their performance. Characterizing efficient topology is a field of promising research in Artificial Intelligence. However, it is not a trivial task, and it has mainly been explored with convolutional neural networks. We propose a hybrid model which combines the tensor calculus of feed-forward neural networks with Pseudo-Darwinian mechanisms. This allows for finding topologies that are well adapted for elaboration of strategies, control problems or pattern recognition tasks. In particular, the model can provide adapted topologies at early evolutionary stages, and 'structural convergence', which can find applications in robotics, big-data and artificial life.  ( 2 min )
    Topology-aware Graph Neural Networks for Learning Feasible and Adaptive ac-OPF Solutions. (arXiv:2205.10129v1 [eess.SY])
    Solving the optimal power flow (OPF) problem is a fundamental task to ensure system efficiency and reliability in real-time electricity grid operations. We develop a new topology-informed graph neural network (GNN) approach for predicting the optimal solutions of the real-time ac-OPF problem. To incorporate grid topology into the NN model, the proposed GNN-for-OPF framework innovatively exploits the locality property of locational marginal prices and voltage magnitude. Furthermore, we develop a physics-aware (ac-)flow feasibility regularization approach for general OPF learning. The advantages of our proposed designs include reduced model complexity, improved generalizability and feasibility guarantees. By providing an analytical understanding of the graph subspace stability under grid topology contingency, we show the proposed GNN can quickly adapt to varying grid topology by an efficient re-training strategy. Numerical tests on various test systems of different sizes have validated the prediction accuracy, improved flow feasibility, and topology adaptivity of our proposed GNN-based learning framework.  ( 2 min )
    Bounding the Effects of Continuous Treatments for Hidden Confounders. (arXiv:2204.11206v2 [stat.ME] UPDATED)
    Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders.
    Sparse Infinite Random Feature Latent Variable Modeling. (arXiv:2205.09909v1 [stat.ML])
    We propose a non-linear, Bayesian non-parametric latent variable model where the latent space is assumed to be sparse and infinite dimensional a priori using an Indian buffet process prior. A posteriori, the number of instantiated dimensions in the latent space is guaranteed to be finite. The purpose of placing the Indian buffet process on the latent variables is to: 1.) Automatically and probabilistically select the number of latent dimensions. 2.) Impose sparsity in the latent space, where the Indian buffet process will select which elements are exactly zero. Our proposed model allows for sparse, non-linear latent variable modeling where the number of latent dimensions is selected automatically. Inference is made tractable using the random Fourier approximation and we can easily implement posterior inference through Markov chain Monte Carlo sampling. This approach is amenable to many observation models beyond the Gaussian setting. We demonstrate the utility of our method on a variety of synthetic, biological and text datasets and show that we can obtain superior test set performance compared to previous latent variable models.
    RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk. (arXiv:2205.10004v1 [cs.LG])
    Failures and anomalies in large-scale software systems are unavoidable incidents. When an issue is detected, operators need to quickly and correctly identify its location to facilitate a swift repair. In this work, we consider the problem of identifying the root cause set that best explains an anomaly in multi-dimensional time series with categorical attributes. The huge search space is the main challenge: even for a small number of attributes and small value sets, the number of theoretical combinations is too large to brute force. Previous approaches have thus focused on reducing the search space, but they all suffer from various issues: requiring extensive manual parameter tuning, being too slow and thus impractical, or being incapable of finding more complex root causes. We propose RiskLoc to solve the problem of multidimensional root cause localization. RiskLoc applies a 2-way partitioning scheme and assigns element weights that linearly increase with the distance from the partitioning point. A risk score is assigned to each element that integrates two factors: 1) its weighted proportion within the abnormal partition, and 2) the relative change in the deviation score adjusted for the ripple effect property. Extensive experiments on multiple datasets verify the effectiveness and efficiency of RiskLoc, and for a comprehensive evaluation, we introduce three synthetically generated datasets that complement existing datasets. We demonstrate that RiskLoc consistently outperforms state-of-the-art baselines, especially in more challenging root cause scenarios, with gains in F1-score up to 57% over the second-best approach with comparable running times.
    Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning. (arXiv:2205.10047v1 [cs.LG])
    One of the major difficulties of reinforcement learning is learning from off-policy samples, which are collected by a different policy (behavior policy) from what the algorithm evaluates (the target policy). Off-policy learning needs to correct the distribution of the samples from the behavior policy towards that of the target policy. Unfortunately, importance sampling has an inherent high variance issue which leads to poor gradient estimation in policy gradient methods. We focus on an off-policy Actor-Critic architecture, and propose a novel method, called Preconditioned Proximal Policy Optimization (P3O), which can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective. This preconditioning uses the sigmoid function in a special way such that when there is no policy change, the gradient is maximal and hence policy gradient will drive a big parameter update for an efficient exploration of the parameter space. This is a novel exploration method that has not been studied before given that existing exploration methods are based on the novelty of states and actions. We compare with several best-performing algorithms on both discrete and continuous tasks and the results confirm that P3O is more off-policy than PPO according to the "off-policyness" measured by the DEON metric, and P3O explores in a larger policy space than PPO. Results also show that our P3O maximizes the CPI objective better than PPO during the training process.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v3 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to those obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Optimal Parameter-free Online Learning with Switching Cost. (arXiv:2205.06846v2 [cs.LG] UPDATED)
    Parameter-freeness in online learning refers to the adaptivity of an algorithm with respect to the optimal decision in hindsight. In this paper, we design such algorithms in the presence of switching cost - the latter penalizes the optimistic updates required by parameter-freeness, leading to a delicate design trade-off. Based on a novel dual space scaling strategy, we propose a simple yet powerful algorithm for Online Linear Optimization (OLO) with switching cost, which improves the existing suboptimal regret bound [ZCP22a] to the optimal rate. The obtained benefit is extended to the expert setting, and the practicality of our algorithm is demonstrated through a sequential investment task.
    A toolbox for idea generation and evaluation: Machine learning, data-driven, and contest-driven approaches to support idea generation. (arXiv:2205.09840v1 [cs.LG])
    The significance and abundance of data are increasing due to the growing digital data generated from social media, sensors, scholarly literature, patents, different forms of documents published online, databases, product manuals, etc. Various data sources can be used to generate ideas, yet, in addition to bias, the size of the available digital data is a major challenge when it comes to manual analysis. Hence, human-machine interaction is essential for generating valuable ideas where machine learning and data-driven techniques generate patterns from data and serve human sense-making. However, the use of machine learning and data-driven approaches to generate ideas is a relatively new area. Moreover, it is also possible to stimulate innovation using contest-driven idea generation and evaluation. The results and contributions of this thesis can be viewed as a toolbox of idea-generation techniques, including a list of data-driven and machine learning techniques with corresponding data sources and models to support idea generation. In addition, the results include two models, one method and one framework, to better support data-driven and contest-driven idea generation. The beneficiaries of these artefacts are practitioners in data and knowledge engineering, data mining project managers, and innovation agents. Innovation agents include incubators, contest organizers, consultants, innovation accelerators, and industries. Since the proposed artefacts consist of process models augmented with AI techniques, human-centred AI is a promising area of research that can contribute to the artefacts' further development and promote creativity.
    STaR: Bootstrapping Reasoning With Reasoning. (arXiv:2203.14465v2 [cs.LG] UPDATED)
    Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
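    The loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical stand-in for prompting a language model, `toy_generate` is a made-up toy model, and the fine-tuning step is left out entirely (only the data collection for one pass is shown).

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]         # (question, correct answer)
Rationale = Tuple[str, str, str]  # (question, rationale, answer)


def star_iteration(
    dataset: List[Example],
    generate: Callable[[str, Optional[str]], Tuple[str, str]],
) -> List[Rationale]:
    """Collect the rationales that one pass of the loop would fine-tune on.

    `generate(question, hint)` is a hypothetical model call returning
    (rationale, answer). `hint` is None on the first attempt and the
    correct answer on the retry.
    """
    kept: List[Rationale] = []
    for question, gold in dataset:
        rationale, answer = generate(question, None)
        if answer != gold:
            # Wrong answer: try again with the correct answer as a hint.
            rationale, answer = generate(question, gold)
        if answer == gold:
            # Keep only rationales that ended in the correct answer.
            kept.append((question, rationale, answer))
    return kept


def toy_generate(question: str, hint: Optional[str]) -> Tuple[str, str]:
    """Toy model (assumption): adds two integers, but is off by one
    for sums above 10 unless it is given the correct answer."""
    a, b = (int(x) for x in question.split("+"))
    if hint is not None:
        return f"given {hint}, {a} plus {b} is {hint}", hint
    guess = a + b if a + b <= 10 else a + b - 1
    return f"{a} plus {b} is {guess}", str(guess)


if __name__ == "__main__":
    data = [("2+3", "5"), ("7+8", "15")]
    for item in star_iteration(data, toy_generate):
        print(item)
```

    In a full loop, the collected rationales would be used to fine-tune the model, and the pass would then be repeated with the updated model.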
    Lossless Speedup of Autoregressive Translation with Generalized Aggressive Decoding. (arXiv:2203.16487v4 [cs.CL] UPDATED)
    Different from previous work accelerating translation at the cost of quality loss, we propose Generalized Aggressive Decoding (GAD) -- a novel decoding paradigm for lossless speedup of autoregressive translation, through the collaboration of autoregressive and non-autoregressive translation (NAT) of the Transformer. At each decoding iteration, GAD aggressively decodes a number of tokens with NAT as a draft and then verifies them in the autoregressive manner, where only the tokens that pass the verification are accepted as decoded tokens. GAD can achieve the same results as autoregressive translation but much more efficiently because both NAT drafting and autoregressive verification compute in parallel. We conduct experiments on four standard WMT benchmarks and confirm that the vanilla GAD yields exactly the same results as greedy decoding with an around $3\times$ speedup, and that its variant (GAD++) with an advanced verification strategy not only outperforms greedy translation and even achieves translation quality comparable to the beam search result, but also further improves the decoding speed, resulting in an around $5\times$ speedup over autoregressive translation. Moreover, GAD can be easily generalized for lossless speedup of other seq2seq tasks like Abstractive Summarization, and benefits more from stronger computing devices, demonstrating its potential to become a de facto decoding paradigm in the future. Our models and codes are available at https://github.com/hemingkx/GAD.
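    The draft-then-verify idea in this abstract can be illustrated with a small toy. The sketch below is an assumption-laden simplification, not the code from the linked repository: `draft` and `next_token` are hypothetical callables standing in for the NAT drafter and the greedy autoregressive model, and the verification is written as a sequential loop for readability, whereas the abstract states that verification is computed in parallel.

```python
from typing import Callable, List, Tuple


def greedy_decode(prefix: List[int], next_token: Callable[[List[int]], int],
                  eos: int, max_len: int) -> List[int]:
    """Reference decoder: one token per step. Assumes a non-empty prefix."""
    out = list(prefix)
    while out[-1] != eos and len(out) < max_len:
        out.append(next_token(out))
    return out


def draft_and_verify(prefix: List[int],
                     draft: Callable[[List[int], int], List[int]],
                     next_token: Callable[[List[int]], int],
                     block: int, eos: int, max_len: int) -> Tuple[List[int], int]:
    """Decode by drafting `block` tokens per round and keeping the verified ones.

    Returns (tokens, rounds). Each drafted token is compared with the greedy
    token; on the first mismatch the greedy token is kept and the rest of the
    draft is discarded, so the output matches greedy decoding.
    """
    out = list(prefix)
    rounds = 0
    while out[-1] != eos and len(out) < max_len:
        proposal = draft(out, block)
        if not proposal:
            raise ValueError("draft must propose at least one token")
        rounds += 1
        for tok in proposal:
            expected = next_token(out)  # a real system verifies all positions in one pass
            out.append(expected)
            if expected != tok or expected == eos or len(out) >= max_len:
                break
    return out, rounds


def toy_next_token(seq: List[int]) -> int:
    """Toy greedy model (assumption): counts upward by one."""
    return seq[-1] + 1


def toy_draft(seq: List[int], k: int) -> List[int]:
    """Toy drafter (assumption): right for two tokens, then wrong."""
    good = [seq[-1] + 1 + i for i in range(min(2, k))]
    return good + [0] * (k - len(good))


if __name__ == "__main__":
    tokens, rounds = draft_and_verify([0], toy_draft, toy_next_token, 3, 6, 20)
    print(tokens, rounds)
```

    In this toy run the drafted decoder reaches the same sequence as `greedy_decode` in fewer rounds than greedy steps, which mirrors the lossless-speedup claim only qualitatively.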
    Mosaic Zonotope Shadow Matching for Risk-Aware Autonomous Localization in Harsh Urban Environments. (arXiv:2205.10223v1 [cs.AI])
    Risk-aware urban localization with the Global Navigation Satellite System (GNSS) remains an unsolved problem with frequent misdetection of the user's street or side of the street. Significant advances in 3D map-aided GNSS use grid-based GNSS shadow matching alongside AI-driven line-of-sight (LOS) classifiers and server-based processing to improve localization accuracy, especially in the cross-street direction. Our prior work introduces a new paradigm for shadow matching that proposes set-valued localization with computationally efficient zonotope set representations. While existing literature improved accuracy and efficiency, the current state of shadow matching theory does not address the needs of risk-aware autonomous systems. We extend our prior work to propose Mosaic Zonotope Shadow Matching (MZSM) that employs a classifier-agnostic polytope mosaic architecture to provide risk-awareness and certifiable guarantees on urban positioning. We formulate a recursively expanding binary tree that refines an initial location estimate with set operations into smaller polytopes. Together, the smaller polytopes form a mosaic. We weight the tree branches with the probability that the user is in line of sight of the satellite and expand the tree with each new satellite observation. Our method yields an exact shadow matching distribution from which we guarantee uncertainty bounds on the user localization. We perform high-fidelity simulations using a 3D building map of San Francisco to validate our algorithm's risk-aware improvements. We demonstrate that MZSM provides certifiable guarantees across varied data-driven LOS classifier accuracies and yields a more precise understanding of the uncertainty over existing methods. We validate that our tree-based construction is efficient and tractable, computing a mosaic from 14 satellites in 0.63 seconds and growing quadratically in the satellite number.
    An alternative proof of the vulnerability of retrieval in high intrinsic dimensionality neighborhood. (arXiv:2010.00990v2 [cs.LG] UPDATED)
    This paper investigates the vulnerability of the nearest neighbors search, which is a pivotal tool in data analysis and machine learning. The vulnerability is gauged as the relative amount of perturbation that an attacker needs to add onto a dataset point in order to modify its neighbor rank w.r.t. a query. The statistical distribution of this quantity is derived from simple assumptions. Experiments on six large scale datasets validate this model up to some outliers which are explained in terms of violations of the assumptions.
    Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits. (arXiv:2205.09899v1 [stat.ML])
    We prove an instance independent (poly) logarithmic regret for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ is shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\mathrm{polylog}(T)$. We propose Low Regret Stochastic Contextual Bandits (LR-SCB), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. LR-SCB works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly) logarithmic regret of LR-SCB stems from two crucial facts: (a) the application of a norm adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incur a regret that scales with $\mathrm{polylog}(T)$.
    Is explainable AI a race against model complexity?. (arXiv:2205.10119v1 [cs.AI])
    Explaining the behaviour of intelligent systems will get increasingly and perhaps intractably challenging as models grow in size and complexity. We may not be able to expect an explanation for every prediction made by a brain-scale model, nor can we expect explanations to remain objective or apolitical. Our functionalist understanding of these models is of less advantage than we might assume. Models precede explanations, and can be useful even when both model and explanation are incorrect. Explainability may never win the race against complexity, but this is less problematic than it seems.
    Nothing makes sense in deep learning, except in the light of evolution. (arXiv:2205.10320v1 [cs.LG])
    Deep Learning (DL) is a surprisingly successful branch of machine learning. The success of DL is usually explained by focusing analysis on a particular recent algorithm and its traits. Instead, we propose that an explanation of the success of DL must look at the population of all algorithms in the field and how they have evolved over time. We argue that cultural evolution is a useful framework to explain the success of DL. In analogy to biology, we use `development' to mean the process converting the pseudocode or text description of an algorithm into a fully trained model. This includes writing the programming code, compiling and running the program, and training the model. If all parts of the process don't align well then the resultant model will be useless (if the code runs at all!). This is a constraint. A core component of evolutionary developmental biology is the concept of deconstraints -- these are modifications to the developmental process that avoid complete failure by automatically accommodating changes in other components. We suggest that many important innovations in DL, from neural networks themselves to hyperparameter optimization and AutoGrad, can be seen as developmental deconstraints. These deconstraints can be very helpful to both the particular algorithm in how it handles challenges in implementation and the overall field of DL in how easy it is for new ideas to be generated. We highlight how our perspective can both advance DL and lead to new insights for evolutionary biology.
    On the Representation Collapse of Sparse Mixture of Experts. (arXiv:2204.09179v2 [cs.CL] UPDATED)
    Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
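    One way to read "routing scores on a low-dimensional hypersphere" is as cosine similarity between a projected token and expert embeddings, with both L2-normalized. The sketch below assumes exactly that reading; the projection matrix, expert embeddings, and function names are hypothetical, and details such as temperature scaling or top-k gating are omitted.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length (a point on the hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(matrix: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Matrix-vector product: map the hidden state to a lower dimension."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def route(hidden: Sequence[float],
          projection: Sequence[Sequence[float]],
          expert_embeddings: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index of best-matching expert, cosine routing scores)."""
    z = l2_normalize(project(projection, hidden))
    scores = [
        sum(a * b for a, b in zip(z, l2_normalize(e)))
        for e in expert_embeddings
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]                           # 4-dim token state
    projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # 4 -> 2 dims
    experts = [[10.0, 0.0], [0.0, 0.1]]                     # very different norms
    print(route(hidden, projection, experts))
```

    Because both sides are normalized, the expert with the larger embedding norm gains no advantage; only the direction matters in this sketch.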
    Towards a Holistic View on Argument Quality Prediction. (arXiv:2205.09803v1 [cs.CL])
    Argumentation is one of society's foundational pillars, and, sparked by advances in NLP and the vast availability of text data, automated mining of arguments receives increasing attention. A decisive property of arguments is their strength or quality. While there are works on the automated estimation of argument strength, their scope is narrow: they focus on isolated datasets and neglect the interactions with related argument mining tasks, such as argument identification, evidence detection, or emotional appeal. In this work, we close this gap by approaching argument quality estimation from multiple different angles: Grounded on rich results from thorough empirical evaluations, we assess the generalization capabilities of argument quality estimation across diverse domains, the interplay with related argument mining tasks, and the impact of emotions on perceived argument strength. We find that generalization depends on a sufficient representation of different domains in the training part. In zero-shot transfer and multi-task experiments, we reveal that argument quality is among the more challenging tasks but can improve others. Finally, we show that emotions play a smaller role in argument quality than is often assumed.
    A Survey of Trustworthy Graph Learning: Reliability, Explainability, and Privacy Protection. (arXiv:2205.10014v1 [cs.LG])
    Deep graph learning has achieved remarkable progress in both business and scientific areas ranging from finance and e-commerce, to drug and advanced material discovery. Despite this progress, how to ensure various deep graph learning algorithms behave in a socially responsible manner and meet regulatory compliance requirements becomes an emerging problem, especially in risk-sensitive domains. Trustworthy graph learning (TwGL) aims to solve the above problems from a technical viewpoint. In contrast to conventional graph learning research which mainly cares about model performance, TwGL considers various reliability and safety aspects of the graph learning framework including but not limited to robustness, explainability, and privacy. In this survey, we provide a comprehensive review of recent leading approaches in the TwGL field from three dimensions, namely, reliability, explainability, and privacy protection. We give a general categorization for existing work and review typical work for each category. To give further insights for TwGL research, we provide a unified view to inspect previous works and build the connection between them. We also point out some important open problems remaining to be solved in the future developments of TwGL.
    Real Time Multi-Object Detection for Helmet Safety. (arXiv:2205.09878v1 [cs.CV])
    The National Football League and Amazon Web Services teamed up to develop the best sports injury surveillance and mitigation program via the Kaggle competition, through which the NFL wants to assign specific players to each helmet, which would help accurately identify each player's "exposures" throughout a football play. We aim to implement computer-vision-based ML algorithms capable of assigning detected helmet impacts to the correct players via tracking information. Our paper explains the approach to automatically track player helmets and their collisions. This will also allow them to review previous plays and explore the trends in exposure over time.
    On Tackling Explanation Redundancy in Decision Trees. (arXiv:2205.09971v1 [cs.AI])
    Decision trees (DTs) epitomize the ideal of interpretability of machine learning (ML) models. The interpretability of decision trees motivates explainability approaches by so-called intrinsic interpretability, and it is at the core of recent proposals for applying interpretable ML models in high-risk applications. The belief in DT interpretability is justified by the fact that explanations for DT predictions are generally expected to be succinct. Indeed, in the case of DTs, explanations correspond to DT paths. Since decision trees are ideally shallow, and so paths contain far fewer features than the total number of features, explanations in DTs are expected to be succinct, and hence interpretable. This paper offers both theoretical and experimental arguments demonstrating that, as long as interpretability of decision trees equates with succinctness of explanations, then decision trees ought not be deemed interpretable. The paper introduces logically rigorous path explanations and path explanation redundancy, and proves that there exist functions for which decision trees must exhibit paths with arbitrarily large explanation redundancy. The paper also proves that only a very restricted class of functions can be represented with DTs that exhibit no explanation redundancy. In addition, the paper includes experimental results substantiating that path explanation redundancy is observed ubiquitously in decision trees, including those obtained using different tree learning algorithms, but also in a wide range of publicly available decision trees. The paper also proposes polynomial-time algorithms for eliminating path explanation redundancy, which in practice require negligible time to compute. Thus, these algorithms serve to indirectly attain irreducible, and so succinct, explanations for decision trees.
    Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores. (arXiv:2202.01748v2 [stat.ME] UPDATED)
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v3 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
    Permutation Predictions for Non-Clairvoyant Scheduling. (arXiv:2202.10199v2 [cs.DS] UPDATED)
    In non-clairvoyant scheduling, the task is to find an online strategy for scheduling jobs with a priori unknown processing requirements with the objective to minimize the total (weighted) completion time. We revisit this well-studied problem in a recently popular learning-augmented setting that integrates (untrusted) predictions in online algorithm design. While previous works used predictions on processing requirements, we propose a new prediction model, which provides a relative order of jobs which could be seen as predicting algorithmic actions rather than parts of the unknown input. We show that these predictions have desired properties, admit a natural error measure as well as algorithms with strong performance guarantees and that they are learnable in both theory and practice. We generalize the algorithmic framework proposed in the seminal paper by Kumar et al. (NeurIPS'18) and present the first learning-augmented scheduling results for weighted jobs and unrelated machines. We demonstrate in empirical experiments the practicability and superior performance compared to the previously suggested single-machine algorithms.
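    To make the objective concrete, the sketch below computes the total completion time on a single machine when jobs are run to completion in a given order, and compares a predicted permutation against the shortest-job-first order. It is only an illustration of how a permutation prediction determines cost: the function names are hypothetical, the jobs are unweighted, and the paper's algorithms (which must remain robust to bad predictions) are not reproduced here.

```python
from typing import List, Sequence


def total_completion_time(processing: Sequence[float], order: Sequence[int]) -> float:
    """Sum of completion times when jobs run back to back in `order`."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += processing[job]  # this job finishes at time `clock`
        total += clock
    return total


def shortest_first(processing: Sequence[float]) -> List[int]:
    """Clairvoyant benchmark: order jobs by increasing processing time."""
    return sorted(range(len(processing)), key=lambda j: processing[j])


if __name__ == "__main__":
    processing = [3.0, 1.0, 2.0]   # unknown to the scheduler in the real setting
    predicted_order = [0, 1, 2]    # hypothetical (imperfect) permutation prediction
    print(total_completion_time(processing, predicted_order))
    print(total_completion_time(processing, shortest_first(processing)))
```

    The gap between the two totals gives an intuition for what an imperfect predicted order costs when it is followed blindly.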
    Function Regression using Spiking DeepONet. (arXiv:2205.10130v1 [cs.NE])
    One of the main broad applications of deep learning is function regression. However, despite their demonstrated accuracy and robustness, modern neural network architectures require heavy computational resources to train. One method to mitigate or even resolve this inefficiency has been to draw further inspiration from the brain and reformulate the learning process in a more biologically-plausible way, developing what are known as Spiking Neural Networks (SNNs), which have been gaining traction in recent years. In this paper we present an SNN-based method to perform regression, which has been a challenge due to the inherent difficulty in representing a function's input domain and continuous output values as spikes. We use a DeepONet, a neural network designed to learn operators, to learn the behavior of spikes. Then, we use this approach to do function regression. We propose several methods to use a DeepONet in the spiking framework, and present accuracy and training time for different benchmarks.
    Millimeter Wave Localization with Imperfect Training Data using Shallow Neural Networks. (arXiv:2112.05008v2 [cs.NI] UPDATED)
    Millimeter wave (mmWave) localization algorithms exploit the quasi-optical propagation of mmWave signals, which yields sparse angular spectra at the receiver. Geometric approaches to angle-based localization typically require knowing the map of the environment and the location of the access points. Thus, several works have resorted to automated learning in order to infer a device's location from the properties of the received mmWave signals. However, collecting training data for such models is a significant burden. In this work, we propose a shallow neural network model to localize mmWave devices indoors. This model requires significantly fewer weights than those proposed in the literature. Therefore, it is amenable to implementation in resource-constrained hardware, and needs fewer training samples to converge. We also propose to relieve training data collection efforts by retrieving (inherently imperfect) location estimates from geometry-based mmWave localization algorithms. Even in this case, our results show that the proposed neural networks perform as well as or better than state-of-the-art algorithms.
    Beyond Labels: Visual Representations for Bone Marrow Cell Morphology Recognition. (arXiv:2205.09880v1 [cs.CV])
    Analyzing and inspecting bone marrow cell cytomorphology is a critical but highly complex and time-consuming component of hematopathology diagnosis. Recent advancements in artificial intelligence have paved the way for the application of deep learning algorithms to complex medical tasks. Nevertheless, there are many challenges in applying effective learning algorithms to medical image analysis, such as the lack of sufficient and reliably annotated training datasets and the highly class-imbalanced nature of most medical data. Here, we improve on the state-of-the-art methodologies of bone marrow cell recognition by deviating from sole reliance on labeled data and leveraging self-supervision in training our learning models. We investigate our approach's effectiveness in identifying bone marrow cell types. Our experiments demonstrate significant performance improvements in conducting different bone marrow cell recognition tasks compared to the current state-of-the-art methodologies.
    Towards efficient feature sharing in MIMO architectures. (arXiv:2205.10139v1 [cs.LG])
    Multi-input multi-output architectures propose to train multiple subnetworks within one base network and then average the subnetwork predictions to benefit from ensembling for free. Despite some relative success, these architectures are wasteful in their use of parameters. Indeed, we highlight in this paper that the learned subnetworks fail to share even generic features, which limits their applicability on smaller mobile and AR/VR devices. We posit this behavior stems from an ill-posed part of the multi-input multi-output framework. To solve this issue, we propose a novel unmixing step in MIMO architectures that allows subnetworks to properly share features. Preliminary experiments on CIFAR-100 show our adjustments allow feature sharing and improve model performance for small architectures.
    WALNUT: A Benchmark on Weakly Supervised Learning for Natural Language Understanding. (arXiv:2108.12603v2 [cs.CL] UPDATED)
    Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been proven valuable when large amounts of labeled data are unavailable or expensive to obtain. Existing works studying weak supervision for NLU either mostly focus on a specific task or simulate weak supervision signals from ground-truth labels. It is thus hard to compare different approaches and evaluate the benefit of weak supervision without access to a unified and systematic benchmark with diverse tasks and real-world weak labeling rules. In this paper, we propose such a benchmark, named WALNUT (semi-WeAkly supervised Learning for Natural language Understanding Testbed), to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks with different types, including document-level and token-level prediction tasks. WALNUT is the first semi-weakly supervised learning benchmark for NLU, where each task contains weak labels generated by multiple real-world weak sources, together with a small set of clean labels. We conduct baseline evaluations on WALNUT to systematically evaluate the effectiveness of various weak supervision methods and model architectures. Our results demonstrate the benefit of weak supervision for low-resource NLU tasks and highlight interesting patterns across tasks. We expect WALNUT to stimulate further research on methodologies to leverage weak supervision more effectively. The benchmark and code for baselines are available at aka.ms/walnut_benchmark.
    HyBNN and FedHyBNN: (Federated) Hybrid Binary Neural Networks. (arXiv:2205.09839v1 [cs.LG])
    Binary Neural Networks (BNNs), neural networks with weights and activations constrained to -1(0) and +1, are an alternative to deep neural networks which offer faster training, lower memory consumption and lightweight models, ideal for use in resource constrained devices while being able to utilize the architecture of their deep neural network counterpart. However, the input binarization step used in BNNs causes a severe accuracy loss. In this paper, we introduce a novel hybrid neural network architecture, Hybrid Binary Neural Network (HyBNN), consisting of a task-independent, general, full-precision variational autoencoder with a binary latent space and a task specific binary neural network that is able to greatly limit the accuracy loss due to input binarization by using the full precision variational autoencoder as a feature extractor. We use it to combine the state-of-the-art accuracy of deep neural networks with the much faster training time, quicker test-time inference and power efficiency of binary neural networks. We show that our proposed system is able to very significantly outperform a vanilla binary neural network with input binarization. We also introduce FedHyBNN, a highly communication efficient federated counterpart to HyBNN and demonstrate that it is able to reach the same accuracy as its non-federated equivalent. We make our source code, experimental parameters and models available at: https://anonymous.4open.science/r/HyBNN.
    MaskGAE: Masked Graph Modeling Meets Graph Autoencoders. (arXiv:2205.10053v1 [cs.LG])
    We present masked graph autoencoder (MaskGAE), a self-supervised learning framework for graph-structured data. Different from previous graph autoencoders (GAEs), MaskGAE adopts masked graph modeling (MGM) as a principled pretext task: masking a portion of edges and attempting to reconstruct the missing part with partially visible, unmasked graph structure. To understand whether MGM can help GAEs learn better representations, we provide both theoretical and empirical evidence to justify the benefits of this pretext task. Theoretically, we establish the connections between GAEs and contrastive learning, showing that MGM significantly improves the self-supervised learning scheme of GAEs. Empirically, we conduct extensive experiments on a number of benchmark datasets, demonstrating the superiority of MaskGAE over several state-of-the-art methods on both link prediction and node classification tasks. Our code is publicly available at https://github.com/EdisonLeeeee/MaskGAE.
    Stochastic resonance neurons in artificial neural networks. (arXiv:2205.10122v1 [cs.NE])
    Many modern applications of artificial neural networks entail a large number of layers, making traditional digital implementations increasingly complex. Optical neural networks offer parallel processing at high bandwidth, but face the challenge of noise accumulation. We propose here a new type of neural network that uses stochastic resonances as an inherent part of the architecture, and demonstrate the possibility of a significant reduction in the number of neurons required for a given performance accuracy. We also show that such a neural network is more robust against the impact of noise.
    A Rule Search Framework for the Early Identification of Chronic Emergency Homeless Shelter Clients. (arXiv:2205.09883v1 [cs.CY])
    This paper uses rule search techniques for the early identification of emergency homeless shelter clients who are at risk of becoming long term or chronic shelter users. Using a data set from a major North American shelter containing 12 years of service interactions with over 40,000 individuals, the optimized pruning for unordered search (OPUS) algorithm is used to develop rules that are both intuitive and effective. The rules are evaluated within a framework compatible with the real-time delivery of a housing program meant to transition high risk clients to supportive housing. Results demonstrate that the median time to identification of clients at risk of chronic shelter use drops from 297 days to 162 days when the methods in this paper are applied.
    On the Prediction Instability of Graph Neural Networks. (arXiv:2205.10070v1 [cs.LG])
    Instability of trained models, i.e., the dependence of individual node predictions on random factors, can affect reproducibility, reliability, and trust in machine learning systems. In this paper, we systematically assess the prediction instability of node classification with state-of-the-art Graph Neural Networks (GNNs). With our experiments, we establish that multiple instantiations of popular GNN models trained on the same data with the same model hyperparameters result in almost identical aggregated performance but display substantial disagreement in the predictions for individual nodes. We find that up to one third of the incorrectly classified nodes differ across algorithm runs. We identify correlations between hyperparameters, node properties, and the size of the training set with the stability of predictions. In general, maximizing model performance implicitly also reduces model instability.
    Self-Paced Multi-Agent Reinforcement Learning. (arXiv:2205.10016v1 [cs.AI])
    Curriculum reinforcement learning (CRL) aims to speed up learning of a task by gradually changing the difficulty of the task from easy to hard through control of factors such as initial state or environment dynamics. While automating CRL is well studied in the single-agent setting, in multi-agent reinforcement learning (MARL) an open question is whether controlling the number of agents together with other factors in a principled manner is beneficial; prior approaches typically rely on hand-crafted heuristics. In addition, how the tasks evolve as the number of agents changes remains understudied, which is critical for scaling to more challenging tasks. We introduce self-paced MARL (SPMARL), which enables optimizing the number of agents together with other environment factors in a principled way, and show that usual assumptions, such as that fewer agents always make the task easier, are not generally valid. The curriculum induced by SPMARL reveals the evolution of tasks w.r.t. the number of agents, and experiments show that SPMARL improves performance when the number of agents sufficiently influences task difficulty.
    HCMD-zero: Learning Value Aligned Mechanisms from Data. (arXiv:2202.10122v2 [cs.MA] UPDATED)
    Artificial learning agents are mediating a larger and larger number of interactions among humans, firms, and organizations, and the intersection between mechanism design and machine learning has been heavily investigated in recent years. However, mechanism design methods often make strong assumptions on how participants behave (e.g. rationality), on the kind of knowledge designers have access to a priori (e.g. access to strong baseline mechanisms), or on what the goal of the mechanism should be (e.g. total welfare). Here we introduce HCMD-zero, a general purpose method to construct mechanisms making none of these three assumptions. HCMD-zero learns to mediate interactions among participants and adjusts the mechanism parameters to make itself more likely to be preferred by participants. It does so by remaining engaged in an electoral contest with copies of itself, thereby accessing direct feedback from participants. We test our method on a stylized resource allocation game that highlights the tension between productivity, equality and the temptation to free ride. HCMD-zero produces a mechanism that is preferred by human participants over a strong baseline, it does so automatically, without requiring prior knowledge, and using human behavioral trajectories sparingly and effectively. Our analysis shows HCMD-zero consistently makes the mechanism policy more and more likely to be preferred by human participants over the course of training, and that it results in a mechanism with an interpretable and intuitive policy.
    How to Minimize the Weighted Sum AoI in Multi-Source Status Update Systems: OMA or NOMA? (arXiv:2205.03143v2 [cs.IT] UPDATED)
    In this paper, the minimization of the weighted sum average age of information (AoI) in a multi-source status update communication system is studied. Multiple independent sources send update packets to a common destination node in a time-slotted manner under a limit on the maximum number of retransmission rounds. Different multiple access schemes, i.e., orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA), are exploited here over a block-fading multiple access channel (MAC). Constrained Markov decision process (CMDP) problems are formulated to describe the AoI minimization problems considering both transmission schemes. The Lagrangian method is utilised to convert the CMDP problems to unconstrained Markov decision process (MDP) problems, and corresponding algorithms to derive the power allocation policies are obtained. For the case of unknown environments, two online reinforcement learning approaches considering both multiple access schemes are proposed to achieve near-optimal age performance. Numerical simulations validate the improvement of the proposed policy in terms of weighted sum AoI compared to the fixed power transmission policy, and illustrate that NOMA is more favorable for larger packet sizes.
    Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence. (arXiv:2204.11689v1 [physics.plasm-ph] CROSS LISTED)
    We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model is found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with an observed broadening in the distribution of turbulent field amplitudes and increased ${\bf E \times B}$ shearing rates.
    DDDM: a Brain-Inspired Framework for Robust Classification. (arXiv:2205.10117v1 [cs.NE])
    Despite their outstanding performance in a broad spectrum of real-world tasks, deep artificial neural networks are sensitive to input noise, particularly adversarial perturbations. In contrast, human and animal brains are much less vulnerable. Unlike the one-shot inference performed by most deep neural networks, the brain often solves decision-making with an evidence accumulation mechanism that may trade time for accuracy when facing noisy inputs. The mechanism is well described by the Drift-Diffusion Model (DDM). In the DDM, decision-making is modeled as a process in which noisy evidence is accumulated toward a threshold. Drawing inspiration from the DDM, we propose the Dropout-based Drift-Diffusion Model (DDDM), which combines test-phase dropout and the DDM to improve the robustness of arbitrary neural networks. Dropout creates temporally uncorrelated noise in the network that counters perturbations, while the evidence accumulation mechanism guarantees a reasonable decision accuracy. Neural networks enhanced with the DDDM and tested on image, speech, and text classification tasks all significantly outperform their native counterparts, demonstrating the DDDM as a task-agnostic defense against adversarial attacks.
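    The evidence-accumulation idea in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy linear classifier, the function names `stochastic_logits` and `ddm_decide`, the dropout rate, the threshold, and the choice of "probability above chance" as the evidence signal are all hypothetical.

```python
import numpy as np

def stochastic_logits(x, weights, rng, drop_rate=0.5):
    """One noisy forward pass: test-phase dropout on the input features
    of a toy linear classifier (hypothetical stand-in for a real network)."""
    mask = rng.random(x.shape) >= drop_rate      # keep each feature with prob 1 - drop_rate
    x_dropped = x * mask / (1.0 - drop_rate)     # inverted-dropout rescaling
    return weights @ x_dropped                   # class scores, shape (n_classes,)

def ddm_decide(x, weights, threshold=5.0, max_steps=200, seed=0):
    """Accumulate per-class evidence over repeated dropout passes until one
    class crosses the threshold or the step budget is exhausted."""
    rng = np.random.default_rng(seed)
    evidence = np.zeros(weights.shape[0])
    for step in range(1, max_steps + 1):
        logits = stochastic_logits(x, weights, rng)
        probs = np.exp(logits - logits.max())    # numerically stable softmax
        probs /= probs.sum()
        # Assumed evidence signal: how far each class probability exceeds chance
        evidence += probs - 1.0 / len(probs)
        if evidence.max() >= threshold:
            break
    return int(evidence.argmax()), step

# Usage: two classes, four features; class 0 is favoured by construction.
weights = np.array([[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]])
label, steps = ddm_decide(np.ones(4), weights)
print(label, steps)
```

    A higher threshold trades more forward passes for a more stable decision, which mirrors the time-for-accuracy trade-off the abstract attributes to the DDM.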
    Classifying Human Activities using Machine Learning and Deep Learning Techniques. (arXiv:2205.10325v1 [cs.LG])
    Human Activity Recognition (HAR) describes a machine's ability to recognize human actions. Nowadays, many people are health conscious and are therefore interested in tracking their daily activities using smartphones or smart watches, which can help them manage their daily routines in a healthy way. With this objective, Kaggle has conducted a competition to classify 6 different human activities based on the inertial signals obtained from 30 volunteers' smartphones. The main challenge in HAR is to overcome the difficulties of separating human activities based on the given data such that no two activities overlap. In this experiment, data visualization is first performed on expert-generated features with the help of t-distributed Stochastic Neighbor Embedding (t-SNE), followed by applying various Machine Learning techniques such as Logistic Regression, Linear SVC, Kernel SVM, and Decision Trees to better classify the 6 distinct human activities. Moreover, Deep Learning techniques such as Long Short-Term Memory (LSTM), Bi-Directional LSTM, Recurrent Neural Network (RNN), and Gated Recurrent Unit (GRU) are trained using raw time series data. Finally, metrics such as accuracy, confusion matrix, precision, and recall are used to evaluate the performance of the Machine Learning and Deep Learning models. Experimental results show that the Linear Support Vector Classifier among the Machine Learning models and the Gated Recurrent Unit among the Deep Learning models provided better accuracy for human activity recognition than the other classifiers.
    Hybrid Transfer in Deep Reinforcement Learning for Ads Allocation. (arXiv:2204.11589v2 [cs.IR] UPDATED)
    Ads allocation, which involves allocating ads and organic items to limited slots in a feed with the purpose of maximizing platform revenue, has become a research hotspot. Note that e-commerce platforms usually have multiple entrances for different categories, and some entrances have few visits. Data from these entrances has low coverage, which makes it difficult for the agent to learn. To address this challenge, we propose Similarity-based Hybrid Transfer for Ads Allocation (SHTAA), which effectively transfers samples as well as knowledge from a data-rich entrance to a data-poor entrance. Specifically, we define an uncertainty-aware similarity for MDPs to estimate the similarity of the MDPs for different entrances. Based on this similarity, we design a hybrid transfer method, including instance transfer and strategy transfer, to efficiently transfer samples and knowledge from one entrance to another. Both offline and online experiments on the Meituan food delivery platform demonstrate that the proposed method achieves better performance for the data-poor entrance and increases the revenue for the platform.
    Can Foundation Models Wrangle Your Data? (arXiv:2205.09911v1 [cs.LG])
    Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast three data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and temporal data, and opportunities to make data driven systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
    Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting. (arXiv:2205.10234v1 [cs.CL])
    Inpatient violence is a common and severe problem within psychiatry. Knowing who might become violent can influence staffing levels and mitigate severity. Predictive machine learning models can assess each patient's likelihood of becoming violent based on clinical notes. Yet, while machine learning models benefit from having more data, data availability is limited as hospitals typically do not share their data for privacy preservation. Federated Learning (FL) can overcome the problem of data limitation by training models in a decentralised manner, without disclosing data between collaborators. However, although several FL approaches exist, none of these train Natural Language Processing models on clinical notes. In this work, we investigate the application of Federated Learning to clinical Natural Language Processing, applied to the task of Violence Risk Assessment by simulating a cross-institutional psychiatric setting. We train and compare four models: two local models, a federated model and a data-centralised model. Our results indicate that the federated model outperforms the local models and performs similarly to the data-centralised model. These findings suggest that Federated Learning can be used successfully in a cross-institutional setting and is a step towards new applications of Federated Learning based on clinical notes.
    Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations. (arXiv:2203.09590v3 [cs.CL] UPDATED)
    Within the emerging research efforts to combine structured and unstructured knowledge, many approaches incorporate factual knowledge, e.g., available in the form of structured knowledge graphs (KGs), into pre-trained language models (PLMs) and then apply the knowledge-enhanced PLMs to downstream NLP tasks. However, (1) they typically only consider static factual knowledge, whereas, e.g., knowledge graphs (KGs) also contain temporal facts or events indicating evolutionary relationships among entities at different timestamps. (2) PLMs cannot be directly applied to many KG tasks, such as temporal KG completion. In this paper, we focus on enhancing temporal knowledge embeddings with contextualized language representations (ECOLA). We align structured knowledge, contained in temporal knowledge graphs, with their textual descriptions extracted from news articles, and propose a novel knowledge-text prediction task to inject the abundant information from descriptions into temporal knowledge embeddings. ECOLA jointly optimizes the knowledge-text prediction objective and the temporal knowledge embeddings, which can simultaneously take full advantage of textual and knowledge information. The proposed fusion method is model-agnostic and can be combined with potentially any temporal KG model. For training ECOLA, we introduce three temporal KG datasets with aligned textual descriptions. Experimental results on the temporal knowledge graph completion task show that ECOLA outperforms state-of-the-art temporal KG models by a large margin. The proposed datasets can serve as new temporal KG benchmarks and facilitate future research on structured and unstructured knowledge integration.
    Learning a Large Neighborhood Search Algorithm for Mixed Integer Programs. (arXiv:2107.10201v3 [math.OC] UPDATED)
    Large Neighborhood Search (LNS) is a combinatorial optimization heuristic that starts with an assignment of values for the variables to be optimized, and iteratively improves it by searching a large neighborhood around the current assignment. In this paper we consider a learning-based LNS approach for mixed integer programs (MIPs). We train a Neural Diving model to represent a probability distribution over assignments, which, together with an off-the-shelf MIP solver, generates an initial assignment. Formulating the subsequent search steps as a Markov Decision Process, we train a Neural Neighborhood Selection policy to select a search neighborhood at each step, which is searched using a MIP solver to find the next assignment. The policy network is trained using imitation learning. We propose a target policy for imitation that, given enough compute resources, is guaranteed to select the neighborhood containing the optimal next assignment amongst all possible choices for the neighborhood of a specified size. Our approach matches or outperforms all the baselines on five real-world MIP datasets with large-scale instances from diverse applications, including two production applications at Google. It achieves $2\times$ to $37.8\times$ better average primal gap than the best baseline on three of the datasets at large running times.
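    The generic LNS loop described in the abstract above (start from an assignment, repeatedly pick a neighborhood, re-optimize inside it) can be sketched on a toy binary problem. Everything here is an assumption made for illustration: the function name `lns_minimize` is hypothetical, uniform random sampling stands in for the learned Neural Neighborhood Selection policy, an all-zeros start stands in for Neural Diving, and brute-force enumeration stands in for the MIP solver.

```python
import itertools
import random

def lns_minimize(cost, n_vars, neighborhood_size=3, n_steps=50, seed=0):
    """Toy Large Neighborhood Search over binary vectors.

    Each step frees `neighborhood_size` variables (chosen at random here),
    keeps all others fixed, and exhaustively searches the freed variables
    for the assignment with the lowest cost.
    """
    rng = random.Random(seed)
    current = [0] * n_vars                      # placeholder initial assignment
    best_cost = cost(current)
    for _ in range(n_steps):
        free = rng.sample(range(n_vars), neighborhood_size)
        # Sub-solve: enumerate every assignment of the freed variables
        for values in itertools.product((0, 1), repeat=neighborhood_size):
            candidate = list(current)
            for idx, val in zip(free, values):
                candidate[idx] = val
            c = cost(candidate)
            if c < best_cost:                   # accept only strict improvements
                current, best_cost = candidate, c
    return current, best_cost

# Usage: recover a hidden target vector by minimizing Hamming distance.
target = [1, 0, 1, 1, 0, 1, 0, 1]
hamming = lambda v: sum(a != b for a, b in zip(v, target))
solution, final_cost = lns_minimize(hamming, n_vars=len(target))
print(solution, final_cost)
```

    The quality of the neighborhood choice determines how quickly the loop improves, which is the part the paper proposes to learn rather than sample at random.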
    Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions. (arXiv:2205.10218v1 [cs.LG])
    Generalization across different environments with the same tasks is critical for successful applications of visual reinforcement learning (RL) in real scenarios. However, visual distractions -- which are common in real scenes -- from high-dimensional observations can be harmful to the learned representations in visual RL, thus degrading generalization performance. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information by learning reward sequence distributions (RSDs), as the reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task -- that is, predicting the characteristic functions of RSDs -- to learn task-relevant representations, because high-dimensional distributions can be approximated well by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves generalization performance on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.
    Explanatory machine learning for sequential human teaching. (arXiv:2205.10250v1 [cs.AI])
    The topic of comprehensibility of machine-learned theories has recently drawn increasing attention. Inductive Logic Programming (ILP) uses logic programming to derive logic theories from small data based on abduction and induction techniques. Learned theories are represented in the form of rules as declarative descriptions of obtained knowledge. In earlier work, the authors provided the first evidence of a measurable increase in human comprehension based on machine-learned logic rules for simple classification tasks. In a later study, it was found that the presentation of machine-learned explanations to humans can produce both beneficial and harmful effects in the context of game learning. We continue our investigation of comprehensibility by examining the effects of the ordering of concept presentations on human comprehension. In this work, we examine the explanatory effects of curriculum order and the presence of machine-learned explanations for sequential problem-solving. We show that 1) there exist tasks A and B such that learning A before B yields better human comprehension than learning B before A and 2) there exist tasks A and B such that the presence of explanations when learning A contributes to improved human comprehension when subsequently learning B. We propose a framework for the effects of sequential teaching on comprehension based on an existing definition of comprehensibility and provide supporting evidence from data collected in human trials. Empirical results show that a) sequential teaching of concepts with increasing complexity has a beneficial effect on human comprehension, b) it leads to human re-discovery of divide-and-conquer problem-solving strategies, and c) studying machine-learned explanations allows adaptations of human problem-solving strategy with better performance.
    A Review of Safe Reinforcement Learning: Methods, Theory and Applications. (arXiv:2205.10330v1 [cs.AI])
    Reinforcement learning has achieved tremendous success in many complex decision making tasks. When it comes to deploying RL in the real world, safety concerns are usually raised, leading to a growing demand for safe reinforcement learning algorithms, such as in autonomous driving and robotics scenarios. While safety control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future research in this thread, in this paper, we provide a review for safe RL from the perspectives of methods, theory and applications. Firstly, we review the progress of safe RL from five dimensions and come up with five problems that are crucial for safe RL being deployed in real-world applications, coined as "2H3W". Secondly, we analyze the theory and algorithm progress from the perspectives of answering the "2H3W" problems. Then, the sample complexity of safe RL methods is reviewed and discussed, followed by an introduction of the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire more future research on this thread. To advance the study of safe RL algorithms, we release a benchmark suite, an open-sourced repository containing the implementations of major safe RL algorithms, along with tutorials at the link: https://github.com/chauncygu/Safe-Reinforcement-Learning-Baselines.git.
    Automated Scoring for Reading Comprehension via In-context BERT Tuning. (arXiv:2205.09864v1 [cs.LG])
    Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two limitations: 1) they fail to leverage item linkage for scenarios such as reading comprehension where multiple items may share a reading passage; 2) they are not scalable since storing one model per item becomes difficult when models have a large number of parameters. In this paper, we report our (grand prize-winning) solution to the National Assessment of Educational Progress (NAEP) automated scoring challenge for reading comprehension. Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items with a carefully designed input structure to provide contextual information on each item. We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge. We also discuss the biases, common error types, and limitations of our approach.
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v1 [cs.LG])
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple post hoc method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    Algorithms for Weak Optimal Transport with an Application to Economics. (arXiv:2205.09825v1 [stat.ML])
    The theory of weak optimal transport (WOT), introduced by [Gozlan et al., 2017], generalizes the classic Monge-Kantorovich framework by allowing the transport cost between one point and the points it is matched with to be nonlinear. In the so-called barycentric version of WOT, the cost for transporting a point $x$ only depends on $x$ and on the barycenter of the points it is matched with. This aggregation property of WOT is appealing in machine learning, economics and finance. Yet algorithms to compute WOT have only been developed for the special case of quadratic barycentric WOT, or depend on neural networks with no guarantee on the computed value and matching. The main difficulty lies in the transportation constraints, which are costly to project onto. In this paper, we propose to use mirror descent algorithms to solve the primal and dual versions of the WOT problem. We also apply our algorithms to the variant of WOT introduced by [Choné et al., 2022] where mass is distributed from one space to another through unnormalized kernels (WOTUK). We empirically compare the solutions of WOT and WOTUK with classical OT. We apply our numerical methods to the economic framework of [Choné and Kramarz, 2021], namely the matching between workers and firms on labor markets.
    Speeding up PCA with priming. (arXiv:2109.03709v3 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame.
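    The two-step structure described in the abstract above (a cheap approximate estimate, then an exact PCA inside the subspace it spans) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, a couple of rounds of block power iteration stand in for the priming algorithm (the abstract mentions Oja's rule and EigenGame instead), and the data matrix `X` is assumed to be centered.

```python
import numpy as np

def prime_subspace(X, k, n_iter=2, seed=0):
    """Stand-in priming step (an assumption): a few rounds of block power
    iteration, returning a rough orthonormal basis Q of shape (d, k)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], k)))
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X.T @ (X @ Q))     # multiply by X^T X, re-orthonormalize
    return Q

def primed_pca(X, Q):
    """Exact PCA restricted to span(Q): eigendecompose the small k x k
    covariance of the projected data, then map back to d dimensions."""
    Z = X @ Q                                  # (n, k) data in the primed subspace
    small_cov = Z.T @ Z / X.shape[0]           # (k, k), cheap because k is small
    eigvals, eigvecs = np.linalg.eigh(small_cov)
    order = np.argsort(eigvals)[::-1]          # sort by descending variance
    return Q @ eigvecs[:, order], eigvals[order]

# Usage on synthetic centered data with decaying per-feature scale.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10)) * np.linspace(3.0, 0.5, 10)
X -= X.mean(axis=0)
components, variances = primed_pca(X, prime_subspace(X, k=3))
print(variances)
```

    The second step only ever factorizes a k x k matrix, which is why it adds little cost on top of the priming run.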
    Track Boosting and Synthetic Data Aided Drone Detection. (arXiv:2111.12389v5 [cs.CV] UPDATED)
    This paper describes the first-place winning solution of the Drone vs. Bird Challenge, organized by AVSS 2021. As the usage of drones increases with lowered costs and improved drone technology, drone detection emerges as a vital object detection task. However, detecting distant drones under unfavorable conditions, namely weak contrast, long range, and low visibility, requires effective algorithms. Our method approaches the drone detection problem by fine-tuning a YOLOv5 model with real and synthetically generated data and using a Kalman-based object tracker to boost detection confidence. Our results indicate that augmenting the real data with an optimal subset of synthetic data can increase the performance. Moreover, temporal information gathered by object tracking methods can increase performance further.
    Robust Multi-Task Learning and Online Refinement for Spacecraft Pose Estimation across Domain Gap. (arXiv:2203.04275v3 [cs.CV] UPDATED)
    This work presents Spacecraft Pose Network v2 (SPNv2), a Convolutional Neural Network (CNN) for pose estimation of noncooperative spacecraft across domain gap. SPNv2 is a multi-scale, multi-task CNN which consists of a shared multi-scale feature encoder and multiple prediction heads that perform different tasks on a shared feature output. These tasks are all related to detection and pose estimation of a target spacecraft from an image, such as prediction of pre-defined satellite keypoints, direct pose regression, and binary segmentation of the satellite foreground. It is shown that by jointly training on different yet related tasks with extensive data augmentations on synthetic images only, the shared encoder learns features that are common across image domains that have fundamentally different visual characteristics compared to synthetic images. This work also introduces Online Domain Refinement (ODR) which refines the parameters of the normalization layers of SPNv2 on the target domain images online at deployment. Specifically, ODR performs self-supervised entropy minimization of the predicted satellite foreground, thereby improving the CNN's performance on the target domain images without their pose labels and with minimal computational efforts. The GitHub repository for SPNv2 is available at https://github.com/tpark94/spnv2.
    Content-Context Factorized Representations for Automated Speech Recognition. (arXiv:2205.09872v1 [eess.AS])
    Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
    Delator: Automatic Detection of Money Laundering Evidence on Transaction Graphs via Neural Networks. (arXiv:2205.10293v1 [cs.LG])
    Money laundering is one of the most relevant criminal activities today, due to its potential to cause massive financial losses to governments, banks, etc. We propose DELATOR, a new CAAT (computer-assisted audit technology) to detect money laundering activities based on neural network models that encode bank transfers as a large-scale temporal graph. In collaboration with a Brazilian bank, we design and apply an evaluation strategy to quantify DELATOR's performance on historic data comprising millions of clients. DELATOR outperforms an off-the-shelf solution from Amazon AWS by 18.9% with respect to AUC. We conducted real experiments that led to the discovery of 8 new suspicious cases among 100 analyzed cases, which would have been reported to the authorities under the current criteria.
    What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment. (arXiv:2205.10327v1 [stat.ME])
    The fundamental problem of causal inference -- that we never observe counterfactuals -- prevents us from identifying how many might be negatively affected by a proposed intervention. If, in an A/B test, half of users click (or buy, or watch, or renew, etc.), whether exposed to the standard experience A or a new one B, hypothetically it could be because the change affects no one, because the change positively affects half the user population to go from no-click to click while negatively affecting the other half, or something in between. While unknowable, this impact is clearly of material importance to the decision to implement a change or not, whether due to fairness, long-term, systemic, or operational considerations. We therefore derive the tightest-possible (i.e., sharp) bounds on the fraction negatively affected (and other related estimands) given data with only factual observations, whether experimental or observational. Naturally, the more we can stratify individuals by observable covariates, the tighter the sharp bounds. Since these bounds involve unknown functions that must be learned from data, we develop a robust inference algorithm that is efficient almost regardless of how and how fast these functions are learned, remains consistent when some are mislearned, and still gives valid conservative bounds when most are mislearned. Our methodology altogether therefore strongly supports credible conclusions: it avoids spuriously point-identifying this unknowable impact, focusing on the best bounds instead, and it permits exceedingly robust inference on these. We demonstrate our method in simulation studies and in a case study of career counseling for the unemployed.
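    The A/B-test example above can be worked through numerically. The following is a minimal sketch, not the paper's estimator: it assumes a binary outcome and no covariates, in which case the sharp bounds on the fraction negatively affected reduce to the classical Fréchet-Hoeffding bounds on the joint probability P(Y(0)=1, Y(1)=0). The function name `fna_bounds` and its parameters are hypothetical.

```python
from typing import Tuple


def fna_bounds(p_control: float, p_treated: float) -> Tuple[float, float]:
    """Bounds on the fraction negatively affected for a binary outcome.

    p_control: P(Y(0) = 1), success rate under the standard experience A.
    p_treated: P(Y(1) = 1), success rate under the new experience B.
    "Negatively affected" means success under A but failure under B.
    Only the two marginals are used (no covariates), so these are the
    Frechet-Hoeffding bounds on P(Y(0) = 1, Y(1) = 0).
    """
    for p in (p_control, p_treated):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    lower = max(0.0, p_control - p_treated)
    upper = min(p_control, 1.0 - p_treated)
    return lower, upper


# Half of users click in both arms: anywhere from nobody to half of the
# population may be negatively affected.
print(fna_bounds(0.5, 0.5))
```

    Stratifying by covariates, as the abstract describes, would amount to computing such bounds within each stratum and averaging them, which can only tighten the interval.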
    Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN) for Short-Term Forecasting of Transit Passenger Flow. (arXiv:2107.13226v2 [cs.LG] UPDATED)
    Short-term forecasting of passenger flow is critical for transit management and crowd regulation. Spatial dependencies, temporal dependencies, inter-station correlations driven by other latent factors, and exogenous factors bring challenges to the short-term forecasts of passenger flow of urban rail transit networks. An innovative deep learning approach, Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN), is proposed to forecast passenger flow in urban rail transit systems while incorporating these complex factors. We propose to use multiple graphs to encode the spatial and other heterogeneous inter-station correlations. The temporal dynamics of the inter-station correlations are also modeled via the proposed multi-graph convolutional-recurrent neural network structure. Inflow and outflow of all stations can be collectively predicted multiple time steps ahead via a sequence-to-sequence (seq2seq) architecture. The proposed method is applied to the short-term forecasts of passenger flow in Shenzhen Metro, China. The experimental results show that MGC-RNN outperforms the benchmark algorithms in terms of forecasting accuracy. Besides, it is found that the inter-station correlations driven by network distance, network structure, and recent flow patterns are significant factors for passenger flow forecasting. Moreover, the LSTM encoder-decoder architecture can capture the temporal dependencies well. In general, the proposed framework could provide multiple views of passenger flow dynamics for fine-grained prediction and shows potential for multi-source heterogeneous data fusion in spatiotemporal forecasting tasks.
    Adaptor: Objective-Centric Adaptation Framework for Language Models. (arXiv:2203.03989v2 [cs.CL] UPDATED)
    Progress in natural language processing research is catalyzed by the possibilities offered by widespread software frameworks. This paper introduces the Adaptor library, which transposes the traditional model-centric approach composed of pre-training + fine-tuning steps into an objective-centric approach, composing the training process by applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation. Adaptor aims to ease reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.
    Capturing cross-session neural population variability through self-supervised identification of consistent neuron ensembles. (arXiv:2205.09829v1 [q-bio.NC])
    Decoding stimuli or behaviour from recorded neural activity is a common approach to interrogate brain function in research, and an essential part of brain-computer and brain-machine interfaces. Reliable decoding even from small neural populations is possible because high dimensional neural population activity typically occupies low dimensional manifolds that are discoverable with suitable latent variable models. Over time however, drifts in activity of individual neurons and instabilities in neural recording devices can be substantial, making stable decoding over days and weeks impractical. While this drift cannot be predicted on an individual neuron level, population level variations over consecutive recording sessions such as differing sets of neurons and varying permutations of consistent neurons in recorded data may be learnable when the underlying manifold is stable over time. Classification of consistent versus unfamiliar neurons across sessions and accounting for deviations in the order of consistent recording neurons in recording datasets over sessions of recordings may then maintain decoding performance. In this work we show that self-supervised training of a deep neural network can be used to compensate for this inter-session variability. As a result, a sequential autoencoding model can maintain state-of-the-art behaviour decoding performance for completely unseen recording sessions several days into the future. Our approach only requires a single recording session for training the model, and is a step towards reliable, recalibration-free brain computer interfaces.
    Evolving SimGANs to Improve Abnormal Electrocardiogram Classification. (arXiv:2205.10116v1 [cs.NE])
    Machine Learning models are used in a wide variety of domains. However, machine learning methods often require a large amount of data in order to be successful. This is especially troublesome in domains where collecting real-world data is difficult and/or expensive. Data simulators do exist for many of these domains, but they do not sufficiently reflect the real world data due to factors such as a lack of real-world noise. Recently generative adversarial networks (GANs) have been modified to refine simulated image data into data that better fits the real world distribution, using the SimGAN method. While evolutionary computing has been used for GAN evolution, there are currently no frameworks that can evolve a SimGAN. In this paper we (1) extend the SimGAN method to refine one-dimensional data, (2) modify Easy Cartesian Genetic Programming (ezCGP), an evolutionary computing framework, to create SimGANs that more accurately refine simulated data, and (3) create new feature-based quantitative metrics to evaluate refined data. We also use our framework to augment an electrocardiogram (ECG) dataset, a domain that suffers from the issues previously mentioned. In particular, while healthy ECGs can be simulated there are no current simulators of abnormal ECGs. We show that by using an evolved SimGAN to refine simulated healthy ECG data to mimic real-world abnormal ECGs, we can improve the accuracy of abnormal ECG classifiers.
    LeNSE: Learning To Navigate Subgraph Embeddings for Large-Scale Combinatorial Optimisation. (arXiv:2205.10106v1 [cs.LG])
    Combinatorial Optimisation (CO) problems arise in several application domains and are often formulated in terms of graphs. Many of these problems are NP-hard, but exact solutions are not always needed. Several heuristics have been developed to provide near-optimal solutions; however, they do not typically scale well with the size of the graph. We propose a low-complexity approach for identifying a (possibly much smaller) subgraph of the original graph where the heuristics can be run in reasonable time and with a high likelihood of finding a global near-optimal solution. The core component of our approach is LeNSE, a reinforcement learning algorithm that learns how to navigate the space of possible subgraphs using a Euclidean subgraph embedding as its map. To solve CO problems, LeNSE is provided with a discriminative embedding trained using any existing heuristic on only a small portion of the original graph. When tested on three problems (vertex cover, max-cut and influence maximisation) using real graphs with up to 10 million edges, LeNSE identifies small subgraphs yielding solutions comparable to those found by running the heuristics on the entire graph, but at a fraction of the total run time.
    A Unified Approach to Synchronization Problems over Subgroups of the Orthogonal Group. (arXiv:2009.07514v2 [math.OC] UPDATED)
    The problem of synchronization over a group $\mathcal{G}$ aims to estimate a collection of group elements $G^*_1, \dots, G^*_n \in \mathcal{G}$ based on noisy observations of a subset of all pairwise ratios of the form $G^*_i {G^*_j}^{-1}$. Such a problem has gained much attention recently and finds many applications across a wide range of scientific and engineering areas. In this paper, we consider the class of synchronization problems in which the group is a closed subgroup of the orthogonal group. This class covers many group synchronization problems that arise in practice. Our contribution is fivefold. First, we propose a unified approach for solving this class of group synchronization problems, which consists of a suitable initialization step and an iterative refinement step based on the generalized power method, and show that it enjoys a strong theoretical guarantee on the estimation error under certain assumptions on the group, measurement graph, noise, and initialization. Second, we formulate two geometric conditions that are required by our approach and show that they hold for various practically relevant subgroups of the orthogonal group. The conditions are closely related to the error-bound geometry of the subgroup -- an important notion in optimization. Third, we verify the assumptions on the measurement graph and noise for standard random graph and random matrix models. Fourth, based on the classic notion of metric entropy, we develop and analyze a novel spectral-type estimator. Finally, we show via extensive numerical experiments that our proposed non-convex approach outperforms existing approaches in terms of computational speed, scalability, and/or estimation error.
    Global and Individualized Community Detection in Inhomogeneous Multilayer Networks. (arXiv:2012.00933v3 [math.ST] UPDATED)
    In network applications, it has become increasingly common to obtain datasets in the form of multiple networks observed on the same set of subjects, where each network is obtained in a related but different experiment condition or application scenario. Such datasets can be modeled by multilayer networks where each layer is a separate network itself while different layers are associated and share some common information. The present paper studies community detection in a stylized yet informative inhomogeneous multilayer network model. In our model, layers are generated by different stochastic block models, the community structures of which are (random) perturbations of a common global structure while the connecting probabilities in different layers are not related. Focusing on the symmetric two block case, we establish minimax rates for both global estimation of the common structure and individualized estimation of layer-wise community structures. Both minimax rates have sharp exponents. In addition, we provide an efficient algorithm that is simultaneously asymptotic minimax optimal for both estimation tasks under mild conditions. The optimal rates depend on the parity of the number of most informative layers, a phenomenon that is caused by inhomogeneity across layers. The method is extended to handle multiple and potentially asymmetric community cases. We demonstrate its effectiveness on both simulated examples and a real multi-modal single-cell dataset.
    Learning List-wise Representation in Reinforcement Learning for Ads Allocation with Multiple Auxiliary Tasks. (arXiv:2204.00888v2 [cs.LG] UPDATED)
    With the recent prevalence of reinforcement learning (RL), there has been tremendous interest in utilizing RL for ads allocation in recommendation platforms (e.g., e-commerce and news feed sites). To achieve better allocation, the input of recent RL-based ads allocation methods has been upgraded from a point-wise single item to a list-wise item arrangement. However, this also results in a high-dimensional space of state-action pairs, making it difficult to learn list-wise representations with good generalization ability. This further hinders the exploration of RL agents and causes poor sample efficiency. To address this problem, we propose a novel RL-based approach for ads allocation which learns better list-wise representations by leveraging task-specific signals on the Meituan food delivery platform. Specifically, we propose three different auxiliary tasks based on reconstruction, prediction, and contrastive learning respectively, according to prior domain knowledge on ads allocation. We conduct extensive experiments on the Meituan food delivery platform to evaluate the effectiveness of the proposed auxiliary tasks. Both offline and online experimental results show that the proposed method can learn better list-wise representations and achieve higher revenue for the platform compared to the state-of-the-art baselines.
    Kernel Normalized Convolutional Networks. (arXiv:2205.10089v1 [cs.LG])
    Existing deep convolutional neural network (CNN) architectures frequently rely upon batch normalization (BatchNorm) to effectively train the model. BatchNorm significantly improves model performance, but performs poorly with smaller batch sizes. To address this limitation, we propose kernel normalization and kernel normalized convolutional layers, and incorporate them into kernel normalized convolutional networks (KNConvNets) as the main building blocks. We implement KNConvNets corresponding to the state-of-the-art CNNs such as ResNet and DenseNet while forgoing BatchNorm layers. Through extensive experiments, we illustrate that KNConvNets consistently outperform their batch, group, and layer normalized counterparts in terms of both accuracy and convergence rate while maintaining competitive computational efficiency.
    Edge Rewiring Goes Neural: Boosting Network Resilience without Rich Features. (arXiv:2110.09035v2 [cs.LG] UPDATED)
    Improving the resilience of a network is a fundamental problem in network science, which protects the underlying system from natural disasters and malicious attacks. This is traditionally achieved via successive degree-preserving edge rewiring operations, with the major limitation of being transductive. Inductively solving graph-related tasks with sequential actions is accomplished by adopting graph neural networks (GNNs) coupled with reinforcement learning under the scenario with rich graph features. However, such frameworks cannot be directly applied to resilience tasks where only pure topological structure is available. In this case, GNNs can barely learn useful information, resulting in prohibitive difficulty in making actions for successively rewiring edges under a reinforcement learning context. In this paper, we study in depth the reasons why typical GNNs cause such failure. Based on this investigation, we propose ResiNet, the first end-to-end trainable inductive framework to discover Resilient Network topologies while balancing network utility. To this end, we reformulate resilience optimization as an MDP equipped with edge rewiring action space, and propose a pure topology-oriented variant of GNN called Filtration enhanced Graph Neural Network (FireGNN), which can learn from graphs without rich features. Extensive experiments demonstrate that ResiNet achieves a near-optimal resilience gain on various graphs while balancing the utility, and outperforms existing approaches by a large margin.
    Proposition-Level Clustering for Multi-Document Summarization. (arXiv:2112.08770v2 [cs.CL] UPDATED)
    Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition. In particular, clusters were leveraged to indicate information saliency as well as to avoid redundancy. Such prior methods focused on clustering sentences, even though closely related sentences usually also contain non-aligned parts. In this work, we revisit the clustering approach, grouping together sub-sentential propositions, aiming at more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster via text fusion. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and human preference.
    Time Series Anomaly Detection via Reinforcement Learning-Based Model Selection. (arXiv:2205.09884v1 [cs.LG])
    Time series anomaly detection is of critical importance for the reliable and efficient operation of real-world systems. Many anomaly detection models have been developed throughout the years based on various assumptions regarding anomaly characteristics. However, due to the complex nature of real-world data, different anomalies within a time series usually have diverse profiles supporting different anomaly assumptions, making it difficult to find a single anomaly detector that can consistently beat all other models. In this work, to harness the benefits of different base models, we assume that a pool of anomaly detection models is accessible and propose to utilize reinforcement learning to dynamically select a candidate model from these base models. Experiments on real-world data demonstrate that the proposed strategy can outperform all baseline models in terms of overall performance.
    Seeking entropy: complex behavior from intrinsic motivation to occupy action-state path space. (arXiv:2205.10316v1 [cs.AI])
    Intrinsic motivation generates behaviors that do not necessarily lead to immediate reward, but help exploration and learning. Here we show that agents having the sole goal of maximizing occupancy of future actions and states, that is, moving and exploring on the long term, are capable of complex behavior without any reference to external rewards. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy with the optimal state-value function, from where we prove uniqueness of the solution of the associated Bellman equation and convergence of our algorithm to the optimal state-value function. Using discrete and continuous state tasks, we show that 'dancing', hide-and-seek and a basic form of altruistic behavior naturally result from entropy seeking without external rewards. Intrinsically motivated agents can objectively determine what states constitute rewards, exploiting them to ultimately maximize action-state path entropy.
    Concurrent Policy Blending and System Identification for Generalized Assistive Control. (arXiv:2205.09836v1 [cs.RO])
    In this work, we address the problem of solving complex collaborative robotic tasks subject to multiple varying parameters. Our approach combines simultaneous policy blending with system identification to create generalized policies that are robust to changes in system parameters. We employ a blending network whose state space relies solely on parameter estimates from a system identification technique. As a result, this blending network learns how to handle parameter changes instead of trying to learn how to solve the task for a generalized parameter set simultaneously. We demonstrate our scheme's ability on a collaborative robot and human itching task in which the human has motor impairments. We then showcase our approach's efficiency with a variety of system identification techniques when compared to standard domain randomization.
    Converting Artificial Neural Networks to Spiking Neural Networks via Parameter Calibration. (arXiv:2205.10121v1 [cs.NE])
    Spiking Neural Network (SNN), originating from the neural behavior in biology, has been recognized as one of the next-generation neural networks. Conventionally, SNNs can be obtained by converting from pre-trained Artificial Neural Networks (ANNs) by replacing the non-linear activation with spiking neurons without changing the parameters. In this work, we argue that simply copying and pasting the weights of ANN to SNN inevitably results in activation mismatch, especially for ANNs that are trained with batch normalization (BN) layers. To tackle the activation mismatch issue, we first provide a theoretical analysis by decomposing local conversion error into clipping error and flooring error, and then quantitatively measure how this error propagates throughout the layers using a second-order analysis. Motivated by the theoretical results, we propose a set of layer-wise parameter calibration algorithms, which adjust the parameters to minimize the activation mismatch. Extensive experiments for the proposed algorithms are performed on modern architectures and large-scale tasks including ImageNet classification and MS COCO detection. We demonstrate that our method can handle the SNN conversion with batch normalization layers and effectively preserve high accuracy even with 32 time steps. For example, our calibration algorithms can increase accuracy by up to 65% when converting VGG-16 with BN layers.
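    The split of the local conversion error into a clipping part and a flooring part can be illustrated with a small sketch. It rests on an assumption that is not stated in the abstract: an idealized integrate-and-fire neuron whose average output over T time steps with firing threshold v_th equals (v_th / T) * clip(floor(a * T / v_th), 0, T) for an input activation a. The function names are hypothetical.

```python
import math
from typing import Tuple


def snn_rate_output(a: float, v_th: float, T: int) -> float:
    """Idealized average SNN output for activation a (assumed model).

    The neuron can fire at most T spikes, each worth v_th / T, so the
    output is a floored and clipped version of the ReLU activation.
    """
    spikes = min(max(math.floor(a * T / v_th), 0), T)
    return spikes * v_th / T


def conversion_error_terms(a: float, v_th: float, T: int) -> Tuple[float, float]:
    """Split ReLU(a) - SNN(a) into (clipping_error, flooring_error)."""
    relu = max(a, 0.0)
    clipped = min(relu, v_th)           # saturation at the threshold
    clipping_error = relu - clipped     # lost because a exceeds v_th
    flooring_error = clipped - snn_rate_output(a, v_th, T)  # lost to quantization
    return clipping_error, flooring_error


# An activation inside the range only suffers flooring error;
# one above the threshold only suffers clipping error.
print(conversion_error_terms(0.375, 1.0, 4))
print(conversion_error_terms(1.5, 1.0, 4))
```

    Under this assumed model, increasing T shrinks the flooring error while raising v_th shrinks the clipping error, which is one way to see why calibrating parameters per layer could reduce the mismatch.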
    Personalized Federated Learning with Adaptive Batchnorm for Healthcare. (arXiv:2112.00734v3 [cs.LG] UPDATED)
    There is a growing interest in applying machine learning techniques to healthcare. Recently, federated learning (FL) is gaining popularity since it allows researchers to train powerful models without compromising data privacy and security. However, the performance of existing FL approaches often deteriorates when encountering non-iid situations where there exist distribution gaps among clients, and few previous efforts focus on personalization in healthcare. In this article, we propose FedAP to tackle domain shifts and then obtain personalized models for local clients. FedAP learns the similarity between clients based on the statistics of the batch normalization layers while preserving the specificity of each client with different local batch normalization. Comprehensive experiments on five healthcare benchmarks demonstrate that FedAP achieves better accuracy compared to state-of-the-art methods (e.g., 10% accuracy improvement for PAMAP2) with faster convergence speed.
    Lossless Acceleration for Seq2seq Generation with Aggressive Decoding. (arXiv:2205.10350v1 [cs.CL])
    We study lossless acceleration for seq2seq generation with a novel decoding algorithm -- Aggressive Decoding. Unlike the previous efforts (e.g., non-autoregressive decoding) speeding up seq2seq generation at the cost of quality loss, our approach aims to yield the identical (or better) generation compared with autoregressive decoding but in a significant speedup, achieved by innovative cooperation of aggressive decoding and verification that are both efficient due to parallel computing. We propose two Aggressive Decoding paradigms for 2 kinds of seq2seq tasks: 1) For the seq2seq tasks whose inputs and outputs are highly similar (e.g., Grammatical Error Correction), we propose Input-guided Aggressive Decoding (IAD) that aggressively copies from the input sentence as drafted decoded tokens to verify in parallel; 2) For other general seq2seq tasks (e.g., Machine Translation), we propose Generalized Aggressive Decoding (GAD) that first employs an additional non-autoregressive decoding model for aggressive decoding and then verifies in parallel in the autoregressive manner. We test Aggressive Decoding on the most popular 6-layer Transformer model on GPU in multiple seq2seq tasks: 1) For IAD, we show that it can introduce a 7x-9x speedup for the Transformer in Grammatical Error Correction and Text Simplification tasks with the identical results as greedy decoding; 2) For GAD, we observe a 3x-5x speedup with the identical or even better quality in two important seq2seq tasks: Machine Translation and Abstractive Summarization. Moreover, Aggressive Decoding can benefit even more from stronger computing devices that are better at parallel computing. Given the lossless quality as well as significant and promising speedup, we believe Aggressive Decoding may potentially evolve into a de facto standard for efficient and lossless seq2seq generation in the near future.
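    The draft-then-verify idea described above can be sketched with a toy stand-in for the model. This is an illustration under stated assumptions, not the authors' implementation: `next_token` is a hypothetical deterministic greedy decoder, `propose_draft` is a hypothetical drafter (copying from the input, in the spirit of IAD), and the inner loop stands in for what a Transformer would verify in a single parallel pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]
Drafter = Callable[[Sequence[Token]], Sequence[Token]]


def greedy_decode(next_token: NextToken, eos: Token, max_len: int) -> List[Token]:
    """Plain autoregressive greedy decoding, one token per model call."""
    out: List[Token] = []
    while len(out) < max_len:
        tok = next_token(out)
        out.append(tok)
        if tok == eos:
            break
    return out


def draft_and_verify_decode(
    next_token: NextToken, propose_draft: Drafter, eos: Token, max_len: int
) -> Tuple[List[Token], int]:
    """Decode by verifying drafted tokens; returns (tokens, verification passes).

    Every emitted token is the model's own greedy choice, so the result
    matches greedy decoding. A draft only determines how many tokens one
    verification pass can accept.
    """
    out: List[Token] = []
    passes = 0
    done = False
    while not done and len(out) < max_len:
        # An empty draft falls back to a one-token placeholder so that
        # each pass still emits at least one token.
        draft = list(propose_draft(out)) or [eos]
        passes += 1
        for drafted in draft:
            expected = next_token(out)  # in a real model: one parallel pass
            out.append(expected)
            if expected == eos or len(out) >= max_len:
                done = True
                break
            if expected != drafted:     # first mismatch ends this pass
                break
    return out, passes


EOS = 0
source = [5, 6, 4, 8, 9, EOS]   # input sentence with one "error" (token 4)
target = [5, 6, 7, 8, 9, EOS]   # what the toy model greedily produces


def toy_next_token(prefix: Sequence[Token]) -> Token:
    return target[len(prefix)] if len(prefix) < len(target) else EOS


def copy_from_source(prefix: Sequence[Token]) -> Sequence[Token]:
    return source[len(prefix):]


tokens, passes = draft_and_verify_decode(toy_next_token, copy_from_source, EOS, 10)
print(tokens, passes)
```

    In this toy run the output is identical to greedy decoding but needs two verification passes instead of six sequential steps, because the draft differs from the model's output at only one position.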
    Anomaly Detection for Multivariate Time Series on Large-scale Fluid Handling Plant Using Two-stage Autoencoder. (arXiv:2205.09924v1 [cs.LG])
    This paper focuses on anomaly detection for multivariate time series data in large-scale fluid handling plants with dynamic components, such as power generation, water treatment, and chemical plants, where signals from various physical phenomena are observed simultaneously. In these plants, the need for anomaly detection techniques is increasing in order to reduce the cost of operation and maintenance, in view of a decline in the number of skilled engineers and a shortage of manpower. However, considering the complex behavior of high-dimensional signals and the demand for interpretability, the techniques constitute a major challenge. We introduce a Two-Stage AutoEncoder (TSAE) as an anomaly detection method suitable for such plants. This is a simple autoencoder architecture that makes anomaly detection more interpretable and more accurate, in which based on the premise that plant signals can be separated into two behaviors that have almost no correlation with each other, the signals are separated into long-term and short-term components in a stepwise manner, and the two components are trained independently to improve the inference capability for normal signals. Through experiments on two publicly available datasets of water treatment systems, we have confirmed the high detection performance, the validity of the premise, and that the model behavior was as intended, i.e., the technical effectiveness of TSAE.
    SADAM: Stochastic Adam, A Stochastic Operator for First-Order Gradient-based Optimizer. (arXiv:2205.10247v1 [cs.LG])
    In this work, to efficiently help escape stationary and saddle points, we propose, analyze, and generalize a stochastic strategy performed as an operator for a first-order gradient descent algorithm in order to increase the target accuracy and reduce time consumption. Unlike existing algorithms, the proposed stochastic strategy does not require any batches or sampling techniques, enabling efficient implementation and maintaining the initial first-order optimizer's convergence rate, while providing an incomparable improvement of target accuracy when optimizing the target functions. In short, the proposed strategy is generalized, applied to Adam, and validated via the decomposition of biomedical signals using Deep Matrix Fitting and another four peer optimizers. The validation results show that the proposed random strategy can be easily generalized for first-order optimizers and efficiently improve the target accuracy.
    Cross DQN: Cross Deep Q Network for Ads Allocation in Feed. (arXiv:2109.04353v4 [cs.LG] UPDATED)
    E-commerce platforms usually display a mixed list of ads and organic items in feed. One key problem is to allocate the limited slots in the feed to maximize the overall revenue as well as improve user experience, which requires a good model for user preference. Instead of modeling the influence of individual items on user behaviors, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most previous strategies fail to model such a signal and therefore result in suboptimal performance. In addition, the percentage of ads exposed (PAE) is an important indicator in ads allocation. Excessive PAE hurts user experience while too low a PAE reduces platform revenue. Therefore, how to constrain the PAE within a certain range while keeping personalized recommendation under the PAE constraint is a challenge. In this paper, we propose Cross Deep Q Network (Cross DQN) to extract the crucial arrangement signal by crossing the embeddings of different items and modeling the crossed sequence by multi-channel attention. Besides, we propose an auxiliary loss for batch-level constraint on PAE to tackle the above-mentioned challenge. Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on Meituan feed to serve more than 300 million customers.
    Learning Interface Conditions in Domain Decomposition Solvers. (arXiv:2205.09833v1 [cs.LG])
    Domain decomposition methods are widely used and effective in the approximation of solutions to partial differential equations. Yet the optimal construction of these methods requires tedious analysis and is often available only in simplified, structured-grid settings, limiting their use for more complex problems. In this work, we generalize optimized Schwarz domain decomposition methods to unstructured-grid problems, using Graph Convolutional Neural Networks (GCNNs) and unsupervised learning to learn optimal modifications at subdomain interfaces. A key ingredient in our approach is an improved loss function, enabling effective training on relatively small problems, but robust performance on arbitrarily large problems, with computational cost linear in problem size. The performance of the learned linear solvers is compared with both classical and optimized domain decomposition algorithms, for both structured- and unstructured-grid problems.
    On Jointly Optimizing Partial Offloading and SFC Mapping: A Cooperative Dual-agent Deep Reinforcement Learning Approach. (arXiv:2205.09925v1 [cs.AI])
    Multi-access edge computing (MEC) and network function virtualization (NFV) are promising technologies to support emerging IoT applications, especially computation-intensive ones. In an NFV-enabled MEC environment, a service function chain (SFC), i.e., a set of ordered virtual network functions (VNFs), can be mapped on MEC servers. Mobile devices (MDs) can offload computation-intensive applications, which can be represented by SFCs, fully or partially to MEC servers for remote execution. This paper studies the partial offloading and SFC mapping joint optimization (POSMJO) problem in an NFV-enabled MEC system, where an incoming task can be partitioned into two parts, one for local execution and the other for remote execution. The objective is to minimize the average cost in the long term, which is a combination of execution delay, MD's energy consumption, and usage charge for edge computing. This problem consists of two closely related decision-making steps, namely task partition and VNF placement, which makes it highly complex and quite challenging. To address this, we propose a cooperative dual-agent deep reinforcement learning (CDADRL) algorithm, where we design a framework enabling interaction between two agents. Simulation results show that the proposed algorithm outperforms three combinations of deep reinforcement learning algorithms in terms of cumulative and average episodic rewards, and it outperforms a number of baseline algorithms with respect to execution delay, energy consumption, and usage charge.
    Almost exact recovery in noisy semi-supervised learning. (arXiv:2007.14717v3 [cs.LG] UPDATED)
    Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v2 [cs.LG] UPDATED)
    Multivariate Time Series Forecasting focuses on the prediction of future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. This paper addresses these problems by translating multivariate forecasting into a spatiotemporal sequence formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning fully-connected spatiotemporal relationships purely from data.
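The spatiotemporal sequence formulation described above, where each token carries the value of one variable at one timestep, can be illustrated with a minimal sketch. The function name and the `(timestep, variable, value)` tuple layout are assumptions made for illustration only; they are not taken from the Spacetimeformer implementation, which would additionally embed these fields before feeding them to a Transformer.

```python
def flatten_to_spatiotemporal_tokens(series):
    """Flatten a multivariate series into one token per (timestep, variable).

    `series[t][v]` is the value of variable `v` at timestep `t`.
    Returns a list of (timestep_index, variable_index, value) tuples,
    so a series with T steps and N variables yields T * N tokens.
    """
    tokens = []
    for t, row in enumerate(series):
        for v, value in enumerate(row):
            tokens.append((t, v, value))
    return tokens


# Hypothetical toy input: 3 timesteps, 2 variables.
toy_series = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_tokens = flatten_to_spatiotemporal_tokens(toy_series)
print(len(toy_tokens))  # 3 timesteps * 2 variables = 6 tokens
```

The sequence length grows from T to T * N, which is why the abstract refers to an extended sequence and to long-range attention.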
    EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. (arXiv:2202.05146v3 [q-bio.BM] UPDATED)
    Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.
    Graph Representation Learning for Multi-Task Settings: a Meta-Learning Approach. (arXiv:2201.03326v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have become the state-of-the-art method for many applications on graph structured data. GNNs are a model for graph representation learning, which aims at learning to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. While this approach achieves great results in the single-task setting, the generation of node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is still an open problem. We propose the use of meta-learning to allow the training of a GNN model capable of producing multi-task node embeddings. In particular, we exploit the properties of optimization-based meta-learning to learn GNNs that can produce general node representations by learning parameters that can quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks. Our experiments show that the embeddings produced by a model trained with our purposely designed meta-learning procedure can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
    MCMARL: Parameterizing Value Function via Mixture of Categorical Distributions for Multi-Agent Reinforcement Learning. (arXiv:2202.10134v2 [cs.LG] UPDATED)
    In cooperative multi-agent tasks, a team of agents jointly interact with an environment by taking actions, receiving a team reward and observing the next state. During the interactions, the uncertainty of environment and reward will inevitably induce stochasticity in the long-term returns and the randomness can be exacerbated with the increasing number of agents. However, such randomness is ignored by most of the existing value-based multi-agent reinforcement learning (MARL) methods, which only model the expectation of Q-value for both individual agents and the team. Compared to using the expectations of the long-term returns, it is preferable to directly model the stochasticity by estimating the returns through distributions. With this motivation, this work proposes a novel value-based MARL framework from a distributional perspective, i.e., parameterizing value function via Mixture of Categorical distributions for MARL (MCMARL). Specifically, we model both individual Q-values and global Q-value with categorical distribution. To integrate categorical distributions, we define five basic operations on the distribution, which allow the generalization of expected value function factorization methods (e.g., VDN and QMIX) to their MCMARL variants. We further prove that our MCMARL framework satisfies the Distributional-Individual-Global-Max (DIGM) principle with respect to the expectation of distribution, which guarantees the consistency between joint and individual greedy action selections in the global Q-value and individual Q-values. Empirically, we evaluate MCMARL on both a stochastic matrix game and a challenging set of StarCraft II micromanagement tasks, showing the efficacy of our framework.
    Revisiting GANs by Best-Response Constraint: Perspective, Methodology, and Application. (arXiv:2205.10146v1 [cs.LG])
    In past years, the minimax-type single-level optimization formulation and its variations have been widely utilized to address Generative Adversarial Networks (GANs). Unfortunately, it has been proved that these alternating learning strategies cannot exactly reveal the intrinsic relationship between the generator and discriminator, and thus easily result in a series of issues, including mode collapse, vanishing gradients, and oscillations in the training phase. In this work, by investigating the fundamental mechanism of GANs from the perspective of hierarchical optimization, we propose Best-Response Constraint (BRC), a general learning framework that can explicitly formulate the potential dependency of the generator on the discriminator. Rather than adopting existing time-consuming bilevel iterations, we design an implicit gradient scheme with outer-product Hessian approximation as our fast solution strategy. Notably, we demonstrate that even with different motivations and formulations, a variety of existing GANs can all be uniformly improved by our flexible BRC methodology. Extensive quantitative and qualitative experimental results verify the effectiveness, flexibility and stability of our proposed framework.
    Towards the Generation of Synthetic Images of Palm Vein Patterns: A Review. (arXiv:2205.10179v1 [cs.CV])
    With the recent success of computer vision and deep learning, remarkable progress has been achieved on automatic personal recognition using vein biometrics. However, collecting large-scale real-world training data for palm vein recognition has turned out to be challenging, mainly due to the noise and irregular variations included at the time of acquisition. Meanwhile, existing palm vein recognition datasets are usually collected under near-infrared light and lack detailed annotations on attributes (e.g., pose), so the influences of different attributes on vein recognition have been poorly investigated. Therefore, this paper examines the suitability of synthetic vein images generated to compensate for the urgent lack of publicly available large-scale datasets. Firstly, we present an overview of recent research progress on palm vein recognition, from the basic background knowledge to vein anatomical structure, data acquisition, public databases, and quality assessment procedures. Then, we focus on the state-of-the-art methods that have allowed the generation of vascular structures for biometric purposes and the modeling of biological networks with their respective application domains. In addition, we review the existing research on style transfer-based and biological nature-based algorithms for generating synthetic palm vein images. Afterward, we formalize a general flowchart for the creation of a synthetic database, comparing real palm vein images and generated synthetic samples to gain some insight into the development of a realistic vein imaging system. Ultimately, we conclude by discussing the challenges, insights, and future perspectives in generating synthetic palm vein images for further works.
    A Computational Framework of Cortical Microcircuits Approximates Sign-concordant Random Backpropagation. (arXiv:2205.07292v2 [cs.NE] UPDATED)
    Several recent studies attempt to address the biological implausibility of the well-known backpropagation (BP) method. While promising methods such as feedback alignment, direct feedback alignment, and their variants like sign-concordant feedback alignment tackle BP's weight transport problem, their validity remains controversial owing to a set of other unsolved issues. In this work, we answer the question of whether it is possible to realize random backpropagation solely based on mechanisms observed in neuroscience. We propose a hypothetical framework consisting of a new microcircuit architecture and its supporting Hebbian learning rules. Comprising three types of cells and two types of synaptic connectivity, the proposed microcircuit architecture computes and propagates error signals through local feedback connections and supports the training of multi-layered spiking neural networks with a globally defined spiking error function. We employ the Hebbian rule operating in local compartments to update synaptic weights and achieve supervised learning in a biologically plausible manner. Finally, we interpret the proposed framework from an optimization point of view and show its equivalence to sign-concordant feedback alignment. The proposed framework is benchmarked on several datasets including MNIST and CIFAR10, demonstrating promising BP-comparable accuracy.
    Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining. (arXiv:2110.08412v2 [cs.CL] UPDATED)
    To explain NLP models, importance measures such as attention, which indicate which input tokens are important for a prediction, are popular. However, an open question is how well these explanations accurately reflect a model's logic, a property called faithfulness. To answer this question, we propose a new faithfulness benchmark called Recursive ROAR. This works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve given a masking-ratio. Furthermore, we propose a summarizing metric using the area-between-curves, which allows for easy comparison across papers, models, and tasks. To provide a thorough review, we evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both computer vision and faithfulness of attention literature.
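The area-between-curves idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the area is computed with the trapezoidal rule, and no normalization is applied, whereas the paper may define and normalize its metric differently. A positive value means that masking allegedly important tokens hurt performance more than masking random tokens.

```python
def area_between_curves(masking_ratios, random_perf, measure_perf):
    """Trapezoidal area between a random-masking baseline curve and the
    curve obtained by masking tokens ranked by an importance measure.

    All three sequences must have the same length, and `masking_ratios`
    must be sorted in increasing order.
    """
    if not (len(masking_ratios) == len(random_perf) == len(measure_perf)):
        raise ValueError("all inputs must have the same length")
    # Gap between the baseline and the importance-measure curve at each ratio.
    gaps = [r - m for r, m in zip(random_perf, measure_perf)]
    area = 0.0
    for i in range(1, len(masking_ratios)):
        width = masking_ratios[i] - masking_ratios[i - 1]
        area += width * (gaps[i] + gaps[i - 1]) / 2.0
    return area


# Hypothetical performance values, for illustration only.
ratios = [0.0, 0.5, 1.0]
baseline = [0.9, 0.8, 0.5]   # model retrained after masking random tokens
measure = [0.9, 0.6, 0.5]    # model retrained after masking "important" tokens
print(area_between_curves(ratios, baseline, measure))
```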
    AutoFedNLP: An efficient FedNLP framework. (arXiv:2205.10162v1 [cs.LG])
    Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify adapters, small bottleneck modules inserted at a variety of model layers, as the key building blocks. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency are highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and client resources, and a non-optimal configuration could significantly slow down the training. To automate adapter configuration, we propose AutoFedNLP, a framework that enhances the existing FedNLP with two novel designs. First, AutoFedNLP progressively upgrades the adapter configuration throughout a training session. Second, AutoFedNLP continuously profiles future adapter configurations by allocating participant devices to trial groups. To minimize client-side computations, AutoFedNLP exploits the fact that a FedNLP client trains on the same samples repeatedly between consecutive changes of adapter configurations, and caches computed activations on clients. Extensive experiments show that AutoFedNLP can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5× faster compared to vanilla FedNLP and 48× faster compared to strong baselines.
    Learning Convolutional Neural Networks in the Frequency Domain. (arXiv:2204.06718v8 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have achieved impressive success in computer vision during the past few decades. The image convolution operation helps CNNs achieve good performance on image-related tasks. However, image convolution has high computational complexity and is hard to implement. This paper proposes CEMNet, which can be trained in the frequency domain. The main motivation of this research is that, based on the Cross-Correlation Theorem, the image convolution can be replaced by a straightforward element-wise multiplication in the frequency domain, which reduces the computational complexity. We further introduce a Weight Fixation mechanism to alleviate the problem of over-fitting, and analyze the working behavior of Batch Normalization, Leaky ReLU, and Dropout in the frequency domain to design their counterparts for CEMNet. Also, to deal with the complex inputs produced by the Discrete Fourier Transform, we design a two-branch network structure for CEMNet. Experimental results suggest that CEMNet achieves good performance on the MNIST and CIFAR-10 databases.
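The Cross-Correlation Theorem referred to above can be checked numerically on a small example. The sketch below is not CEMNet: it is a 1D, circular (periodic) illustration with a naive O(N^2) DFT written in plain Python, and all function names are chosen here for illustration. It compares a direct circular cross-correlation with the result of an element-wise product in the frequency domain, `X[f] * conj(W[f])`, followed by an inverse DFT.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(N^2)), for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def circular_cross_correlation(x, w):
    """Direct computation: r[k] = sum_t x[(t + k) mod N] * w[t]."""
    n = len(x)
    return [sum(x[(t + k) % n] * w[t] for t in range(n)) for k in range(n)]


def cross_correlation_via_dft(x, w):
    """Same quantity via the theorem: element-wise X * conj(W), then inverse DFT.

    Assumes real-valued inputs of equal length.
    """
    x_hat, w_hat = dft(x), dft(w)
    product = [xf * wf.conjugate() for xf, wf in zip(x_hat, w_hat)]
    return [value.real for value in idft(product)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0, 0.0]
print(circular_cross_correlation(signal, kernel))
print(cross_correlation_via_dft(signal, kernel))
```

The two printed lists agree up to floating-point error. In practice a fast Fourier transform would replace the naive DFT, which is where the computational saving comes from.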
    The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. (arXiv:2205.10144v1 [cs.CV])
    In laboratory object recognition tasks based on undistorted photographs, both adult humans and Deep Neural Networks (DNNs) perform close to ceiling. Unlike adults, whose object recognition performance is robust against a wide range of image distortions, DNNs trained on standard ImageNet (1.3M images) perform poorly on distorted images. However, the last two years have seen impressive gains in DNN distortion robustness, predominantly achieved through ever-increasing large-scale datasets that are orders of magnitude larger than ImageNet. While this simple brute-force approach is very effective in achieving human-level robustness in DNNs, it raises the question of whether human robustness, too, is simply due to extensive experience with (distorted) visual input during childhood and beyond. Here we investigate this question by comparing the core object recognition performance of 146 children (aged 4–15) against adults and against DNNs. We find, first, that already 4–6 year-olds show remarkable robustness to image distortions and outperform DNNs trained on ImageNet. Second, we estimated the number of "images" children have been exposed to during their lifetime. Compared to various DNNs, children's high robustness requires relatively little data. Third, when recognizing objects, children (like adults but unlike DNNs) rely heavily on shape but not on texture cues. Together our results suggest that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely to be the result of a mere accumulation of experience with distorted visual input. Even though current DNNs match human performance regarding robustness they seem to rely on different and more data-hungry strategies to do so.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v3 [cs.LG] UPDATED)
    A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v2 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    Task Relabelling for Multi-task Transfer using Successor Features. (arXiv:2205.10175v1 [cs.AI])
    Deep Reinforcement Learning has recently been very successful, with various works on complex domains. Most works are concerned with learning a single policy that solves the target task but is fixed, in the sense that if the environment changes the agent is unable to adapt to it. Successor Features (SFs) provide a mechanism for learning policies that are not tied to any particular reward function. In this work we investigate how SFs may be pre-trained without observing any reward in a custom environment that features resource collection, traps and crafting. After pre-training we expose the SF agents to various target tasks and see how well they can transfer to new tasks. Transferring is done without any further training on the SF agents, instead just by providing a task vector. For training the SFs we propose a task relabelling method which greatly improves the agent's performance.
    The Fairness of Credit Scoring Models. (arXiv:2205.10200v1 [stat.ML])
    In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g. gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call lack of fairness and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
    Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze? (arXiv:2205.10226v1 [cs.CL])
    Learned self-attention functions in state-of-the-art NLP models often correlate with human attention. We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention. We compare attention functions across two task-specific reading datasets for sentiment analysis and relation extraction. We find the predictiveness of large-scale pre-trained self-attention for human attention depends on 'what is in the tail', e.g., the syntactic nature of rare contexts. Further, we observe that task-specific fine-tuning does not increase the correlation with human task-specific reading. Through an input reduction experiment we give complementary insights on the sparsity and fidelity trade-off, showing that lower-entropy attention vectors are more faithful.
    A Novel Weighted Ensemble Learning Based Agent for the Werewolf Game. (arXiv:2205.09813v1 [cs.LG])
    Werewolf is a popular party game throughout the world, and research on its significance has progressed in recent years. The Werewolf game is based on conversation, and in order to win, participants must use all of their cognitive abilities. This communication game requires the playing agents to be very sophisticated to win. In this research, we generated a sophisticated agent to play the Werewolf game using a complex weighted ensemble learning approach. This research work aimed to estimate what other agents/players think of us in the game. The agent was developed by aggregating strategies of different participants in the AI Wolf competition and thereby learning from them using machine learning. Moreover, the agent created was able to perform much better than other competitors using very basic strategies to show the approach's effectiveness in the Werewolf game. The machine learning technique used here is not restricted to the Werewolf game but may be extended to any game that requires communication and action depending on other participants.
    Linearizing Transformer with Key-Value Memory. (arXiv:2203.12644v3 [cs.CL] UPDATED)
    Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based Transformers. Despite their unique merits, they usually suffer from a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain a computation gain when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving the efficiency even with short generation. It projects the source sequences into lower dimension representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism which renders the computation as light as a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks, including machine translation, abstractive text summarization, and language modeling.
    GNN-Geo: A Graph Neural Network-based Fine-grained IP geolocation Framework. (arXiv:2112.10767v6 [cs.LG] UPDATED)
    Rule-based fine-grained IP geolocation methods are hard to generalize to computer networks that do not follow the hypothesized rules. Recently, deep learning methods such as the multi-layer perceptron (MLP) have been tried to increase generalization capabilities. However, an MLP is not well suited to graph-structured data like networks: it treats IP addresses as isolated instances and ignores the connection information, which limits geolocation accuracy. In this work, we research how to increase the generalization capability with an emerging graph deep learning method, the Graph Neural Network (GNN). First, IP geolocation is re-formulated as an attributed graph node regression problem. Then, we propose a GNN-based IP geolocation framework named GNN-Geo. GNN-Geo consists of a preprocessor, an encoder, message passing (MP) layers and a decoder. The preprocessor and encoder transform measurement data into the initial node embeddings. MP layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to nodes' locations and relieves the convergence problem of GNN by considering prior knowledge. Experiments on different real-world datasets show that the proposed GNN-Geo outperforms the state-of-the-art rule-based and learning-based baselines on all datasets w.r.t. median error distance by 16% to 28%. This work verifies the great potential of GNN for fine-grained IP geolocation.
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v3 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information on consumers, effective personalization of offers, goods, and services has become a core business focus for companies to improve revenues and maintain competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with unknown data collection procedure. Importantly, in many business and medical settings, interpretability of a policy is essential. To address these challenges, we study the class of policies with linear decision boundaries and propose learning algorithms using tools from causal inference. We propose several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that an adapted Bayesian optimization algorithm is fast and effective. We test our algorithm with extensive simulation studies and apply it to an online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2% in net sales revenue, while also providing informative insights on which features are important for the decision-making process, e.g. when "Attribute 2" is large, marginal increase in GMV is low for discounts higher than 10%. Our findings suggest that the proposed policy learning algorithm provides a promising practical approach for interpretable personalization across a wide range of applications.
    Counterfactual Temporal Point Processes. (arXiv:2111.07603v2 [cs.LG] UPDATED)
    Machine learning models based on temporal point processes are the state of the art in a wide variety of applications involving discrete events in continuous time. However, these models lack the ability to answer counterfactual questions, which are increasingly relevant as these models are being used to inform targeted interventions. In this work, our goal is to fill this gap. To this end, we first develop a causal model of thinning for temporal point processes that builds upon the Gumbel-Max structural causal model. This model satisfies a desirable counterfactual monotonicity condition, which is sufficient to identify counterfactual dynamics in the process of thinning. Then, given an observed realization of a temporal point process with a given intensity function, we develop a sampling algorithm that uses the above causal model of thinning and the superposition theorem to simulate counterfactual realizations of the temporal point process under a given alternative intensity function. Simulation experiments using synthetic and real epidemiological data show that the counterfactual realizations provided by our algorithm may give valuable insights to enhance targeted interventions.
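For readers unfamiliar with thinning, the sketch below shows the standard (non-counterfactual) thinning procedure for sampling a temporal point process with a time-varying intensity: candidate events are drawn from a homogeneous Poisson process with rate `lambda_max`, and each candidate at time `t` is kept with probability `intensity(t) / lambda_max`. It is not the paper's counterfactual algorithm, which additionally couples the accept/reject decisions through a Gumbel-Max structural causal model. The function and parameter names are assumptions made for illustration, and `intensity(t)` is assumed never to exceed `lambda_max`.

```python
import random


def sample_by_thinning(intensity, lambda_max, horizon, rng):
    """Sample event times on (0, horizon] by thinning.

    intensity  -- function t -> rate at time t, assumed to satisfy
                  0 <= intensity(t) <= lambda_max for all t
    lambda_max -- rate of the homogeneous proposal process (must be > 0)
    horizon    -- end of the observation window
    rng        -- a random.Random instance, passed in for reproducibility
    """
    events = []
    t = 0.0
    while True:
        # Next candidate from the homogeneous proposal process.
        t += rng.expovariate(lambda_max)
        if t > horizon:
            break
        # Keep the candidate with probability intensity(t) / lambda_max.
        if rng.random() < intensity(t) / lambda_max:
            events.append(t)
    return events


# Hypothetical intensity: active during the first half of the window only.
events = sample_by_thinning(lambda t: 2.0 if t < 5.0 else 0.0,
                            lambda_max=2.0, horizon=10.0,
                            rng=random.Random(0))
print(len(events), "events, all before t = 5:", all(e < 5.0 for e in events))
```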
    On Algorithmic Stability in Unsupervised Representation Learning. (arXiv:2106.05238v3 [cs.LG] UPDATED)
    In this paper, we investigate the algorithmic stability of unsupervised representation learning with deep generative models, as a function of repeated re-training on the same input data. Algorithms for learning low dimensional linear representations -- for example principal components analysis (PCA), or linear independent components analysis (ICA) -- come with guarantees that they will always reveal the same latent representations (perhaps up to an arbitrary rotation or permutation). Unfortunately, for non-linear representation learning, such as in a variational auto-encoder (VAE) model trained by stochastic gradient descent, we have no such guarantees. Recent work on identifiability in non-linear ICA has introduced a family of deep generative models that have identifiable latent representations, achieved by conditioning on side information (e.g. informative labels). We empirically evaluate the stability of these models under repeated re-estimation of parameters, and compare them to both standard VAEs and deep generative models which learn to cluster in their latent space. Surprisingly, we discover side information is not necessary for algorithmic stability: using standard quantitative measures of identifiability, we find deep generative models with latent clusterings are empirically identifiable to the same degree as models which rely on auxiliary labels. We relate these results to the possibility of identifiable non-linear ICA.
    CertiFair: A Framework for Certified Global Fairness of Neural Networks. (arXiv:2205.09927v1 [cs.LG])
    We consider the problem of whether a Neural Network (NN) model satisfies global individual fairness. Individual fairness suggests that similar individuals with respect to a certain task are to be treated similarly by the decision model. In this work, we have two main objectives. The first is to construct a verifier which checks whether the fairness property holds for a given NN in a classification task, or provides a counterexample if it is violated, i.e., the model is fair if all similar individuals are classified the same, and unfair if a pair of similar individuals are classified differently. To that end, we construct a sound and complete verifier that verifies global individual fairness properties of ReLU NN classifiers using distance-based similarity metrics. The second objective of this paper is to provide a method for training provably fair NN classifiers from unfair (biased) data. We propose a fairness loss that can be used during training to enforce fair outcomes for similar individuals. We then provide provable bounds on the fairness of the resulting NN. We run experiments on commonly used, publicly available fairness datasets and show that global individual fairness can be improved by 96% without a significant drop in test accuracy.
    Unintended memorisation of unique features in neural networks. (arXiv:2205.10079v1 [cs.LG])
    Neural networks pose a privacy risk due to their propensity to memorise and leak training data. We show that unique features occurring only once in training data are memorised by discriminative multi-layer perceptrons and convolutional neural networks trained on benchmark imaging datasets. We design our method for settings where sensitive training data is not available, for example medical imaging. Our setting knows the unique feature, but not the training data, model weights or the unique feature's label. We develop a score estimating a model's sensitivity to a unique feature by comparing the KL divergences of the model's output distributions given modified out-of-distribution images. We find that typical strategies to prevent overfitting do not prevent unique feature memorisation, and that images containing a unique feature are highly influential, regardless of the influence of the images' other features. We also find a significant variation in memorisation with training seed. These results imply that neural networks pose a privacy risk to rarely occurring private information. This risk is more pronounced in healthcare applications since sensitive patient information can be memorised when it remains in training data due to an imperfect data sanitisation process.
    Unsupervised Out-of-Domain Detection via Pre-trained Transformers. (arXiv:2106.00948v2 [cs.CL] UPDATED)
    Deployed real-world machine learning applications are often subject to uncontrolled and even potentially malicious inputs. Those out-of-domain inputs can lead to unpredictable outputs and sometimes catastrophic safety issues. Prior studies on out-of-domain detection require in-domain task labels and are limited to supervised classification scenarios. Our work tackles the problem of detecting out-of-domain samples with only unsupervised in-domain data. We utilize the latent representations of pre-trained transformers and propose a simple yet effective method to transform features across all layers to construct out-of-domain detectors efficiently. Two domain-specific fine-tuning approaches are further proposed to boost detection accuracy. Our empirical evaluations of related methods on two datasets validate that our method greatly improves out-of-domain detection ability in a more general scenario.
    You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas. (arXiv:2205.10228v1 [cs.CL])
    Social chatbots, also known as chit-chat chatbots, evolve rapidly with large pretrained language models. Despite the huge progress, privacy concerns have arisen recently: training data of large language models can be extracted via model inversion attacks. On the other hand, the datasets used for training chatbots contain many private conversations between two individuals. In this work, we further investigate the privacy leakage of the hidden states of chatbots trained by language modeling, which has not been well studied yet. We show that speakers' personas can be inferred through a simple neural network with high accuracy. To this end, we propose effective defense objectives to protect against persona leakage from hidden states. We conduct extensive experiments to demonstrate that our proposed defense objectives can greatly reduce the attack accuracy from 37.6% to 0.5%. Meanwhile, the proposed objectives preserve language models' powerful generation ability.
    Policies for the Dynamic Traveling Maintainer Problem with Alerts. (arXiv:2105.15119v2 [math.OC] UPDATED)
    Downtime of industrial assets such as wind turbines and medical imaging devices comes at a sharp cost. To avoid such downtime costs, companies seek to initiate maintenance just before failure. Unfortunately, this is challenging for two reasons. First, asset failures are notoriously difficult to predict, even in the presence of real-time monitoring devices which signal early degradation. Second, the available resources to serve a network of geographically dispersed assets are typically limited. In this paper, we propose a novel dynamic traveling maintainer problem with alerts model that incorporates these two challenges and we provide three solution approaches on how to dispatch the limited resources. Namely, we propose: (i) Greedy heuristic approaches that rank assets on urgency, proximity and economic risk; (ii) A novel traveling maintainer heuristic approach that optimizes short-term costs; and (iii) A deep reinforcement learning (DRL) approach that optimizes long-term costs. Each approach has different requirements concerning the available alert information. Experiments with small asset networks show that all methods can approximate the optimal policy when given access to complete condition information. For larger networks, the proposed methods yield competitive policies, with DRL consistently achieving the lowest costs.
    A Unified Experiment Design Approach for Cyclic and Acyclic Causal Models. (arXiv:2205.10083v1 [cs.LG])
    We study experiment design for the unique identification of the causal graph of a system where the graph may contain cycles. The presence of cycles in the structure introduces major challenges for experiment design. Unlike the case of acyclic graphs, learning the skeleton of the causal graph from the observational distribution may not be possible. Furthermore, intervening on a variable does not necessarily lead to orienting all the edges incident to it. In this paper, we propose an experiment design approach that can learn both cyclic and acyclic graphs and hence unifies the task of experiment design for both types of graphs. We provide a lower bound on the number of experiments required to guarantee the unique identification of the causal graph in the worst case, showing that the proposed approach is order-optimal in terms of the number of experiments up to an additive logarithmic term. Moreover, we extend our result to the setting where the size of each experiment is bounded by a constant. For this case, we show that our approach is optimal in terms of the size of the largest experiment required for the unique identification of the causal graph in the worst case.
    Confident Clustering via PCA Compression Ratio and Its Application to Single-cell RNA-seq Analysis. (arXiv:2205.09849v1 [cs.LG])
    Unsupervised clustering algorithms for vectors have been widely used in the area of machine learning. Many applications, including the biological data we study in this paper, contain some boundary datapoints which show combined properties of two underlying clusters and could lower the performance of traditional clustering algorithms. We develop a confident clustering method aiming to diminish the influence of these datapoints and improve the clustering results. Concretely, for a list of datapoints, we give two clustering results. The first-round clustering attempts to classify only pure vectors with high confidence. Based on it, we classify more vectors with less confidence in the second round. We validate our algorithm on single-cell RNA-seq data, which is a powerful and widely used tool in biology. Our confident clustering shows high accuracy on our tested datasets. In addition, unlike traditional clustering methods in single-cell analysis, the confident clustering shows high stability under different choices of parameters.
    How to Guide Adaptive Depth Sampling? (arXiv:2205.10202v1 [cs.CV])
    Recent advances in depth sensing technologies allow fast electronic maneuvering of the laser beam, as opposed to fixed mechanical rotations. This will enable future sensors, in principle, to vary in real-time the sampling pattern. We examine here the abstract problem of whether adapting the sampling pattern for a given frame can reduce the reconstruction error or allow a sparser pattern. We propose a constructive generic method to guide adaptive depth sampling algorithms. Given a sampling budget B, a depth predictor P and a desired quality measure M, we propose an Importance Map that highlights important sampling locations. This map is defined for a given frame as the per-pixel expected value of M produced by the predictor P, given a pattern of B random samples. This map can be well estimated in a training phase. We show that a neural network can learn to produce a highly faithful Importance Map, given an RGB image. We then suggest an algorithm to produce a sampling pattern for the scene, which is denser in regions that are harder to reconstruct. The sampling strategy of our modular framework can be adjusted according to hardware limitations, type of depth predictor, and any custom reconstruction error measure that should be minimized. We validate through simulations that our approach outperforms grid and random sampling patterns as well as recent state-of-the-art adaptive algorithms.
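    As a rough illustration of the Importance Map definition above (the per-pixel expected value of M under the predictor P, given B random samples), the following sketch estimates the map by Monte Carlo averaging. It is not the authors' implementation: the nearest-neighbour predictor, the absolute-error measure, and the names `nearest_neighbor_predictor` and `importance_map` are all assumptions chosen to keep the example small.

```python
import numpy as np


def nearest_neighbor_predictor(depth, sample_rows, sample_cols):
    """Toy stand-in for the predictor P (an assumption, not the paper's model):
    every pixel takes the depth of its nearest sampled pixel."""
    h, w = depth.shape
    rows, cols = np.mgrid[0:h, 0:w]
    # Squared distance from every pixel to every sample, shape (h, w, B).
    d2 = (rows[..., None] - sample_rows) ** 2 + (cols[..., None] - sample_cols) ** 2
    nearest = np.argmin(d2, axis=-1)
    return depth[sample_rows[nearest], sample_cols[nearest]]


def importance_map(depth, budget, num_patterns=50, seed=0):
    """Monte Carlo estimate of the per-pixel expected error (here M = absolute
    error, also an assumption) over random sampling patterns of size `budget`."""
    rng = np.random.default_rng(seed)
    h, w = depth.shape
    total = np.zeros((h, w), dtype=float)
    for _ in range(num_patterns):
        # Draw B distinct pixel locations uniformly at random.
        flat = rng.choice(h * w, size=budget, replace=False)
        sample_rows, sample_cols = np.unravel_index(flat, (h, w))
        pred = nearest_neighbor_predictor(depth, sample_rows, sample_cols)
        total += np.abs(pred - depth)
    return total / num_patterns


# Usage: a synthetic scene with one depth discontinuity.
depth = np.ones((8, 8))
depth[:, 4:] = 5.0
imap = importance_map(depth, budget=6, num_patterns=20)
```

    In this toy setting a perfectly flat scene yields an all-zero map, since any sampling pattern reconstructs it exactly; non-zero values can only appear where the scene varies.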
    Generalisation effects of predictive uncertainty estimation in deep learning for digital pathology. (arXiv:2112.09693v2 [cs.LG] UPDATED)
    Deep learning (DL) has shown great potential in digital pathology applications. The robustness of a diagnostic DL-based solution is essential for safe clinical deployment. In this work we evaluate whether adding uncertainty estimates for DL predictions in digital pathology could result in increased value for clinical applications, by boosting the general predictive performance or by detecting mispredictions. We compare the effectiveness of model-integrated methods (MC dropout and Deep ensembles) with a model-agnostic approach (Test time augmentation, TTA). Moreover, four uncertainty metrics are compared. Our experiments focus on two domain shift scenarios: a shift to a different medical center and to an underrepresented subtype of cancer. Our results show that uncertainty estimates increase reliability by reducing a model's sensitivity to classification threshold selection as well as by detecting between 70% and 90% of the mispredictions made by the model. Overall, the deep ensembles method achieved the best performance, closely followed by TTA.
    Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning. (arXiv:2203.13319v2 [cs.LG] UPDATED)
    We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). ReF-ER was shown to outperform state-of-the-art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and we ignore the effects of other actions on the transition map. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state-of-the-art algorithms that rely on complex neural network architectures.
    Improving Multi-Task Generalization via Regularizing Spurious Correlation. (arXiv:2205.09797v1 [cs.LG])
    Multi-Task Learning (MTL) is a powerful learning paradigm to improve generalization performance via knowledge sharing. However, existing studies find that MTL could sometimes hurt generalization, especially when two tasks are less correlated. One possible reason that hurts generalization is spurious correlation, i.e., some knowledge is spurious and not causally related to task labels, but the model could mistakenly utilize it and thus fail when such correlation changes. In the MTL setup, there exist several unique challenges of spurious correlation. First, the risk of having non-causal knowledge is higher, as the shared MTL model needs to encode all knowledge from different tasks, and causal knowledge for one task could be potentially spurious to the other. Second, the confounder between task labels brings in a different type of spurious correlation to MTL. We theoretically prove that MTL is more prone to taking non-causal knowledge from other tasks than single-task learning, and thus generalizes worse. To solve this problem, we propose the Multi-Task Causal Representation Learning framework, aiming to represent multi-task knowledge via disentangled neural modules, and learn which module is causally related to each task via MTL-specific invariant regularization. Experiments show that it could enhance an MTL model's performance by 5.5% on average over Multi-MNIST, MovieLens, Taskonomy, CityScape, and NYUv2, by alleviating the spurious correlation problem.
    Continual learning on 3D point clouds with random compressed rehearsal. (arXiv:2205.08013v2 [cs.LG] UPDATED)
    Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are an important data type for precise modeling of three-dimensional environments, but effective processing of this type of data proves to be challenging. In the world of large, heavily-parameterized network architectures and continuously-streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. However, in 3D computer vision, there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We utilize point cloud structure properties for preserving a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods of the learning process, our approach achieves a significant decrease of catastrophic forgetting compared to the existing solutions on several of the most popular point cloud datasets, considering two continual learning settings: when a task is known beforehand, and the challenging scenario in which task information is unknown to the model.
    Translating Hanja historical documents to understandable Korean and English. (arXiv:2205.10019v1 [cs.CL])
    The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, 'Hanja', and translated into Korean from 1968 to 1993. However, this translation was literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012, completing the records of only one king in a decade. Also, expert translators are working on an English translation, of which only one king's records are available because of the high cost and slow progress. Thus, we propose H2KE, a neural machine translation model that translates Hanja historical documents to understandable Korean and English. Based on the multilingual neural machine translation approach, it translates historical documents written in Hanja, using both the full dataset of outdated Korean translation and a small dataset of recently translated Korean and English. We compare our method with two baselines: one is a recent model that simultaneously learns to restore and translate Hanja historical documents, and the other is a transformer trained on newly translated corpora only. The results show that our method significantly outperforms the baselines in terms of BLEU score in both modern Korean and English translations. We also conduct a human evaluation that shows that our translation is preferred over the original expert translation.
    A Case of Exponential Convergence Rates for SVM. (arXiv:2205.10055v1 [stat.ML])
    Classification is often the first problem described in introductory machine learning classes. Generalization guarantees of classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. Those "super" fast rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss, associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
    Semi-self-supervised Automated ICD Coding. (arXiv:2205.10088v1 [cs.CL])
    Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a consultation of a patient. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect which is diminished when clinical features from the examination of the patient and diagnostics are made available. We recommend our method for augmenting scarce datasets for systems that take decisions based on clinical features that do not include examinations or tests.
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v3 [stat.ML] UPDATED)
    Deep reinforcement learning (DRL) has attracted much attention as an approach to solve sequential decision making problems without mathematical models of systems or environments. In general, a constraint may be imposed on decision making. In this study, we consider optimal decision making problems with constraints to complete temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which is called a $\tau$-CMDP. We formulate the STL constrained optimal decision making problem as the $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.
    Towards biologically plausible Dreaming and Planning. (arXiv:2205.10044v1 [cs.LG])
    Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biologically implausible ingredients, such as the detailed storage of older experiences, and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) neural network in which "dreaming" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore "planning", an online alternative to dreaming, that shows comparable performance. Importantly, our model does not require the detailed storage of experiences, and learns the world-model online. This is a key ingredient for biological plausibility and implementability (e.g., in neuromorphic hardware). Our network is composed of spiking neurons, further increasing the energetic efficiency and the plausibility of the model. To our knowledge, there are no previous works proposing biologically plausible model-based reinforcement learning in recurrent spiking networks. Our work is a step toward building efficient neuromorphic systems for autonomous robots, capable of learning new skills in real-world environments. Even when the environment is no longer accessible, the robot optimizes learning by "reasoning" in its own "mind". These approaches are of great relevance when the acquisition from the environment is slow, expensive (robotics) or unsafe (autonomous driving).
    Adversarial Sample Detection for Speaker Verification by Neural Vocoders. (arXiv:2107.00309v4 [cs.SD] UPDATED)
    Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV. We use the neural vocoder to re-synthesize audio and find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discrimination between genuine and adversarial samples. This effort is, to the best of our knowledge, among the first to pursue such a technical direction for detecting time-domain adversarial samples for ASV, and hence there is a lack of established baselines for comparison. Consequently, we implement the Griffin-Lim algorithm as the detection baseline. The proposed approach achieves effective detection performance that outperforms the baselines in all the settings. We also show that the neural vocoder adopted in the detection framework is dataset-independent. Our code will be made open-source to enable fair comparison in future work.
    User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars. (arXiv:2205.10321v1 [eess.SP])
    Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc. Two emerging technologies in RF-sensing, namely sensing through Large Intelligent Surfaces (LISs) and mmWave Frequency-Modulated Continuous-Wave (FMCW) radars, have been successfully applied to a wide range of applications. In this work, we compare LIS and mmWave radars for localization in real-world and simulated environments. In our experiments, the mmWave radar achieves 0.71 Intersection Over Union (IOU) and 3cm error for bounding boxes, while LIS has 0.56 IOU and 10cm distance error. Although the radar outperforms the LIS in terms of accuracy, LIS features additional applications in communication in addition to sensing scenarios.
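    For readers unfamiliar with the Intersection Over Union (IOU) metric quoted above, a minimal sketch follows. It assumes axis-aligned 2D boxes given as `(x_min, y_min, x_max, y_max)`; the abstract does not state the box parameterization used in the experiments, so this layout and the function name `iou` are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption
    made for illustration, not taken from the paper.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Usage: two 2x2 boxes overlapping in a 1x1 square share 1/7 of their union.
score = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

    An IOU of 1 means the predicted and reference boxes coincide, and 0 means they do not overlap at all.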
    Visual Concepts Tokenization. (arXiv:2205.10093v1 [cs.CV])
    Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. Particularly, to obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and disentangling loss play the role of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
    The Fellowship of the Dyson Ring: ACT&Friends' Results and Methods for GTOC 11. (arXiv:2205.10124v1 [cs.AI])
    Dyson spheres are hypothetical megastructures encircling stars in order to harvest most of their energy output. During the 11th edition of the GTOC challenge, participants were tasked with a complex trajectory planning related to the construction of a precursor Dyson structure, a heliocentric ring made of twelve stations. To this purpose, we developed several new approaches that synthesize techniques from machine learning, combinatorial optimization, planning and scheduling, and evolutionary optimization effectively integrated into a fully automated pipeline. These include a machine learned transfer time estimator, improving the established Edelbaum approximation and thus better informing a Lazy Race Tree Search to identify and collect asteroids with high arrival mass for the stations; a series of optimally-phased low-thrust transfers to all stations computed by indirect optimization techniques, exploiting the synodic periodicity of the system; and a modified Hungarian scheduling algorithm, which utilizes evolutionary techniques to arrange a mass-balanced arrival schedule out of all transfer possibilities. We describe the steps of our pipeline in detail with a special focus on how our approaches mutually benefit from each other. Lastly, we outline and analyze the final solution of our team, ACT&Friends, which ranked second at the GTOC 11 challenge.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v3 [cs.LG] UPDATED)
    In lifelong learning systems based on artificial neural networks, one of the biggest obstacles is the inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this paper, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike networks of today, does not learn via the popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts synapses in a biologically-plausible fashion while another neural system learns to direct and control this cortex-like structure by mimicking some of the task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting compared to standard neural models, outperforming a swath of previously proposed methods, including rehearsal/data buffer-based methods, on both standard (SplitMNIST, Split Fashion MNIST, etc.) and custom benchmarks even though it is trained in a stream-like fashion. Our work offers evidence that emulating mechanisms in real neuronal systems, e.g., local learning, lateral competition, can yield new directions for tackling the grand challenge of lifelong machine learning.
    Nonlinear Initialization Methods for Low-Rank Neural Networks. (arXiv:2202.00834v3 [cs.LG] UPDATED)
    We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. The most successful existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is inspired by the insight that approximating the function corresponding to each layer is more important than approximating the parameter values. We provably demonstrate that there is a significant gap between these two approaches for ReLU networks, particularly as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter point holds when the network width is super-linear in dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem for fixed parameter rank $r$ -- previously, it was unknown whether the problem was computationally tractable even for rank $1$. We also provide a practical algorithm to solve this problem which is no more expensive than the existing spectral initialization approach, and validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
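    The spectral initialization baseline described above (sample a full-rank initialization, then take its best Frobenius-norm low-rank approximation via SVD) can be sketched as follows. This illustrates the baseline only, not the paper's proposed function-approximation method. The function name `spectral_init`, the Gaussian full-rank initialization, and the choice to split the singular values evenly between the two factors are assumptions.

```python
import numpy as np


def spectral_init(W, rank):
    """Factor W into A @ B with A of shape (m, rank) and B of shape (rank, n),
    using the truncated SVD (the best rank-`rank` approximation of W in the
    Frobenius norm)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Splitting each singular value evenly between the factors is one common
    # convention; it is an assumption here, not a detail from the abstract.
    root = np.sqrt(s[:rank])
    A = U[:, :rank] * root
    B = root[:, None] * Vt[:rank, :]
    return A, B, s


# Usage: a hypothetical 64x32 layer with a scaled Gaussian full-rank init.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32)) / np.sqrt(32)
A, B, s = spectral_init(W, rank=8)
err = np.linalg.norm(W - A @ B)  # Frobenius norm of the residual
```

    By the Eckart-Young theorem, the residual `err` equals the square root of the sum of the squared discarded singular values, which is the parameter-space error that this baseline minimizes.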
    Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization. (arXiv:2205.10232v1 [cs.LG])
    There is a broad consensus on the importance of deep learning models in tasks involving complex data. Often, an adequate understanding of these models is required when focusing on the transparency of decisions in human-critical applications. Besides other explainability techniques, trustworthiness can be achieved by using counterfactuals, like the way a human becomes familiar with an unknown process: by understanding the hypothetical circumstances under which the output changes. In this work we argue that automated counterfactual generation should regard several aspects of the produced adversarial instances, not only their adversarial capability. To this end, we present a novel framework for the generation of counterfactual examples which formulates its goal as a multi-objective optimization problem balancing three different objectives: 1) plausibility, i.e., the likeliness of the counterfactual of being possible as per the distribution of the input data; 2) intensity of the changes to the original input; and 3) adversarial power, namely, the variability of the model's output induced by the counterfactual. The framework departs from a target model to be audited and uses a Generative Adversarial Network to model the distribution of input data, together with a multi-objective solver for the discovery of counterfactuals balancing among these objectives. The utility of the framework is showcased over six classification tasks comprising image and three-dimensional data. The experiments verify that the framework unveils counterfactuals that comply with intuition, increasing the trustworthiness of the user, and leading to further insights, such as the detection of bias and data misrepresentation.
    Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors. (arXiv:2205.10279v1 [cs.LG])
    Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. Instead, we show that we can learn highly informative posteriors from the source task, through supervised or self-supervised approaches, which then serve as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on a variety of downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. These highly informative priors also can be saved for future use, similar to pre-trained weights, and stand in contrast to the zero-mean isotropic uninformative priors that are typically used in Bayesian deep learning.
    Heterformer: A Transformer Architecture for Node Representation Learning on Heterogeneous Text-Rich Networks. (arXiv:2205.10282v1 [cs.CL])
    We study node representation learning on heterogeneous text-rich networks, where nodes and edges are multi-typed and some types of nodes are associated with text information. Although recent studies on graph neural networks (GNNs) and pretrained language models (PLMs) have demonstrated their power in encoding network and text signals, respectively, less focus has been given to delicately coupling these two types of models on heterogeneous text-rich networks. Specifically, existing GNNs rarely model text in each node in a contextualized way; existing PLMs can hardly be applied to characterize graph structures due to their sequence architecture. In this paper, we propose Heterformer, a Heterogeneous GNN-nested transformer that blends GNNs and PLMs into a unified model. Different from previous "cascaded architectures" that directly add GNN layers upon a PLM, our Heterformer alternately stacks two modules - a graph-attention-based neighbor aggregation module and a transformer-based text and neighbor joint encoding module - to facilitate thorough mutual enhancement between network and text signals. Meanwhile, Heterformer is capable of characterizing network heterogeneity and nodes without text information. Comprehensive experiments on three large-scale datasets from different domains demonstrate the superiority of Heterformer over state-of-the-art baselines in link prediction, transductive/inductive node classification, node clustering, and semantics-based retrieval.
    Set-based Meta-Interpolation for Few-Task Meta-Learning. (arXiv:2205.09990v1 [cs.LG])
    Meta-learning approaches enable machine learning systems to adapt to new tasks given few examples by leveraging knowledge from related tasks. However, a large number of meta-training tasks are still required for generalization to unseen tasks during meta-testing, which introduces a critical bottleneck for real-world problems that come with only a few tasks, for reasons that include the difficulty and cost of constructing tasks. Recently, several task augmentation methods have been proposed to tackle this issue, using domain-specific knowledge to design augmentation techniques that densify the meta-training task distribution. However, such reliance on domain-specific knowledge renders these methods inapplicable to other domains. While Manifold Mixup based task augmentation methods are domain-agnostic, we empirically find them ineffective on non-image domains. To tackle these limitations, we propose a novel domain-agnostic task augmentation method, Meta-Interpolation, which utilizes expressive neural set functions to densify the meta-training task distribution using bilevel optimization. We empirically validate the efficacy of Meta-Interpolation on eight datasets spanning various domains such as image classification, molecule property prediction, text classification and speech recognition. Experimentally, we show that Meta-Interpolation consistently outperforms all the relevant baselines. Theoretically, we prove that task interpolation with the set function regularizes the meta-learner to improve generalization.
    A Proximal Algorithm for Sampling from Non-convex Potentials. (arXiv:2205.10188v1 [cs.LG])
    We study sampling problems associated with non-convex potentials that also lack smoothness. In particular, we consider target distributions that satisfy either a logarithmic-Sobolev inequality or a Poincaré inequality. Rather than being smooth, the potentials are assumed to be semi-smooth or the sum of multiple semi-smooth functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling in the non-convex and semi-smooth setting. This work extends the recent algorithm in \cite{LiaChe21,LiaChe22} for non-smooth/semi-smooth log-concave distributions to the setting with non-convex potentials. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves better complexity than all existing methods.
    Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. (arXiv:2205.09853v1 [cs.CV])
    Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. https://mask-cond-video-diffusion.github.io
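    The masking scheme described above can be sketched as follows. This is only an illustration of how independently masking the past and future blocks selects one of the four tasks; the function names, the masking probability `p_mask`, and the use of `None` to stand for a masked block are assumptions made for this example, not details taken from the paper.

```python
import random


def task_name(mask_past, mask_future):
    """Map the two mask flags to the task named in the abstract."""
    if mask_past and mask_future:
        return "unconditional generation"
    if mask_future:
        return "future prediction"
    if mask_past:
        return "past prediction"
    return "interpolation"


def sample_conditioning(past, future, p_mask=0.5, rng=random):
    """Independently mask the whole past block and the whole future block.

    p_mask is a hypothetical masking probability; a masked block is
    represented here by None (an assumption for illustration).
    """
    mask_past = rng.random() < p_mask
    mask_future = rng.random() < p_mask
    cond_past = None if mask_past else past
    cond_future = None if mask_future else future
    return cond_past, cond_future, task_name(mask_past, mask_future)


if __name__ == "__main__":
    random.seed(0)
    past_frames = ["p0", "p1"]
    future_frames = ["f0", "f1"]
    for _ in range(4):
        print(sample_conditioning(past_frames, future_frames))
```

    Because the two flags are drawn independently, a single training loop sees all four conditioning patterns, which is what lets one model serve every task.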
    A New Feature Selection Method for LogNNet and its Application for Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values. (arXiv:2205.09974v1 [cs.LG])
    Since February 2020, the world has been engaged in an intense struggle with COVID-19, and health systems have come under severe pressure as the disease turned into a pandemic. The aim of this study is to determine the most effective routine blood values (RBVs) in the diagnosis and prognosis of COVID-19 using a new feature selection method for the LogNNet reservoir neural network. The first dataset in this study consists of a total of 5296 patients with equal numbers of negative and positive COVID-19 tests. The second dataset consists of a total of 3899 patients with a diagnosis of COVID-19 who were treated in hospital, of whom 203 were severely infected and 3696 were mildly infected. The most important RBVs affecting the diagnosis of the disease in the first dataset were mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH) and activated partial prothrombin time (aPTT). The most effective features in the prognosis of the disease were erythrocyte sedimentation rate (ESR), neutrophil count (NEU) and C-reactive protein (CRP). The LogNNet model achieved an accuracy of A46 = 99.5% in the diagnosis of the disease with 46 features and A3 = 99.17% with only the MCHC, MCH, and aPTT features. The model reached an accuracy of A48 = 94.4% in determining the prognosis of the disease with 48 features and A3 = 82.7% with only the ESR, NEU, and CRP features. The LogNNet model demonstrated very high performance in the diagnosis and prognosis of COVID-19 without knowledge of the symptoms or history of the patients. The model is suitable for devices with low resources (3-14 kB of RAM used on the Arduino microcontroller), and is promising for creating mobile health monitoring systems in the Internet of Things. Our method will reduce the negative pressures on the health sector, help doctors understand the pathogenesis of COVID-19 through key features, and contribute positively to the treatment processes.
    The price of ignorance: how much does it cost to forget noise structure in low-rank matrix estimation?. (arXiv:2205.10009v1 [cs.IT])
    We consider the problem of estimating a rank-1 signal corrupted by structured rotationally invariant noise, and address the following question: how well do inference algorithms perform when the noise statistics are unknown and hence Gaussian noise is assumed? While the matched Bayes-optimal setting with unstructured noise is well understood, the analysis of this mismatched problem is still in its early stages. In this paper, we take a step towards understanding the effect of a strong source of mismatch, namely the noise statistics. Our main technical contribution is the rigorous analysis of a Bayes estimator and of an approximate message passing (AMP) algorithm, both of which incorrectly assume a Gaussian setup. The first result exploits the theory of spherical integrals and of low-rank matrix perturbations; the idea behind the second one is to design and analyze an artificial AMP which, by taking advantage of the flexibility in the denoisers, is able to "correct" the mismatch. Armed with these sharp asymptotic characterizations, we unveil a rich and often unexpected phenomenology. For example, although AMP is in principle designed to efficiently compute the Bayes estimator, the former is outperformed by the latter in terms of mean-square error. We show that this performance gap is due to an incorrect estimation of the signal norm. In fact, when the SNR is large enough, the overlaps of the AMP and the Bayes estimator coincide, and they even match those of optimal estimators taking into account the structure of the noise.
    Lifelong Personal Context Recognition. (arXiv:2205.10123v1 [cs.AI])
    We focus on the development of AIs which live in lifelong symbiosis with a human. The key prerequisite for this task is that the AI understands - at any moment in time - the personal situational context that the human is in. We outline the key challenges that this task brings forth, namely (i) handling the human-like and ego-centric nature of the user's context, necessary for understanding and providing useful suggestions, (ii) performing lifelong context recognition using machine learning in a way that is robust to change, and (iii) maintaining alignment between the AI's and human's representations of the world through continual bidirectional interaction. In this short paper, we summarize our recent attempts at tackling these challenges, discuss the lessons learned, and highlight directions of future research. The main take-away message is that pursuing this project requires research which lies at the intersection of knowledge representation and machine learning. Neither technology can achieve this goal without the other.
    Machine Learning for Combinatorial Optimisation of Partially-Specified Problems: Regret Minimisation as a Unifying Lens. (arXiv:2205.10157v1 [cs.LG])
    It is increasingly common to solve combinatorial optimisation problems that are partially-specified. We survey the case where the objective function or the relations between variables are not known or are only partially specified. The challenge is to learn them from available data, while taking into account a set of hard constraints that a solution must satisfy, and that solving the optimisation problem (esp. during learning) is computationally very demanding. This paper overviews four seemingly unrelated approaches, that can each be viewed as learning the objective function of a hard combinatorial optimisation problem: 1) surrogate-based optimisation, 2) empirical model learning, 3) decision-focused learning (`predict + optimise'), and 4) structured-output prediction. We formalise each learning paradigm, at first in the ways commonly found in the literature, and then bring the formalisations together in a compatible way using regret. We discuss the differences and interactions between these frameworks, highlight the opportunities for cross-fertilization and survey open directions.
    Triangulation candidates for Bayesian optimization. (arXiv:2112.07457v2 [stat.CO] UPDATED)
    Bayesian optimization involves "inner optimization" over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these "tricands" and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.
    Greedy structure learning from data that contain systematic missing values. (arXiv:2107.04184v3 [cs.LG] UPDATED)
    Learning from data that contain missing values is a common problem in many domains. Relatively few Bayesian Network (BN) structure learning algorithms account for missing data, and those that do tend to rely on standard approaches that assume missing data are missing at random, such as the Expectation-Maximisation algorithm. Because missing data are often systematic, there is a need for more pragmatic methods that can effectively deal with data sets containing missing values not missing at random. The absence of approaches that deal with systematic missing data impedes the application of BN structure learning methods to real-world problems where missingness is not random. This paper describes three variants of greedy search structure learning that utilise pairwise deletion and inverse probability weighting to maximally leverage the observed data and to limit potential bias caused by missing values. The first two variants can be viewed as sub-versions of the third and best performing variant, but are important in their own right in illustrating the successive improvements in learning accuracy. The empirical investigations show that the proposed approach outperforms the commonly used and state-of-the-art Structural EM algorithm in terms of both learning accuracy and efficiency, both when data are missing at random and when they are not.
    Robot Learning of Mobile Manipulation with Reachability Behavior Priors. (arXiv:2203.04051v2 [cs.RO] UPDATED)
    Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments. Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation. Reinforcement Learning (RL) holds the promise of endowing robots with adaptive behaviors, but most methods require prohibitively large amounts of data for learning a useful control policy. In this work, we study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks. Namely, we consider the problem of optimal base placement and the subsequent decision of whether to activate the arm for reaching a 6D target. For this, we devise a novel Hybrid RL method that handles discrete and continuous actions jointly, resorting to the Gumbel-Softmax reparameterization. Next, we train a reachability prior using data from the operational robot workspace, inspired by classical methods. Subsequently, we derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by modeling them as a sum of residual approximators. Every time a new task needs to be learned, we can transfer our learned residuals and learn the component of the Q-function that is task-specific, hence, maintaining the task structure from prior behaviors. Moreover, we find that regularizing the target policy with a prior policy yields more expressive behaviors. We evaluate our method in simulation in reaching and fetching tasks of increasing difficulty, and we show the superior performance of BHyRL against baseline methods. Finally, we zero-transfer our learned 6D fetching policy with BHyRL to our MM robot TIAGo++. For more details and code release, please refer to our project site: irosalab.com/rlmmbp
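    For readers unfamiliar with the Gumbel-Softmax reparameterization mentioned above, a minimal standard-library sketch of drawing one relaxed sample is given below. It shows the generic technique only, not the authors' BHyRL implementation; the function name, the default temperature, and the interpretation of the two logits as an arm-activation choice are assumptions made for this example.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample over discrete choices.

    Adds Gumbel(0, 1) noise to each logit and applies a
    temperature-scaled softmax; lower temperatures give
    samples closer to one-hot vectors.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical logits for a discrete "activate arm / do not activate" choice.
    print(gumbel_softmax_sample([0.3, -0.2], temperature=0.5))
```

    In an automatic-differentiation framework the same computation is differentiable with respect to the logits, which is what allows a discrete action to be trained jointly with continuous ones.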
    Test-time Batch Normalization. (arXiv:2205.10210v1 [cs.LG])
    Deep neural networks often suffer from data distribution shift between training and testing, and the batch statistics are observed to reflect the shift. In this paper, aiming to alleviate distribution shift at test time, we revisit batch normalization (BN) in the training process and reveal two key insights benefiting test-time optimization: $(i)$ preserving the same gradient backpropagation form as training, and $(ii)$ using dataset-level statistics for robust optimization and inference. Based on the two insights, we propose a novel test-time BN layer design, GpreBN, which is optimized during testing by minimizing an entropy loss. We verify the effectiveness of our method on two typical settings with distribution shift, i.e., domain generalization and robustness tasks. Our GpreBN significantly improves the test-time performance and achieves state-of-the-art results.
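    The entropy objective mentioned above can be illustrated with a small standard-library sketch. It computes only the Shannon entropy of softmax predictions averaged over a batch, which is the usual form of such a loss; the function names are hypothetical, and nothing here reproduces the GpreBN layer itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax prediction."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def batch_entropy_loss(batch_logits):
    """Mean prediction entropy over a batch of unlabeled test inputs."""
    return sum(prediction_entropy(row) for row in batch_logits) / len(batch_logits)


if __name__ == "__main__":
    confident = [8.0, 0.0, 0.0]
    uncertain = [0.1, 0.0, 0.0]
    print(prediction_entropy(confident), prediction_entropy(uncertain))
    print(batch_entropy_loss([confident, uncertain]))
```

    The loss needs no labels, so it can be minimized on test batches alone; confident predictions give low entropy and near-uniform predictions give high entropy.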
    DEMAND: Deep Matrix Approximately Nonlinear Decomposition to Identify Meta, Canonical, and Sub-Spatial Pattern of functional Magnetic Resonance Imaging in the Human Brain. (arXiv:2205.10264v1 [cs.LG])
    Deep Neural Networks (DNNs) have already become a crucial computational approach to revealing the spatial patterns in the human brain; however, there are three major shortcomings in utilizing DNNs to detect the spatial patterns in functional Magnetic Resonance Signals: 1) the fully connected architecture increases the complexity of the network structure, making it difficult to optimize and vulnerable to overfitting; 2) the requirement of large training samples results in erasing the individual/minor patterns in feature extraction; 3) the hyperparameters must be tuned manually, which is time-consuming. Therefore, we propose a novel deep nonlinear matrix factorization named Deep Matrix Approximately Nonlinear Decomposition (DEMAND) in this work to take advantage of both shallow linear models, e.g., Sparse Dictionary Learning (SDL), and DNNs. First, the proposed DEMAND employs a non-fully connected and multilayer-stacked architecture that is easier to optimize than canonical DNNs; furthermore, due to the efficient architecture, training DEMAND can avoid overfitting and enables the recognition of individual/minor features based on a small dataset, such as the data of a single individual; finally, a novel rank estimator technique is introduced to tune all hyperparameters of DEMAND automatically. Moreover, the proposed DEMAND is validated against four peer methodologies using real functional Magnetic Resonance Imaging data of the human brain. In short, the validation results demonstrate that DEMAND can reveal the reproducible meta, canonical, and sub-spatial features of the human brain more efficiently than the peer methodologies.
    Planning with Diffusion for Flexible Behavior Synthesis. (arXiv:2205.09991v1 [cs.LG])
    Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.
    Leveraging Relational Information for Learning Weakly Disentangled Representations. (arXiv:2205.10056v1 [cs.LG])
    Disentanglement is a difficult property to enforce in neural representations. This might be due, in part, to a formalization of the disentanglement problem that focuses too heavily on separating relevant factors of variation of the data in single isolated dimensions of the neural representation. We argue that such a definition might be too restrictive and not necessarily beneficial in terms of downstream tasks. In this work, we present an alternative view of learning (weakly) disentangled representations, which leverages concepts from relational learning. We identify the regions of the latent space that correspond to specific instances of generative factors, and we learn the relationships among these regions in order to perform controlled changes to the latent codes. We also introduce a compound generative model that implements such a weak disentanglement approach. Our experiments show that the learned representations can separate the relevant factors of variation in the data, while preserving the information needed for effectively generating high quality data samples.
    Explainable Supervised Domain Adaptation. (arXiv:2205.09943v1 [cs.LG])
    Domain adaptation techniques have contributed to the success of deep learning. Leveraging knowledge from an auxiliary source domain for learning in a target domain where labeled data are scarce is fundamental to domain adaptation. While these techniques increase accuracy, the adaptation process, particularly the knowledge leveraged from the source domain, remains unclear. This paper proposes an explainable-by-design supervised domain adaptation framework - XSDA-Net. We integrate a case-based reasoning mechanism into the XSDA-Net to explain the prediction of a test instance in terms of similar-looking regions in the source and target train images. We empirically demonstrate the utility of the proposed framework by curating the domain adaptation settings on datasets popularly known to exhibit part-based explainability.
    Towards Understanding Grokking: An Effective Theory of Representation Learning. (arXiv:2205.10343v1 [cs.LG])
    We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion. We find representation learning to occur only in a "Goldilocks zone" (including comprehension and grokking) between memorization and confusion. Compared to the comprehension phase, the grokking phase stays closer to the memorization phase, leading to delayed generalization. The Goldilocks phase is reminiscent of "intelligence from starvation" in Darwinian evolution, where resource limitations drive discovery of more efficient solutions. This study not only provides intuitive explanations of the origin of grokking, but also highlights the usefulness of physics-inspired tools, e.g., effective theories and phase diagrams, for understanding deep learning.
    Constructive Interpretability with CoLabel: Corroborative Integration, Complementary Features, and Collaborative Learning. (arXiv:2205.10011v1 [cs.CV])
    Machine learning models with explainable predictions are increasingly sought after, especially for real-world, mission-critical applications that require bias detection and risk mitigation. Inherent interpretability, where a model is designed from the ground-up for interpretability, provides intuitive insights and transparent explanations on model prediction and performance. In this paper, we present CoLabel, an approach to build interpretable models with explanations rooted in the ground truth. We demonstrate CoLabel in a vehicle feature extraction application in the context of vehicle make-model recognition (VMMR). CoLabel performs VMMR with a composite of interpretable features such as vehicle color, type, and make, all based on interpretable annotations of the ground truth labels. First, CoLabel performs corroborative integration to join multiple datasets that each have a subset of desired annotations of color, type, and make. Then, CoLabel uses decomposable branches to extract complementary features corresponding to desired annotations. Finally, CoLabel fuses them together for final predictions. During feature fusion, CoLabel harmonizes complementary branches so that VMMR features are compatible with each other and can be projected to the same semantic space for classification. With inherent interpretability, CoLabel achieves superior performance to the state-of-the-art black-box models, with accuracy of 0.98, 0.95, and 0.94 on CompCars, Cars196, and BoxCars116K, respectively. CoLabel provides intuitive explanations due to constructive interpretability, and subsequently achieves high accuracy and usability in mission-critical situations.
    Characteristic Neural Ordinary Differential Equations. (arXiv:2111.13207v3 [cs.LG] UPDATED)
    We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODEs model the evolution of latent variables as the solution to an ODE, C-NODE models the evolution of the latent variables as the solution of a family of first-order quasi-linear partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This in turn allows the application of the standard frameworks for solving ODEs, namely the adjoint method. Learning optimal characteristic curves for given tasks improves the performance and computational efficiency compared to state-of-the-art NODE models. We prove that the C-NODE framework extends the classical NODE on classification tasks by demonstrating explicit C-NODE representable functions not expressible by NODEs. Additionally, we present C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate the improvements provided by the proposed method for classification and density estimation on CIFAR-10, SVHN, and MNIST datasets under a similar computational budget as the existing NODE methods. The results also provide empirical evidence that the learned curves improve the efficiency of the system through a lower number of parameters and function evaluations compared with baselines.
    Towards Extremely Fast Bilevel Optimization with Self-governed Convergence Guarantees. (arXiv:2205.10054v1 [math.OC])
    Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning and vision fields. The validity of existing works heavily relies on solving a series of approximation subproblems with extraordinarily high accuracy. Unfortunately, achieving this approximation accuracy requires executing a large number of time-consuming iterations, which naturally causes a heavy computational burden. This paper is thus devoted to addressing this critical computational issue. In particular, we propose a single-level formulation to uniformly understand existing explicit and implicit Gradient-based BLOs (GBLOs). This, together with our designed counter-example, clearly illustrates the fundamental numerical and theoretical issues of GBLOs and their naive accelerations. By introducing the dual multipliers as a new variable, we then establish Bilevel Alternating Gradient with Dual Correction (BAGDC), a general framework which significantly accelerates different categories of existing methods by taking specific settings. A striking feature of our convergence result is that, compared to those original unaccelerated GBLO versions, the fast BAGDC admits a unified non-asymptotic convergence theory towards stationarity. A variety of numerical experiments have also been conducted to demonstrate the superiority of the proposed algorithmic framework.
    Minimal Explanations for Neural Network Predictions. (arXiv:2205.09901v1 [cs.LG])
    Explaining neural network predictions is known to be a challenging problem. In this paper, we propose a novel approach which can be effectively exploited, either in isolation or in combination with other methods, to enhance the interpretability of neural model predictions. For a given input to a trained neural model, our aim is to compute a smallest set of input features so that the model prediction changes when these features are disregarded by setting them to an uninformative baseline value. While computing such minimal explanations is computationally intractable in general for fully-connected neural networks, we show that the problem becomes solvable in polynomial time by a greedy algorithm under mild assumptions on the network's activation functions. We then show that our tractability result extends seamlessly to more advanced neural architectures such as convolutional and graph neural networks. We conduct experiments to showcase the capability of our method for identifying the input features that are essential to the model's prediction.
    Evolutionary Multi-Armed Bandits with Genetic Thompson Sampling. (arXiv:2205.10113v1 [cs.NE])
    As two popular schools of machine learning, online learning and evolutionary computation have become two important driving forces behind real-world decision making engines for applications in biomedicine, economics, and engineering. Although there is prior work that utilizes bandits to improve the optimization process of evolutionary algorithms, it remains largely unexplored how evolutionary approaches can help improve the sequential decision making tasks of online learning agents such as multi-armed bandits. In this work, we propose Genetic Thompson Sampling, a bandit algorithm that keeps a population of agents and updates them with genetic principles such as elite selection, crossover and mutation. Empirical results in multi-armed bandit simulation environments and a practical epidemic control problem suggest that by incorporating the genetic algorithm into the bandit algorithm, our method significantly outperforms the baselines in nonstationary settings. Lastly, we introduce EvoBandit, a web-based interactive visualization to guide readers through the entire learning process and perform lightweight evaluations on the fly. We hope this investigation will draw researchers into this growing field of research.
    Instagram Fake and Automated Account Detection. (arXiv:1910.03090v3 [cs.IR] UPDATED)
    Fake engagement, which is used to increase the popularity of an account in an inorganic manner, is one of the significant problems in Online Social Networks (OSNs). The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, faulty product prediction systems, and an unhealthy social network environment. This study concerns the detection of fake and automated accounts, which lead to fake engagement on Instagram. Prior to this work, there was no publicly available dataset for fake and automated accounts. For this purpose, two datasets have been published for the detection of fake and automated accounts. For the detection of these accounts, machine learning algorithms such as Naive Bayes, Logistic Regression, Support Vector Machines and Neural Networks are applied. Additionally, for the detection of automated accounts, a cost-sensitive genetic algorithm is proposed to handle the unnatural bias in the dataset. To deal with the class imbalance in the fake account dataset, the SMOTE-NC algorithm is applied. For the automated and fake account detection datasets, 86% and 96% classification accuracies are obtained, respectively.
    A Fully Controllable Agent in the Path Planning using Goal-Conditioned Reinforcement Learning. (arXiv:2205.09967v1 [cs.AI])
    The aim of path planning is to reach the goal from a starting point by searching for a route for the agent. In path planning, routes may vary depending on a number of variables, so it is important for the agent to be able to reach various goals. Numerous studies, however, have dealt with a single goal that is predefined by the user. In the present study, I propose a novel reinforcement learning framework for a fully controllable agent in path planning. To do this, I propose bi-directional memory editing to obtain various bi-directional trajectories of the agent, on which the behavior of the agent and the sub-goals are trained with goal-conditioned RL. To move the agent in various directions, I use a dedicated sub-goal network, separate from the policy network. Lastly, I present reward shaping to reduce the number of steps the agent needs to reach the goal. In the experiments, the agent was able to reach various goals that it had never visited during training. I confirmed that the agent could perform difficult missions such as a round trip, and that the agent used a shorter route with reward shaping.
    Generating Semantic Adversarial Examples via Feature Manipulation. (arXiv:2001.02297v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g. thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.
    Classification of Intra-Pulse Modulation of Radar Signals by Feature Fusion Based Convolutional Neural Networks. (arXiv:2205.09834v1 [cs.LG])
    Detection and classification of radars based on the pulses they transmit is an important application in electronic warfare systems. In this work, we propose a novel deep-learning-based technique that automatically recognizes intra-pulse modulation types of radar signals. The re-assigned spectrogram of the measured radar signal and the detected outliers of its instantaneous phases, filtered by a special function, are used for training multiple convolutional neural networks. Automatically extracted features from the networks are fused to distinguish frequency and phase modulated signals. Simulation results show that the proposed FF-CNN (Feature Fusion based Convolutional Neural Network) technique outperforms the current state-of-the-art alternatives and scales easily across a broad range of modulation types.
    BayesPCN: A Continually Learnable Predictive Coding Associative Memory. (arXiv:2205.09930v1 [cs.LG])
    Associative memory plays an important role in human intelligence and its mechanisms have been linked to attention in machine learning. While the machine learning community's interest in associative memories has recently been rekindled, most work has focused on memory recall ($read$) over memory learning ($write$). In this paper, we present BayesPCN, a hierarchical associative memory capable of performing continual one-shot memory writes without meta-learning. Moreover, BayesPCN is able to gradually forget past observations ($forget$) to free its memory. Experiments show that BayesPCN can recall corrupted i.i.d. high-dimensional data observed hundreds of "timesteps" ago without a significant drop in recall ability compared to the state-of-the-art offline-learned associative memory models.
    EXODUS: Stable and Efficient Training of Spiking Neural Networks. (arXiv:2205.10242v1 [cs.NE])
    Spiking Neural Networks (SNNs) are gaining significant traction in machine learning tasks where energy-efficiency is of utmost importance. Training such networks using the state-of-the-art back-propagation through time (BPTT) is, however, very time-consuming. Previous work by Shrestha and Orchard [2018] employs an efficient GPU-accelerated back-propagation algorithm called SLAYER, which speeds up training considerably. SLAYER, however, does not take into account the neuron reset mechanism while computing the gradients, which we argue to be the source of numerical instability. To counteract this, SLAYER introduces a gradient scale hyperparameter across layers, which needs manual tuning. In this paper, (i) we modify SLAYER and design an algorithm called EXODUS that accounts for the neuron reset mechanism and applies the Implicit Function Theorem (IFT) to calculate the correct gradients (equivalent to those computed by BPTT), (ii) we eliminate the need for ad-hoc scaling of gradients, thus reducing the training complexity tremendously, and (iii) we demonstrate, via computer simulations, that EXODUS is numerically stable and achieves comparable or better performance than SLAYER, especially in tasks with SNNs that rely on temporal features. Our code is available at https://github.com/synsense/sinabs-exodus.
    Sample Complexity of Learning Heuristic Functions for Greedy-Best-First and A* Search. (arXiv:2205.09963v1 [cs.LG])
    Greedy best-first search (GBFS) and A* search (A*) are popular algorithms for path-finding on large graphs. Both use so-called heuristic functions, which estimate how close a vertex is to the goal. While heuristic functions have been handcrafted using domain knowledge, recent studies demonstrate that learning heuristic functions from data is effective in many applications. Motivated by this emerging approach, we study the sample complexity of learning heuristic functions for GBFS and A*. We build on a recent framework called \textit{data-driven algorithm design} and evaluate the \textit{pseudo-dimension} of a class of utility functions that measure the performance of parameterized algorithms. Assuming that a vertex set of size $n$ is fixed, we present $\mathrm{O}(n\lg n)$ and $\mathrm{O}(n^2\lg n)$ upper bounds on the pseudo-dimensions for GBFS and A*, respectively, parameterized by heuristic function values. The upper bound for A* can be improved to $\mathrm{O}(n^2\lg d)$ if every vertex has a degree of at most $d$ and to $\mathrm{O}(n \lg n)$ if edge weights are integers bounded by $\mathrm{poly}(n)$. We also give $\Omega(n)$ lower bounds for GBFS and A*, which imply that our bounds for GBFS and A* under the integer-weight condition are tight up to a $\lg n$ factor. Finally, we discuss a case where the performance of A* is measured by the suboptimality and show that we can sometimes obtain a better guarantee by combining a parameter-dependent worst-case bound with a sample complexity bound.
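    To make concrete what "performance parameterized by heuristic function values" can mean, here is a small illustrative sketch (not taken from the paper). It runs a textbook A* search where the heuristic is just a table of per-vertex values and reports the number of expanded vertices as a stand-in for a utility function. The function name, the toy graph, and both heuristic tables are hypothetical.

```python
import heapq

def a_star(graph, start, goal, h):
    """Textbook A* on a weighted digraph.

    graph: {u: [(v, weight), ...]}; h: {vertex: heuristic value}.
    Returns (path cost or None, number of expanded vertices)."""
    open_heap = [(h[start], 0, start)]  # entries are (f = g + h, g, vertex)
    best_g = {start: 0}
    expansions = 0
    while open_heap:
        f, g, u = heapq.heappop(open_heap)
        if g > best_g.get(u, float("inf")):
            continue  # stale entry: a cheaper path to u was found later
        expansions += 1
        if u == goal:
            return g, expansions
        for v, w in graph.get(u, []):
            ng = g + w
            if ng < best_g.get(v, float("inf")):
                best_g[v] = ng
                heapq.heappush(open_heap, (ng + h[v], ng, v))
    return None, expansions

# Hypothetical toy instance: two routes from s to t, one short and one long.
graph = {"s": [("a", 1), ("b", 1)], "a": [("t", 1)], "b": [("c", 1)], "c": [("t", 1)]}
zero_h = {v: 0 for v in "sabct"}                         # uninformed heuristic
exact_h = {"s": 2, "a": 1, "b": 2, "c": 1, "t": 0}       # true distances to t

cost_zero, expanded_zero = a_star(graph, "s", "t", zero_h)
cost_exact, expanded_exact = a_star(graph, "s", "t", exact_h)
```

    Both runs find the same shortest path cost, but the informed table leads to fewer expansions. The sample-complexity question in the abstract is, roughly, how many problem instances are needed to trust such a performance measurement when the heuristic values are learned.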
    Towards Explanation for Unsupervised Graph-Level Representation Learning. (arXiv:2205.09934v1 [cs.LG])
    Due to the superior performance of Graph Neural Networks (GNNs) in various domains, there is an increasing interest in the GNN explanation problem "\emph{which fraction of the input graph is the most crucial to decide the model's decision?}" Existing explanation methods focus on supervised settings, e.g., node classification and graph classification, while explanation for unsupervised graph-level representation learning is still unexplored. The opaqueness of graph representations may lead to unexpected risks when they are deployed in high-stakes decision-making scenarios. In this paper, we advance the Information Bottleneck principle (IB) to tackle the proposed explanation problem for unsupervised graph representations, which leads to a novel principle, \textit{Unsupervised Subgraph Information Bottleneck} (USIB). We also theoretically analyze the connection between graph representations and explanatory subgraphs on the label space, which reveals that the expressiveness and robustness of representations benefit the fidelity of explanatory subgraphs. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our developed explainer and the validity of our theoretical analysis.
    Towards Consistency in Adversarial Classification. (arXiv:2205.10022v1 [cs.LG])
    In this paper, we study the problem of consistency in the context of adversarial examples. Specifically, we tackle the following question: can surrogate losses still be used as a proxy for minimizing the $0/1$ loss in the presence of an adversary that alters the inputs at test-time? Unlike in the standard classification task, this question cannot be reduced to a point-wise minimization problem, and calibration need not be sufficient to ensure consistency. In this paper, we expose some pathological behaviors specific to the adversarial problem, and show that no convex surrogate loss can be consistent or calibrated in this context. It is therefore necessary to design another class of surrogate functions that can be used to solve the adversarial consistency issue. As a first step towards designing such a class, we identify necessary and sufficient conditions for a surrogate loss to be calibrated in both the adversarial and standard settings. Finally, we give some directions for building a class of losses that could be consistent in the adversarial framework.
    Bayesian Active Learning with Fully Bayesian Gaussian Processes. (arXiv:2205.10186v1 [cs.LG])
    The bias-variance trade-off is a well-known problem in machine learning that only gets more pronounced the less data is available. In active learning, where labeled data is scarce or difficult to obtain, neglecting this trade-off can cause inefficient and non-optimal querying, leading to unnecessary data labeling. In this paper, we focus on active learning with Gaussian Processes (GPs). For the GP, the bias-variance trade-off is made by optimization of the two hyperparameters: the length scale and the noise term. Considering that the optimal mode of the joint posterior of the hyperparameters is equivalent to the optimal bias-variance trade-off, we approximate this joint posterior and utilize it to design two new acquisition functions. The first one is a Bayesian variant of Query-by-Committee (B-QBC), and the second is an extension that explicitly minimizes the predictive variance through a Query by Mixture of Gaussian Processes (QB-MGP) formulation. Across six common simulators, we empirically show that B-QBC, on average, achieves the best marginal likelihood, whereas QB-MGP achieves the best predictive performance. We show that incorporating the bias-variance trade-off in the acquisition functions mitigates unnecessary and expensive data labeling.
    SafeNet: Mitigating Data Poisoning Attacks on Private Machine Learning. (arXiv:2205.09986v1 [cs.CR])
    Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, the datasets used for training ML models might be under the control of an adversary mounting a data poisoning attack, and MPC prevents inspecting training sets to detect poisoning. We show that multiple MPC frameworks for private ML training are susceptible to backdoor and targeted poisoning attacks. To mitigate this, we propose SafeNet, a framework for building ensemble models in MPC with formal guarantees of robustness to data poisoning attacks. We extend the security definition of private ML training to account for poisoning and prove that our SafeNet design satisfies the definition. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models. For instance, SafeNet reduces backdoor attack success from 100% to 0% for a neural network model, while achieving 39x faster training and 36x less communication than the four-party MPC framework of Dalskov et al.
    Transformer with Memory Replay. (arXiv:2205.09869v1 [cs.LG])
    Transformers achieve state-of-the-art performance for natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving to and replaying from a memory buffer. It has been successfully used in reinforcement learning and GANs due to better sample efficiency. In this paper, we propose \emph{Transformer with Memory Replay} (TMR), which integrates memory replay with the transformer, making it more sample-efficient. Experiments on the GLUE and SQuAD benchmark datasets show that Transformer with Memory Replay achieves an increase of at least $1\%$ point compared to the baseline transformer model when pretrained with the same number of examples. Further, by adopting a careful design that reduces the wall-clock time overhead of memory replay, we also empirically achieve better runtime efficiency.
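    The "saving to and replaying from a memory buffer" mechanism can be sketched generically as below. This is a minimal illustration of memory replay in general, not the TMR design: the class and function names, the first-in-first-out eviction, and the uniform sampling are all assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory that stores past examples and replays random ones."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)  # oldest examples are evicted first
        self.rng = random.Random(seed)

    def save(self, example):
        self.items.append(example)

    def replay(self, k):
        # Sample without replacement; return fewer items if the buffer is small.
        k = min(k, len(self.items))
        return self.rng.sample(list(self.items), k)

def mixed_batches(stream, buffer, replay_k):
    """Yield each fresh batch extended with up to replay_k replayed examples."""
    for batch in stream:
        replayed = buffer.replay(replay_k)   # drawn before saving the new batch
        for example in batch:
            buffer.save(example)
        yield list(batch) + replayed
```

    For example, `list(mixed_batches([[1, 2], [3, 4]], ReplayBuffer(3), 2))` yields the first batch unchanged (the buffer is still empty) and a second batch that also contains the two earlier examples. A training loop would feed each mixed batch to the model in place of the fresh batch alone.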
    Diverse super-resolution with pretrained deep hierarchical VAEs. (arXiv:2205.10347v1 [cs.CV])
    Image super-resolution is a one-to-many problem, but most deep-learning-based methods provide only a single solution to this problem. In this work, we tackle the problem of diverse super-resolution by reusing VD-VAE, a state-of-the-art variational autoencoder (VAE). We find that the hierarchical latent representation learned by VD-VAE naturally separates the image low-frequency information, encoded in the latent groups at the top of the hierarchy, from the image high-frequency details, determined by the latent groups at the bottom of the latent hierarchy. Starting from this observation, we design a super-resolution model exploiting the specific structure of the VD-VAE latent space. Specifically, we train an encoder to encode low-resolution images in the subset of the VD-VAE latent space encoding the low-frequency information, and we combine this encoder with the VD-VAE generative model to sample diverse super-resolved versions of a low-resolution input. We demonstrate the ability of our method to generate diverse solutions to the super-resolution problem on face super-resolution with upsampling factors x4, x8, and x16.
    On pseudo-absence generation and machine learning for locust breeding ground prediction in Africa. (arXiv:2111.03904v2 [cs.LG] UPDATED)
    Desert locust outbreaks threaten the food security of a large part of Africa and have affected the livelihoods of millions of people over the years. Machine learning (ML) has been demonstrated as an effective approach to locust distribution modelling which could assist in early warning. ML requires a significant amount of labelled data to train. Most publicly available labelled data on locusts are presence-only data, where only the sightings of locusts being present at a location are recorded. Therefore, prior work using ML has resorted to pseudo-absence generation methods as a way to circumvent this issue. The most commonly used approach is to randomly sample points in a region of interest while ensuring that these sampled pseudo-absence points are at least a specific distance away from true presence points. In this paper, we compare this random sampling approach to more advanced pseudo-absence generation methods, such as environmental profiling and optimal background extent limitation, specifically for predicting desert locust breeding grounds in Africa. Interestingly, we find that for the algorithms we tested, namely logistic regression, gradient boosting, random forests and maximum entropy, all popular in prior work, the logistic model performed significantly better than the more sophisticated ensemble methods, both in terms of prediction accuracy and F1 score. Although background extent limitation combined with random sampling boosted performance for ensemble methods, for LR this was not the case, and instead, a significant improvement was obtained when using environmental profiling. In light of this, we conclude that a simpler ML approach such as logistic regression combined with more advanced pseudo-absence generation, specifically environmental profiling, can be a sensible and effective approach to predicting locust breeding grounds across Africa.
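    The random-sampling baseline described above (sample points in a region, keep only those at least a given distance from every true presence point) can be sketched as simple rejection sampling. This is an illustrative sketch only: the function name and parameters are hypothetical, and it uses planar Euclidean distance in a rectangular region, whereas a real study would likely use geographic coordinates and geodesic distance.

```python
import math
import random

def sample_pseudo_absences(presences, bounds, min_dist, n, seed=0, max_tries=100000):
    """Rejection-sample up to n points inside bounds = (x_min, x_max, y_min, y_max)
    that lie at least min_dist away from every presence point."""
    x_min, x_max, y_min, y_max = bounds
    rng = random.Random(seed)
    absences = []
    tries = 0
    while len(absences) < n and tries < max_tries:
        tries += 1
        candidate = (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        # Keep the candidate only if it is far enough from all presences.
        if all(math.dist(candidate, p) >= min_dist for p in presences):
            absences.append(candidate)
    return absences  # may be shorter than n if max_tries is exhausted
```

    With presences at `(0, 0)` and `(5, 5)`, bounds `(0, 10, 0, 10)` and `min_dist=2.0`, the function returns points that avoid a radius-2 disc around each sighting. The `max_tries` cap guards against an infinite loop when the exclusion discs cover most of the region.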
    Recurrent segmentation meets block models in temporal networks. (arXiv:2205.09862v1 [cs.SI])
    A popular approach to modelling interactions is to represent them as a network, with nodes being the agents and edges being the interactions. Interactions are often timestamped, which leads to timestamped edges. Many real-world temporal networks have a recurrent or possibly cyclic behaviour. For example, social network activity may be heightened during certain hours of the day. In this paper, our main interest is to model recurrent activity in such temporal networks. As a starting point we use the stochastic block model, a popular choice for modelling static networks, where nodes are split into $R$ groups. We extend this model to temporal networks by modelling the edges with a Poisson process. We make the parameters of the process dependent on time by segmenting the time line into $K$ segments. To enforce the recurring activity we require that only $H < K$ different sets of parameters can be used, that is, several, not necessarily consecutive, segments must share their parameters. We prove that searching for optimal blocks and segmentation is an NP-hard problem. Consequently, we split the problem into 3 subproblems where we optimize blocks, model parameters, and segmentation in turn while keeping the remaining structures fixed. We propose an iterative algorithm that requires $O(KHm + Rn + R^2H)$ time per iteration, where $n$ and $m$ are the number of nodes and edges in the network. We demonstrate experimentally that the number of required iterations is typically low and that the algorithm is able to discover the ground truth from synthetic datasets, and we show that certain real-world networks exhibit recurrent behaviour, as the likelihood does not deteriorate when $H$ is lowered.
    Cross Reconstruction Transformer for Self-Supervised Time Series Representation Learning. (arXiv:2205.09928v1 [cs.LG])
    Unsupervised/self-supervised representation learning in time series is critical since labeled samples are usually scarce in real-world scenarios. Existing approaches mainly leverage the contrastive learning framework, which automatically learns to understand similar and dissimilar data pairs. Nevertheless, they are limited by the prior knowledge needed to construct pairs, cumbersome sampling policies, and unstable performance when encountering sampling bias. Also, few works have focused on effectively modeling relations across the temporal and spectral domains to extend the capacity of representations. In this paper, we aim at learning representations for time series from a new perspective and propose Cross Reconstruction Transformer (CRT) to solve the aforementioned problems in a unified way. CRT achieves time series representation learning through a cross-domain dropping-reconstruction task. Specifically, we transform time series into the frequency domain and randomly drop certain parts in both time and frequency domains. Dropping can maximally preserve the global context compared to cropping and masking. Then a transformer architecture is utilized to adequately capture the cross-domain correlations between temporal and spectral information through reconstructing data in both domains, which is called Dropped Temporal-Spectral Modeling. To discriminate the representations in global latent space, we propose Instance Discrimination Constraint to reduce the mutual information between different time series and sharpen the decision boundaries. Additionally, we propose a specified curriculum learning strategy to optimize the CRT, which progressively increases the dropping ratio in the training process.
    ClusterEA: Scalable Entity Alignment with Stochastic Training and Normalized Mini-batch Similarities. (arXiv:2205.10312v1 [cs.DB])
    Entity alignment (EA) aims at finding equivalent entities in different knowledge graphs (KGs). Embedding-based approaches have dominated the EA task in recent years. Those methods face problems that come from the geometric properties of embedding vectors, including hubness and isolation. To solve these geometric problems, many normalization approaches have been adopted for EA. However, the increasing scale of KGs makes it hard for EA models to adopt the normalization processes, thus limiting their usage in real-world applications. To tackle this challenge, we present ClusterEA, a general framework that is capable of scaling up EA models and enhancing their results by leveraging normalization methods on mini-batches with a high entity equivalent rate. ClusterEA contains three components to align entities between large-scale KGs: stochastic training, ClusterSampler, and SparseFusion. It first trains a large-scale Siamese GNN for EA in a stochastic fashion to produce entity embeddings. Based on the embeddings, a novel ClusterSampler strategy is proposed for sampling highly overlapped mini-batches. Finally, ClusterEA incorporates SparseFusion, which normalizes local and global similarity and then fuses all similarity matrices to obtain the final similarity matrix. Extensive experiments with real-life datasets on EA benchmarks offer insight into the proposed framework, and suggest that it is capable of outperforming the state-of-the-art scalable EA framework by up to 8 times in terms of Hits@1.
    On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. (arXiv:2205.10287v1 [cs.LG])
    Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
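    The abstract names a square root scaling rule without spelling it out. As a minimal sketch, assuming the usual reading that the learning rate is multiplied by the square root of the batch-size ratio, the adjustment could be computed as follows. The function name is hypothetical, and any accompanying changes to other RMSprop or Adam hyperparameters are left to the paper and are not reproduced here.

```python
import math

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    """Scale a learning rate by sqrt(new_batch / base_batch).

    Assumption: this covers only the learning-rate part of the rule."""
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(new_batch / base_batch)
```

    For instance, going from batch size 256 to 1024 (a factor of 4) would double a learning rate of `1e-3` to `2e-3`, in contrast to the linear rule often used for SGD, which would quadruple it.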
    HeadText: Exploring Hands-free Text Entry using Head Gestures by Motion Sensing on a Smart Earpiece. (arXiv:2205.09978v1 [cs.HC])
    We present HeadText, a hands-free technique on a smart earpiece for text entry by motion sensing. Users input text using only 7 head gestures for key selection, word selection, word commitment and word cancelling tasks. Head gesture recognition is supported by motion sensing on a smart earpiece to capture head movement signals and by machine learning algorithms (K-Nearest-Neighbor (KNN) with a Dynamic Time Warping (DTW) distance measurement). A 10-participant user study showed that HeadText could recognize 7 head gestures at an accuracy of 94.29%. A second user study showed that HeadText could achieve a maximum text entry speed of 10.65 WPM and an average speed of 9.84 WPM. Finally, we demonstrate potential applications of HeadText in hands-free scenarios for (a) text entry for people with motor impairments, (b) private text entry, and (c) socially acceptable text entry.
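    The recognition method named above, K-Nearest-Neighbor with a Dynamic Time Warping distance, can be illustrated with a small pure-Python sketch. It is not the HeadText implementation: the function names, the one-dimensional signals, the absolute-difference local cost, and the toy "nod"/"shake" templates are assumptions chosen to keep the example short.

```python
from collections import Counter

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping between 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def knn_dtw_predict(query, templates, k=1):
    """templates: list of (sequence, label). Majority vote among the k templates
    closest to the query under DTW."""
    ranked = sorted(templates, key=lambda t: dtw_distance(query, t[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical gesture templates (e.g. one axis of a motion sensor).
templates = [([0, 1, 0, -1, 0], "nod"), ([0, -1, 0, 1, 0], "shake")]
prediction = knn_dtw_predict([0, 1, 1, 0, -1, 0], templates)
```

    DTW lets the slower query (with a repeated sample) align to the "nod" template at zero cost, which is why it suits gestures performed at varying speeds. Real motion signals are multi-axis, so the local cost would typically be a vector distance instead of an absolute difference.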
    Interpolating Compressed Parameter Subspaces. (arXiv:2205.09891v1 [cs.LG])
    Inspired by recent work on neural subspaces and mode connectivity, we revisit parameter subspace sampling for shifted and/or interpolatable input distributions (instead of a single, unshifted distribution). We enforce a compressed geometric structure upon a set of trained parameters mapped to a set of train-time distributions, denoting the resulting subspaces as Compressed Parameter Subspaces (CPS). We show the success and failure modes of the types of shifted distributions whose optimal parameters reside in the CPS. We find that ensembling point-estimates within a CPS can yield a high average accuracy across a range of test-time distributions, including backdoor, adversarial, permutation, stylization and rotation perturbations. We also find that the CPS can contain low-loss point-estimates for various task shifts (albeit interpolated, perturbed, unseen or non-identical coarse labels). We further demonstrate this property in a continual learning setting with CIFAR100.
    ExMo: Explainable AI Model using Inverse Frequency Decision Rules. (arXiv:2205.10045v1 [cs.AI])
    In this paper, we present a novel method to compute decision rules to build a more accurate interpretable machine learning model, denoted as ExMo. The ExMo interpretable machine learning model consists of a list of IF...THEN... statements with a decision rule in the condition. This way, ExMo naturally provides an explanation for a prediction using the decision rule that was triggered. ExMo uses a new approach to extract decision rules from the training data using term frequency-inverse document frequency (TF-IDF) features. With TF-IDF, decision rules with feature values that are more relevant to each class are extracted. Hence, the decision rules obtained by ExMo can distinguish the positive and negative classes better than the decision rules used in the existing Bayesian Rule List (BRL) algorithm, obtained using the frequent pattern mining approach. The paper also shows that ExMo learns a qualitatively better model than BRL. Furthermore, ExMo demonstrates that the textual explanation can be provided in a human-friendly way so that the explanation can be easily understood by non-expert users. We validate ExMo on several datasets with different sizes to evaluate its efficacy. Experimental validation on a real-world fraud detection application shows that ExMo is 20% more accurate than BRL and that it achieves accuracy similar to that of deep learning models.
    Visualizing and Explaining Language Models. (arXiv:2205.10238v1 [cs.CL])
    During the last decade, Natural Language Processing has become, after Computer Vision, the second field of Artificial Intelligence that was massively changed by the advent of Deep Learning. Regardless of the architecture, the language models of the day need to be able to process or generate text, as well as predict missing words, sentences or relations depending on the task. Due to their black-box nature, such models are difficult to interpret and explain to third parties. Visualization is often the bridge that language model designers use to explain their work, as the coloring of the salient words and phrases, clustering or neuron activations can be used to quickly understand the underlying models. This paper showcases the techniques used in some of the most popular Deep Learning for NLP visualizations, with a special focus on interpretability and explainability.
    Understanding and Mitigating the Uncertainty in Zero-Shot Translation. (arXiv:2205.10068v1 [cs.CL])
    Zero-shot translation is a promising direction for building a comprehensive multilingual neural machine translation (MNMT) system. However, its quality is still not satisfactory due to off-target issues. In this paper, we aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation. By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues, namely, extrinsic data uncertainty and intrinsic model uncertainty. Based on these observations, we propose two lightweight and complementary approaches: denoising the training data for model training, and masking out the vocabulary of the off-target languages during inference. Extensive experiments on both balanced and unbalanced datasets show that our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines. Qualitative analyses provide insights into where our approaches reduce off-target translations.
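    The second idea, masking out the vocabulary of off-target languages at inference time, can be illustrated on a single decoding step. The sketch below is generic and hypothetical (function names, token ids, and scores are made up): it sets the score of every token outside an allowed target-language vocabulary to negative infinity before picking the next token. How the paper builds the allowed vocabulary is not described in the abstract.

```python
import math

def mask_off_target(logits, allowed_ids):
    """Return a copy of logits where tokens outside allowed_ids score -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical step: token 1 belongs to an off-target language but scores highest.
step_logits = [2.0, 5.0, 1.0, 3.0]
target_vocab = {0, 3}
unmasked_choice = greedy_token(step_logits)
masked_choice = greedy_token(mask_off_target(step_logits, target_vocab))
```

    Without the mask, greedy decoding picks the off-target token 1. With it, the best remaining target-language token 3 is chosen. The same masking applies unchanged inside beam search or sampling, since a score of negative infinity gives a token zero probability after the softmax.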
    Estimating the frame potential of large-scale quantum circuit sampling using tensor networks up to 50 qubits. (arXiv:2205.09900v1 [quant-ph])
    We develop numerical protocols for estimating the frame potential, the 2-norm distance between a given ensemble and the exact Haar randomness, using the \texttt{QTensor} platform. Our tensor-network-based algorithm has polynomial complexity for shallow circuits and is high performing using CPU and GPU parallelism. We apply the above methods to two problems: the Brown-Susskind conjecture, with local and parallel random circuits in terms of the Haar distance and the approximate $k$-design properties of the hardware efficient ansätze in quantum machine learning, which induce the barren plateau problem. We estimate frame potentials with these ensembles up to 50 qubits and $k=5$, examine the Haar distance of the hardware-efficient ansätze, and verify the Brown-Susskind conjecture numerically. Our work shows that large-scale tensor network simulations could provide important hints toward open problems in quantum information science.
    FairNorm: Fair and Fast Graph Neural Network Training. (arXiv:2205.09977v1 [cs.LG])
    Graph neural networks (GNNs) have been demonstrated to achieve state-of-the-art results on a number of graph-based learning tasks, which has led to a rise in their employment in various domains. However, it has been shown that GNNs may inherit and even amplify bias within training data, which leads to unfair results towards certain sensitive groups. Meanwhile, training of GNNs introduces additional challenges, such as slow convergence and possible instability. Faced with these limitations, this work proposes FairNorm, a unified normalization framework that reduces the bias in GNN-based learning while also providing provably faster convergence. Specifically, FairNorm employs fairness-aware normalization operators over different sensitive groups with learnable parameters to reduce the bias in GNNs. The design of FairNorm is built upon analyses that illuminate the sources of bias in graph-based learning. Experiments on node classification over real-world networks demonstrate the efficiency of the proposed scheme in improving fairness in terms of statistical parity and equal opportunity compared to fairness-aware baselines. In addition, it is empirically shown that the proposed framework leads to faster convergence compared to the naive baseline where no normalization is employed.
    Swapping Semantic Contents for Mixing Images. (arXiv:2205.10158v1 [cs.CV])
    Deep architectures have proven capable of solving many tasks provided a sufficient amount of labeled data. In fact, the amount of available labeled data has become the principal bottleneck in low-label settings such as Semi-Supervised Learning. Mixing Data Augmentations do not typically yield new labeled samples, as indiscriminately mixing contents creates between-class samples. In this work, we introduce the SciMix framework, which learns a generator to embed a semantic style code into image backgrounds, yielding a new mixing scheme for data augmentation. We then demonstrate that SciMix yields novel mixed samples that inherit many characteristics from their non-semantic parents. Afterwards, we verify that those samples can be used to improve the performance of semi-supervised frameworks like Mean Teacher or Fixmatch, and even fully supervised learning on a small labeled dataset.  ( 2 min )
    On Calibration of Ensemble-Based Credal Predictors. (arXiv:2205.10082v1 [stat.ML])
    In recent years, several classification methods that intend to quantify epistemic uncertainty have been proposed, either by producing predictions in the form of second-order distributions or sets of probability distributions. In this work, we focus on the latter, also called credal predictors, and address the question of how to evaluate them: What does it mean that a credal predictor represents epistemic uncertainty in a faithful manner? To answer this question, we refer to the notion of calibration of probabilistic predictors and extend it to credal predictors. Broadly speaking, we call a credal predictor calibrated if it returns sets that cover the true conditional probability distribution. To verify this property for the important case of ensemble-based credal predictors, we propose a novel nonparametric calibration test that generalizes an existing test for probabilistic predictors to the case of credal predictors. Making use of this test, we empirically show that credal predictors based on deep neural networks are often not well calibrated.  ( 2 min )
    Trend analysis and forecasting air pollution in Rwanda. (arXiv:2205.10024v1 [stat.ML])
    Air pollution is a major public health problem worldwide, although the lack of data is a global issue for most low- and middle-income countries. Ambient air pollution in the form of fine particulate matter (PM2.5) exceeds the World Health Organization guidelines in Rwanda, with a daily average of around 42.6 micrograms per cubic meter. Monitoring and mitigation strategies require an expensive investment in equipment to collect pollution data. Low-cost sensor technology and machine learning methods have appeared as an alternative solution to get reliable information for decision making. This paper analyzes the trend of air pollution in Rwanda and proposes forecasting models suitable to data collected by a network of low-cost sensors deployed in Rwanda.  ( 2 min )
    Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome. (arXiv:2205.09906v1 [stat.ML])
    Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.  ( 2 min )
    Conformal Prediction with Temporal Quantile Adjustments. (arXiv:2205.09940v1 [stat.ML])
    We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.  ( 2 min )
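    The baseline that the abstract describes, querying a pre-specified quantile of calibration-set nonconformity scores, can be sketched as below. This is a minimal split-conformal example and not TQA itself: the absolute-residual score, the function names, and the finite-sample rank `ceil((n + 1) * (1 - alpha))` are assumptions drawn from standard split conformal prediction, not from the paper.

```python
import math


def conformal_quantile(scores, alpha):
    """Threshold on nonconformity scores for target miscoverage alpha.

    Uses the usual finite-sample rank ceil((n + 1) * (1 - alpha)).
    Returns infinity when the calibration set is too small for that rank.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(point_prediction, scores, alpha):
    """Symmetric interval around a point prediction (hypothetical helper)."""
    q = conformal_quantile(scores, alpha)
    return point_prediction - q, point_prediction + q


# Hypothetical calibration pairs (true value, model prediction).
calibration_pairs = [(3.1, 2.9), (4.0, 4.6), (5.2, 5.0), (6.1, 7.0), (7.3, 7.1)]
# Nonconformity score assumed here: absolute residual.
calibration_scores = [abs(y - y_hat) for y, y_hat in calibration_pairs]
interval = prediction_interval(5.5, calibration_scores, alpha=0.2)
```

    In this sketch the same threshold is used at every time step; the abstract's point is that TQA instead adjusts which quantile is queried at each time $t$.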
    A General Framework for quantifying Aleatoric and Epistemic uncertainty in Graph Neural Networks. (arXiv:2205.09968v1 [cs.LG])
    Graph Neural Networks (GNN) provide a powerful framework that elegantly integrates Graph theory with Machine learning for modeling and analysis of networked data. We consider the problem of quantifying the uncertainty in predictions of GNN stemming from modeling errors and measurement uncertainty. We consider aleatoric uncertainty in the form of probabilistic links and noise in the feature vectors of nodes, while epistemic uncertainty is incorporated via a probability distribution over the model parameters. We propose a unified approach to treat both sources of uncertainty in a Bayesian framework, where Assumed Density Filtering is used to quantify aleatoric uncertainty and Monte Carlo dropout captures uncertainty in model parameters. Finally, the two sources of uncertainty are aggregated to estimate the total uncertainty in predictions of a GNN. Results on real-world datasets demonstrate that the Bayesian model performs on par with a frequentist model and provides additional information about prediction uncertainty that is sensitive to uncertainties in the data and model.  ( 2 min )
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v1 [cs.LG])
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.  ( 2 min )
    Summarization as Indirect Supervision for Relation Extraction. (arXiv:2205.09837v1 [cs.CL])
    Relation extraction (RE) models have been challenged by their reliance on training data with expensive annotations. Considering that summarization tasks aim at acquiring concise expressions of synoptical information from the longer context, these tasks naturally align with the objective of RE, i.e., extracting a kind of synoptical information that describes the relation of entity mentions. We present SuRE, which converts RE into a summarization formulation. SuRE leads to more precise and resource-efficient RE based on indirect supervision from summarization tasks. To achieve this goal, we develop sentence and relation conversion techniques that essentially bridge the formulation of summarization and RE tasks. We also incorporate constraint decoding techniques with Trie scoring to further enhance summarization-based RE with robust inference. Experiments on three RE datasets demonstrate the effectiveness of SuRE in both full-dataset and low-resource settings, showing that summarization is a promising source of indirect supervision to improve RE models.  ( 2 min )
    Self-Supervised Depth Estimation with Isometric-Self-Sample-Based Learning. (arXiv:2205.10006v1 [cs.CV])
    Managing the dynamic regions in the photometric loss formulation has been a main issue for handling the self-supervised depth estimation problem. Most previous methods have alleviated this issue by removing the dynamic regions in the photometric loss formulation based on the masks estimated from another module, making it difficult to fully utilize the training images. In this paper, to handle this problem, we propose an isometric self-sample-based learning (ISSL) method to fully utilize the training images in a simple yet effective way. The proposed method provides additional supervision during training using self-generated images that comply with pure static scene assumption. Specifically, the isometric self-sample generator synthesizes self-samples for each training image by applying random rigid transformations on the estimated depth. Thus both the generated self-samples and the corresponding training image always follow the static scene assumption. We show that plugging our ISSL module into several existing models consistently improves the performance by a large margin. In addition, it also boosts the depth accuracy over different types of scene, i.e., outdoor scenes (KITTI and Make3D) and indoor scene (NYUv2), validating its high effectiveness.  ( 2 min )
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v1 [stat.ML])
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state of the art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.  ( 2 min )
    Survey on Fair Reinforcement Learning: Theory and Practice. (arXiv:2205.10032v1 [cs.LG])
    Fairness-aware learning aims at satisfying various fairness constraints in addition to the usual performance criteria via data-driven machine learning techniques. Most of the research in fairness-aware learning employs the setting of fair-supervised learning. However, many dynamic real-world applications can be better modeled using sequential decision-making problems and fair reinforcement learning provides a more suitable alternative for addressing these problems. In this article, we provide an extensive overview of fairness approaches that have been implemented via a reinforcement learning (RL) framework. We discuss various practical applications in which RL methods have been applied to achieve a fair solution with high accuracy. We further include various facets of the theory of fair reinforcement learning, organizing them into single-agent RL, multi-agent RL, long-term fairness via RL, and offline learning. Moreover, we highlight a few major issues to explore in order to advance the field of fair-RL, namely - i) correcting societal biases, ii) feasibility of group fairness or individual fairness, and iii) explainability in RL. Our work is beneficial for both researchers and practitioners as we discuss articles providing mathematical guarantees as well as articles with empirical studies on real-world problems.  ( 2 min )
    FedNoiL: A Simple Two-Level Sampling Method for Federated Learning with Noisy Labels. (arXiv:2205.10110v1 [cs.LG])
    Federated learning (FL) aims at training a global model on the server side while the training data are collected and located at the local devices. Hence, the labels in practice are usually annotated by clients of varying expertise or criteria and thus contain different amounts of noises. Local training on noisy labels can easily result in overfitting to noisy labels, which is devastating to the global model through aggregation. Although recent robust FL methods take malicious clients into account, they have not addressed local noisy labels on each device and the impact to the global model. In this paper, we develop a simple two-level sampling method "FedNoiL" that (1) selects clients for more robust global aggregation on the server; and (2) selects clean labels and correct pseudo-labels at the client end for more robust local training. The sampling probabilities are built upon clean label detection by the global model. Moreover, we investigate different schedules changing the local epochs between aggregations over the course of FL, which notably improves the communication and computation efficiency in noisy label setting. In experiments with homogeneous/heterogeneous data distributions and noise ratios, we observed that direct combinations of SOTA FL methods with SOTA noisy-label learning methods can easily fail but our method consistently achieves better and robust performance.  ( 2 min )
    Exploring Extreme Parameter Compression for Pre-trained Language Models. (arXiv:2205.10036v1 [cs.CL])
    Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated one. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency during compression. Our compressed BERT with ${1}/{7}$ parameters in Transformer layers performs on-par with, sometimes slightly better than the original BERT in GLUE benchmark. A tiny version achieves $96.7\%$ performance of BERT-base with $ {1}/{48} $ encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and $2.7 \times$ faster on inference. To show that the proposed method is orthogonal to existing compression methods like knowledge distillation, we also explore the benefit of the proposed method on a distilled BERT.  ( 2 min )
    MiDAS: Multi-integrated Domain Adaptive Supervision for Fake News Detection. (arXiv:2205.09817v1 [cs.LG])
    COVID-19 related misinformation and fake news, coined an 'infodemic', have dramatically increased over the past few years. This misinformation exhibits concept drift, where the distribution of fake news changes over time, reducing the effectiveness of previously trained models for fake news detection. Given a set of fake news models trained on multiple domains, we propose an adaptive decision module to select the best-fit model for a new sample. We propose MiDAS, a multi-domain adaptive approach for fake news detection that ranks the relevancy of existing models to new samples. MiDAS contains 2 components: a domain-invariant encoder and an adaptive model selector. MiDAS integrates multiple pre-trained and fine-tuned models with their training data to create a domain-invariant representation. Then, MiDAS uses local Lipschitz smoothness of the invariant embedding space to estimate each model's relevance to a new sample. Higher ranked models provide predictions, and lower ranked models abstain. We evaluate MiDAS on generalization to drifted data with 9 fake news datasets, each obtained from different domains and modalities. MiDAS achieves new state-of-the-art performance on multi-domain adaptation for out-of-distribution fake news classification.  ( 2 min )
    Neural Additive Models for Nowcasting. (arXiv:2205.10020v1 [cs.LG])
    Deep neural networks (DNNs) are one of the most highlighted methods in machine learning. However, as DNNs are black-box models, they lack explanatory power for their predictions. Recently, neural additive models (NAMs) have been proposed to provide this power while maintaining high prediction performance. In this paper, we propose a novel NAM approach for multivariate nowcasting (NC) problems, which comprise an important focus area of machine learning. For the multivariate time-series data used in NC problems, explanations should be considered for every input value to the variables at distinguishable time steps. By employing generalized additive models, the proposed NAM-NC successfully explains each input value's importance for multiple variables and time steps. Experimental results involving a toy example and two real-world datasets show that the NAM-NC predicts multivariate time-series data as accurately as state-of-the-art neural networks, while also providing the explanatory importance of each input value. We also examine parameter-sharing networks using NAM-NC to decrease their complexity, and find that NAM-NC's hard-tied feature net extracts explanations with good performance.  ( 2 min )
    Service Delay Minimization for Federated Learning over Mobile Devices. (arXiv:2205.09868v1 [cs.LG])
    Federated learning (FL) over mobile devices has fostered numerous intriguing applications/services, many of which are delay-sensitive. In this paper, we propose a service delay efficient FL (SDEFL) scheme over mobile devices. Unlike traditional communication efficient FL, which regards wireless communications as the bottleneck, we find that in many situations, the local computing delay is comparable to the communication delay during the FL training process, given the development of high-speed wireless transmission techniques. Thus, the service delay in FL should be computing delay + communication delay over training rounds. To minimize the service delay of FL, simply reducing local computing/communication delay independently is not enough. The delay trade-off between local computing and wireless communications must be considered. In addition, we empirically study the impacts of local computing control and compression strategies (i.e., the number of local updates, weight quantization, and gradient quantization) on computing, communication and service delays. Based on these trade-off observations and empirical studies, we develop an optimization scheme to minimize the service delay of FL over heterogeneous devices. We establish testbeds and conduct extensive emulations/experiments to verify our theoretical analysis. The results show that SDEFL notably reduces service delay with a small accuracy drop compared to peer designs.  ( 2 min )
    Residual Dynamic Mode Decomposition: Robust and verified Koopmanism. (arXiv:2205.09779v1 [physics.flu-dyn])
    Dynamic Mode Decomposition (DMD) describes complex dynamic processes through a hierarchy of simpler coherent features. DMD is regularly used to understand the fundamental characteristics of turbulence and is closely related to Koopman operators. However, verifying the decomposition, equivalently the computed spectral features of Koopman operators, remains a major challenge due to the infinite-dimensional nature of Koopman operators. Challenges include spurious (unphysical) modes, and dealing with continuous spectra, both of which occur regularly in turbulent flows. Residual Dynamic Mode Decomposition (ResDMD), introduced by (Colbrook & Townsend 2021), overcomes some of these challenges through the data-driven computation of residuals associated with the full infinite-dimensional Koopman operator. ResDMD computes spectra and pseudospectra of general Koopman operators with error control, and computes smoothed approximations of spectral measures (including continuous spectra) with explicit high-order convergence theorems. ResDMD thus provides robust and verified Koopmanism. We implement ResDMD and demonstrate its application in a variety of fluid dynamic situations, at varying Reynolds numbers, arising from both numerical and experimental data. Examples include: vortex shedding behind a cylinder; hot-wire data acquired in a turbulent boundary layer; particle image velocimetry data focusing on a wall-jet flow; and acoustic pressure signals of laser-induced plasma. We present some advantages of ResDMD, namely, the ability to verifiably resolve non-linear, transient modes, and spectral calculation with reduced broadening effects. We also discuss how a new modal ordering based on residuals enables greater accuracy with a smaller dictionary than the traditional modulus ordering. This paves the way for greater dynamic compression of large datasets without sacrificing accuracy.  ( 2 min )
    Why GANs are overkill for NLP. (arXiv:2205.09838v1 [cs.LG])
    This work offers a novel theoretical perspective on why, despite numerous attempts, adversarial approaches to generative modeling (e.g., GANs) have not been as popular for certain generation tasks, particularly sequential tasks such as Natural Language Generation, as they have in others, such as Computer Vision. In particular, on sequential data such as text, maximum-likelihood approaches are significantly more utilized than GANs. We show that, while it may seem that maximizing likelihood is inherently different than minimizing distinguishability, this distinction is largely artificial and only holds for limited models. We argue that minimizing KL-divergence (i.e., maximizing likelihood) is a more efficient approach to effectively minimizing the same distinguishability criteria that adversarial models seek to optimize. Reductions show that minimizing distinguishability can be seen as simply boosting likelihood for certain families of models including n-gram models and neural networks with a softmax output layer. To achieve a full polynomial-time reduction, a novel next-token distinguishability model is considered.  ( 2 min )
    Discrete-Convex-Analysis-Based Framework for Warm-Starting Algorithms with Predictions. (arXiv:2205.09961v1 [cs.LG])
    Augmenting algorithms with learned predictions is a promising approach for going beyond worst-case bounds. Dinitz, Im, Lavastida, Moseley, and Vassilvitskii~(2021) have demonstrated that a warm start with learned dual solutions can improve the time complexity of the Hungarian method for weighted perfect bipartite matching. We extend and improve their framework in a principled manner via \textit{discrete convex analysis} (DCA), a discrete analog of convex analysis. We show the usefulness of our DCA-based framework by applying it to weighted perfect bipartite matching, weighted matroid intersection, and discrete energy minimization for computer vision. Our DCA-based framework yields time complexity bounds that depend on the $\ell_\infty$-distance from a predicted solution to an optimal solution, which has two advantages relative to the previous $\ell_1$-distance-dependent bounds: time complexity bounds are smaller, and learning of predictions is more sample efficient. We also discuss whether to learn primal or dual solutions from the DCA perspective.  ( 2 min )
    A Learning-Based Approach to Approximate Coded Computation. (arXiv:2205.09818v1 [cs.IT])
    Lagrange coded computation (LCC) is essential to solving problems about matrix polynomials in a coded distributed fashion; nevertheless, it can only solve the problems that are representable as matrix polynomials. In this paper, we propose AICC, an AI-aided learning approach that is inspired by LCC but also uses deep neural networks (DNNs). It is appropriate for coded computation of more general functions. Numerical simulations demonstrate the suitability of the proposed approach for the coded computation of different matrix functions that are often utilized in digital signal processing.  ( 2 min )
    Calibration Matters: Tackling Maximization Bias in Large-scale Advertising Recommendation Systems. (arXiv:2205.09809v1 [cs.LG])
    Calibration is defined as the ratio of the average predicted click rate to the true click rate. The optimization of calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ads auctions and the amount of money charged to advertisers. Despite its importance, calibration optimization often suffers from a problem called "maximization bias". Maximization bias refers to the phenomenon that the maximum of predicted values overestimates the true maximum. The problem is introduced because the calibration is computed on the set selected by the prediction model itself. It persists even if unbiased predictions can be achieved on every datapoint and worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we theorize the quantification of maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm in this paper. The algorithm is efficient, robust, and practical as it is able to mitigate maximization bias problems under covariate shifts, neither incurring additional online serving costs nor compromising the ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.  ( 2 min )
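    The selection effect described in this abstract can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not the paper's VAD algorithm: the uniform true click rates, the zero-mean Gaussian prediction noise, the top-10% selection rule, and all function names are hypothetical, and true rates stand in for observed clicks.

```python
import random


def calibration(predicted, actual):
    """Ratio of average predicted rate to average true rate."""
    return sum(predicted) / sum(actual)


def simulate(n=20000, top_fraction=0.1, noise_sd=0.01, seed=0):
    """Compare calibration on all items with calibration on the
    items that the model itself ranks highest."""
    rng = random.Random(seed)
    # Assumed ground truth: click rates spread uniformly over 1% to 5%.
    true_rates = [rng.uniform(0.01, 0.05) for _ in range(n)]
    # Predictions are unbiased on every item: truth plus zero-mean noise.
    predictions = [t + rng.gauss(0.0, noise_sd) for t in true_rates]

    overall = calibration(predictions, true_rates)

    # Select the top fraction by predicted value, as a ranking system would.
    ranked = sorted(range(n), key=lambda i: predictions[i], reverse=True)
    chosen = ranked[: int(n * top_fraction)]
    selected = calibration(
        [predictions[i] for i in chosen],
        [true_rates[i] for i in chosen],
    )
    return overall, selected


overall, selected = simulate()
print(f"calibration on all items:      {overall:.3f}")
print(f"calibration on selected items: {selected:.3f}")
```

    Under these assumptions the overall ratio stays close to 1 while the ratio on the selected set is above 1, because items whose noise happened to be positive are more likely to be selected.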
    Estimation of Entropy in Constant Space with Improved Sample Complexity. (arXiv:2205.09804v1 [cs.DS])
    Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution $\mathcal D$ over an alphabet of size $k$ up to $\pm\epsilon$ additive error by streaming over $(k/\epsilon^3) \cdot \text{polylog}(1/\epsilon)$ i.i.d. samples and using only $O(1)$ words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to $(k/\epsilon^2)\cdot \text{polylog}(1/\epsilon)$. We conjecture that this is optimal up to $\text{polylog}(1/\epsilon)$ factors.  ( 2 min )
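    As a rough reading of the two bounds quoted above, and assuming the $\text{polylog}(1/\epsilon)$ factors are comparable, the improvement in sample complexity is a factor of about $1/\epsilon$:

    \[
    \frac{k/\epsilon^{3}}{k/\epsilon^{2}} = \frac{1}{\epsilon},
    \qquad \text{e.g. } \epsilon = 0.1 \;\Rightarrow\; \text{roughly } 10\times \text{ fewer samples.}
    \]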
    Causal Discovery and Injection for Feed-Forward Neural Networks. (arXiv:2205.09787v1 [cs.LG])
    Neural networks have proven to be effective at solving a wide range of problems but it is often unclear whether they learn any meaningful causal relationship: this poses a problem for the robustness of neural network models and their use for high-stakes decisions. We propose a novel method overcoming this issue by injecting knowledge in the form of (possibly partial) causal graphs into feed-forward neural networks, so that the learnt model is guaranteed to conform to the graph, hence adhering to expert knowledge. This knowledge may be given up-front or during the learning process, to improve the model through human-AI collaboration. We apply our method to synthetic and real (tabular) data showing that it is robust against noise and can improve causal discovery and prediction performance in low data regimes.  ( 2 min )
    Label-invariant Augmentation for Semi-Supervised Graph Classification. (arXiv:2205.09802v1 [cs.CV])
    Recently, contrastiveness-based augmentation has reached a new peak in the computer vision domain, where operations such as rotation, crop, and flip, combined with dedicated algorithms, dramatically increase model generalization and robustness. Following this trend, some pioneering attempts apply a similar idea to graph data. Nevertheless, unlike images, it is much more difficult to design reasonable augmentations without changing the nature of graphs. Although exciting, current graph contrastive learning does not achieve performance as promising as visual contrastive learning. We conjecture that the current performance of graph contrastive learning might be limited by the violation of the label-invariant augmentation assumption. In light of this, we propose a label-invariant augmentation for graph-structured data to address this challenge. Different from node/edge modification and subgraph extraction, we conduct the augmentation in the representation space and generate the augmented samples in the most difficult direction while keeping the label of augmented data the same as the original samples. In the semi-supervised scenario, we demonstrate that our proposed method outperforms the classical graph neural network based methods and recent graph contrastive learning on eight benchmark graph-structured datasets, followed by several in-depth experiments to further explore the label-invariant augmentation in several aspects.  ( 2 min )
    Graph Neural Networks Are More Powerful Than we Think. (arXiv:2205.09801v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.  ( 2 min )
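    A small example may help illustrate the gap between the WL test with all-ones input and spectral information. The sketch below is not the paper's architecture: it runs plain 1-WL color refinement on two hypothetical graphs (a 6-cycle and two disjoint triangles) and compares closed-walk counts, using the fact that the number of closed walks of length $k$ equals $\mathrm{tr}(A^k)$, the sum of the $k$-th powers of the adjacency eigenvalues. All function names are assumptions.

```python
from collections import Counter


def wl_histogram(adj, rounds=3):
    """1-WL color refinement starting from identical colors (all-ones input).

    Each round, a node's new color is its old color together with the
    sorted multiset of its neighbors' colors. Returns the color histogram.
    """
    colors = {v: 1 for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


def closed_walks(adj, k):
    """Number of closed walks of length k, which equals trace(A^k)."""
    total = 0
    for start in adj:
        counts = {v: 0 for v in adj}
        counts[start] = 1
        for _ in range(k):
            step = {v: 0 for v in adj}
            for v, c in counts.items():
                for u in adj[v]:
                    step[u] += c
            counts = step
        total += counts[start]
    return total


# Two non-isomorphic 2-regular graphs on six nodes.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

same_wl_colors = wl_histogram(cycle6) == wl_histogram(two_triangles)
walks_cycle = closed_walks(cycle6, 3)
walks_triangles = closed_walks(two_triangles, 3)
```

    Here the WL histograms coincide, so the test cannot separate the two graphs, while the closed-walk counts of length 3 differ, which means the adjacency spectra differ in at least one eigenvalue.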
    HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding. (arXiv:2205.09753v1 [cs.AI])
    One essential task for autonomous driving is to encode the information of a driving scene into vector representations so that downstream tasks such as trajectory prediction can perform well. The driving scene is complicated, and there exists heterogeneity among elements, which carry diverse types of information, e.g., agent dynamics, map routing, road lines, etc. Meanwhile, there also exists relativity across elements, meaning they have spatial relations with each other; such relations should be canonically represented by relative measurements since the absolute value of the coordinate is meaningless. Taking these two observations into consideration, we propose a novel backbone, namely Heterogeneous Driving Graph Transformer (HDGT), which models the driving scene as a heterogeneous graph with different types of nodes and edges. For graph construction, each node represents either an agent or a road element and each edge represents their semantic relations such as Pedestrian-To-Crosswalk and Lane-To-Left-Lane. As for spatial relation encoding, instead of setting a fixed global reference, the coordinate information of the node as well as its in-edges is transformed to the local node-centric coordinate system. For the aggregation module in the graph neural network (GNN), we adopt the transformer structure in a hierarchical way to fit the heterogeneous nature of inputs. Experimental results show that the proposed method achieves new state-of-the-art results on the INTERACTION Prediction Challenge and the Waymo Open Motion Challenge, in which we rank 1st and 2nd respectively regarding the minADE/minFDE metric.  ( 2 min )
    Long-Range Transformers for Dynamic Spatiotemporal Forecasting. (arXiv:2109.12218v2 [cs.LG] UPDATED)
    Multivariate Time Series Forecasting focuses on the prediction of future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. This paper addresses these problems by translating multivariate forecasting into a spatiotemporal sequence formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning fully-connected spatiotemporal relationships purely from data.
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v2 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    Trend analysis and forecasting air pollution in Rwanda. (arXiv:2205.10024v1 [stat.ML])
    Air pollution is a major public health problem worldwide, although most low- and middle-income countries lack data on it. Ambient air pollution in the form of fine particulate matter (PM2.5) exceeds the World Health Organization guidelines in Rwanda, with a daily average of around 42.6 micrograms per cubic meter. Monitoring and mitigation strategies require an expensive investment in equipment to collect pollution data. Low-cost sensor technology and machine learning methods have emerged as an alternative solution to get reliable information for decision making. This paper analyzes the trend of air pollution in Rwanda and proposes forecasting models suited to data collected by a network of low-cost sensors deployed in Rwanda.
    An alternative proof of the vulnerability of retrieval in high intrinsic dimensionality neighborhood. (arXiv:2010.00990v2 [cs.LG] UPDATED)
    This paper investigates the vulnerability of nearest neighbor search, which is a pivotal tool in data analysis and machine learning. The vulnerability is gauged as the relative amount of perturbation that an attacker needs to add to a dataset point in order to modify its neighbor rank w.r.t. a query. The statistical distribution of this quantity is derived from simple assumptions. Experiments on six large-scale datasets validate this model up to some outliers, which are explained in terms of violations of the assumptions.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v3 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift, where the training and the test distributions are different. A suite of approaches, such as importance weighting and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to those obtained by ERM. We also show that adding a small amount of regularization that does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Robust Expected Information Gain for Optimal Bayesian Experimental Design Using Ambiguity Sets. (arXiv:2205.09914v1 [stat.ML])
    The ranking of experiments by expected information gain (EIG) in Bayesian experimental design is sensitive to changes in the model's prior distribution, and the approximation of EIG yielded by sampling will have errors similar to the use of a perturbed prior. We define and analyze robust expected information gain (REIG), a modification of the objective in EIG maximization by minimizing an affine relaxation of EIG over an ambiguity set of distributions that are close to the original prior in KL-divergence. We show that, when combined with a sampling-based approach to estimating EIG, REIG corresponds to a "log-sum-exp" stabilization of the samples used to estimate EIG, meaning that it can be efficiently implemented in practice. Numerical tests combining REIG with variational nested Monte Carlo (VNMC), adaptive contrastive estimation (ACE) and mutual information neural estimation (MINE) suggest that in practice REIG also compensates for the variability of under-sampled estimators.
    Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence. (arXiv:2204.11689v1 [physics.plasm-ph] CROSS LISTED)
    We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model is found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with an observed broadening in the distribution of turbulent field amplitudes and increased $\mathbf{E} \times \mathbf{B}$ shearing rates.
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v1 [cs.LG])
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple post hoc method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    A Case of Exponential Convergence Rates for SVM. (arXiv:2205.10055v1 [stat.ML])
    Classification is often the first problem described in introductory machine learning classes. Generalization guarantees of classification have historically been offered by Vapnik-Chervonenkis theory. Yet those guarantees are based on intractable algorithms, which has led to the theory of surrogate methods in classification. Guarantees offered by surrogate methods are based on calibration inequalities, which have been shown to be highly sub-optimal under some margin conditions, falling short of capturing exponential convergence phenomena. Those "super" fast rates are becoming well understood for smooth surrogates, but the picture remains blurry for non-smooth losses such as the hinge loss, associated with the renowned support vector machines. In this paper, we present a simple mechanism to obtain fast convergence rates and we investigate its usage for SVM. In particular, we show that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
    Triangulation candidates for Bayesian optimization. (arXiv:2112.07457v2 [stat.CO] UPDATED)
    Bayesian optimization involves "inner optimization" over a new-data acquisition criterion which is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart local numerical optimizers. In such cases it is common to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. We detail the construction of these "tricands" and demonstrate empirically how they outperform both numerically optimized acquisitions and random candidate-based alternatives, and are well-suited for hybrid schemes, on benchmark synthetic and real simulation experiments.
    Speeding up PCA with priming. (arXiv:2109.03709v3 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we prime pPCA by several approximate algorithms and report an average speedup by a factor of 7.2 over Oja's rule, and a factor of 10.5 over EigenGame.
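The two-step idea in the abstract above (a rough estimate first, then an exact PCA restricted to the subspace that estimate spans) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `primed_pca`, the synthetic data, and the use of two rounds of block power iteration as the priming step are all hypothetical choices made here.

```python
import numpy as np

def primed_pca(X, Q_primed):
    """Refine a rough PCA estimate by an exact PCA inside the subspace it spans.

    X        : (n, d) centered data matrix.
    Q_primed : (d, k) rough estimate of k principal directions (the "priming").
    Returns (components, variances), sorted by decreasing variance.
    """
    Q, _ = np.linalg.qr(Q_primed)             # orthonormal basis of the primed subspace
    Z = X @ Q                                 # data expressed in that k-dimensional subspace
    cov_small = Z.T @ Z / X.shape[0]          # k x k covariance, cheap when k << d
    evals, evecs = np.linalg.eigh(cov_small)  # exact eigendecomposition (ascending order)
    order = np.argsort(evals)[::-1]           # reorder to descending variance
    return Q @ evecs[:, order], evals[order]

# Synthetic data with decaying per-coordinate scales (an assumption for the demo).
rng = np.random.default_rng(0)
scales = np.linspace(3.0, 0.5, 20)
X = rng.standard_normal((500, 20)) * scales
X -= X.mean(axis=0)

# Hypothetical priming step: two rounds of block power iteration from a random start.
Q0 = rng.standard_normal((20, 3))
for _ in range(2):
    Q0, _ = np.linalg.qr(X.T @ (X @ Q0))

components, variances = primed_pca(X, Q0)
primed_var_first = np.mean((X @ Q0[:, 0]) ** 2)  # variance along the first primed direction
```

Because the second step solves the eigenproblem exactly within the primed subspace, the leading refined direction captures at least as much variance as any single primed direction; the spanned subspace itself is unchanged in this sketch.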
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v3 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information on consumers, effective personalization of offers, goods, and services has become a core business focus for companies to improve revenues and maintain a competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with an unknown data collection procedure. Importantly, in many business and medical settings, interpretability of a policy is essential. To address these challenges, we study the class of policies with linear decision boundaries and propose learning algorithms using tools from causal inference. We propose several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that an adapted Bayesian optimization algorithm is fast and effective. We test our algorithm with extensive simulation studies and apply it to an online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2% in net sales revenue, while also providing informative insights on which features are important for the decision-making process, e.g. when "Attribute 2" is large, the marginal increase in GMV is low for discounts higher than 10%. Our findings suggest that the proposed policy learning algorithm provides a promising practical approach for interpretable personalization across a wide range of applications.
    Algorithms for Weak Optimal Transport with an Application to Economics. (arXiv:2205.09825v1 [stat.ML])
    The theory of weak optimal transport (WOT), introduced by [Gozlan et al., 2017], generalizes the classic Monge-Kantorovich framework by allowing the transport cost between one point and the points it is matched with to be nonlinear. In the so-called barycentric version of WOT, the cost for transporting a point $x$ only depends on $x$ and on the barycenter of the points it is matched with. This aggregation property of WOT is appealing in machine learning, economics and finance. Yet algorithms to compute WOT have only been developed for the special case of quadratic barycentric WOT, or depend on neural networks with no guarantee on the computed value and matching. The main difficulty lies in the transportation constraints, which are costly to project onto. In this paper, we propose to use mirror descent algorithms to solve the primal and dual versions of the WOT problem. We also apply our algorithms to the variant of WOT introduced by [Choné et al., 2022] where mass is distributed from one space to another through unnormalized kernels (WOTUK). We empirically compare the solutions of WOT and WOTUK with classical OT. We apply our numerical methods to the economic framework of [Choné and Kramarz, 2021], namely the matching between workers and firms on labor markets.
    Invariance principle of random projection for the norm. (arXiv:2112.00300v2 [math.PR] UPDATED)
    The Johnson-Lindenstrauss lemma guarantees that certain topological structure is preserved under random projections when high-dimensional deterministic vectors are projected to low-dimensional vectors. In this work, we try to understand how random matrices affect the norms of random vectors. In particular, we prove that the distribution of the norm of a random vector $X \in \mathbb{R}^n$, whose entries are i.i.d. random variables, is preserved by a random projection $S:\mathbb{R}^n \to \mathbb{R}^m$. More precisely, \[ \frac{X^TS^TSX - mn}{\sqrt{\sigma^2 m^2n+2mn^2}} \xrightarrow[\quad m/n\to 0 \quad ]{ m,n\to \infty } \mathcal{N}(0,1) \] We also prove a concentration result for the random norm transformed by either a random projection or a random embedding. Overall, our results show that random matrices induce low distortion in the norm of random vectors with i.i.d. entries.
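As a quick sanity check on the centering term $mn$ in the displayed limit, the quantity $X^TS^TSX$ can be simulated directly. The sketch below assumes that both $S$ and $X$ have i.i.d. standard Gaussian entries; the abstract does not state the scaling of $S$ or the entry distribution of $X$, so these choices (and the sizes `n`, `m`, `trials`) are assumptions made here for illustration. Under them, the expectation of $X^TS^TSX$ equals $mn$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 200, 20, 1000          # ambient dim, projected dim, Monte Carlo repetitions

X = rng.standard_normal((trials, n))      # one random vector X per trial (assumed Gaussian)
S = rng.standard_normal((trials, m, n))   # one random projection S per trial (assumed Gaussian)
SX = np.einsum("tmn,tn->tm", S, X)        # S @ X for every trial
sq_norms = np.sum(SX ** 2, axis=1)        # X^T S^T S X for every trial

empirical_mean = sq_norms.mean()
centering = m * n                          # centering term in the displayed limit
```

Comparing `empirical_mean` with `centering` only probes the first moment; the sketch makes no claim about the normalizing variance or the rate of convergence to the Gaussian limit.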
    Nonlinear Initialization Methods for Low-Rank Neural Networks. (arXiv:2202.00834v3 [cs.LG] UPDATED)
    We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. The most successful existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is inspired by the insight that approximating the function corresponding to each layer is more important than approximating the parameter values. We provably demonstrate that there is a significant gap between these two approaches for ReLU networks, particularly as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter point holds when the network width is super-linear in dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem for fixed parameter rank $r$ -- previously, it was unknown that the problem was computationally tractable to solve even for rank $1$. We also provide a practical algorithm to solve this problem which is no more expensive than the existing spectral initialization approach, and validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
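The spectral initialization baseline described in the abstract above (sample a full-rank initialization, then take its best rank-$r$ Frobenius approximation via SVD and split it into two factors) can be sketched as follows. This illustrates the baseline only, not the paper's proposed function-approximation method. The function name `spectral_init`, the Gaussian fan-in scaling, and the choice to split the singular values evenly between the two factors are assumptions made here.

```python
import numpy as np

def spectral_init(fan_out, fan_in, rank, rng):
    """Rank-`rank` factor pair (A, B) approximating a sampled full-rank init W."""
    # Assumed full-rank initialization: Gaussian with 1/sqrt(fan_in) scaling.
    W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:rank])          # split each singular value evenly (an assumption)
    A = U[:, :rank] * root            # (fan_out, rank)
    B = root[:, None] * Vt[:rank]     # (rank, fan_in)
    return W, A, B, s

rng = np.random.default_rng(0)
W, A, B, s = spectral_init(64, 32, rank=4, rng=rng)

frob_error = np.linalg.norm(W - A @ B)        # error of the low-rank pair
optimal_error = np.sqrt(np.sum(s[4:] ** 2))   # Eckart-Young optimum for rank 4
```

By the Eckart-Young theorem, the truncated SVD is the best rank-$r$ approximation in the Frobenius norm, so `frob_error` matches `optimal_error` up to floating-point error.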
    Semi-self-supervised Automated ICD Coding. (arXiv:2205.10088v1 [cs.CL])
    Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a consultation with a patient. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect which is diminished when clinical features from the examination of the patient and diagnostics are made available. We recommend our method for augmenting scarce datasets for systems that make decisions based on clinical features that do not include examinations or tests.
    A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-Fit Covariance and Beyond. (arXiv:2205.10198v1 [math.ST])
    Estimation of the average treatment effect (ATE) is a central problem in causal inference. In recent times, inference for the ATE in the presence of high-dimensional covariates has been extensively studied. Among the diverse approaches that have been proposed, augmented inverse probability weighting (AIPW) with cross-fitting has emerged as a popular choice in practice. In this work, we study this cross-fit AIPW estimator under well-specified outcome regression and propensity score models in a high-dimensional regime where the number of features and samples are both large and comparable. Under assumptions on the covariate distribution, we establish a new CLT for the suitably scaled cross-fit AIPW that applies without any sparsity assumptions on the underlying high-dimensional parameters. Our CLT uncovers two crucial phenomena among others: (i) the AIPW exhibits a substantial variance inflation that can be precisely quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the pre-cross-fit estimates is non-negligible even on the root-n scale. In fact, these cross-covariances turn out to be negative in our setting. These findings are strikingly different from their classical counterparts. On the technical front, our work utilizes a novel interplay between three distinct tools--approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. We believe our proof techniques should be useful for analyzing other two-stage estimators in this high-dimensional regime. Finally, we complement our theoretical results with simulations that demonstrate both the finite sample efficacy of our CLT and its robustness to our assumptions.
    Instagram Fake and Automated Account Detection. (arXiv:1910.03090v3 [cs.IR] UPDATED)
    Fake engagement, which is used to increase the popularity of an account in an inorganic manner, is one of the significant problems in Online Social Networks (OSNs). The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, wrong product prediction systems, and an unhealthy social network environment. This study addresses the detection of fake and automated accounts, which lead to fake engagement on Instagram. Prior to this work, there was no publicly available dataset for fake and automated accounts. For this purpose, two datasets have been published for the detection of fake and automated accounts. For the detection of these accounts, machine learning algorithms like Naive Bayes, Logistic Regression, Support Vector Machines and Neural Networks are applied. Additionally, for the detection of automated accounts, a cost-sensitive genetic algorithm is proposed to handle the unnatural bias in the dataset. To deal with the unevenness problem in the fake dataset, the SMOTE-NC algorithm is implemented. For the automated and fake account detection datasets, 86% and 96% classification accuracies are obtained, respectively.
    Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning. (arXiv:2203.13319v2 [cs.LG] UPDATED)
    We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). ReF-ER was shown to outperform state-of-the-art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and we ignore the effects of other actions on the transition map. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state-of-the-art algorithms that rely on complex neural network architectures.
    On Calibration of Ensemble-Based Credal Predictors. (arXiv:2205.10082v1 [stat.ML])
    In recent years, several classification methods that intend to quantify epistemic uncertainty have been proposed, either by producing predictions in the form of second-order distributions or sets of probability distributions. In this work, we focus on the latter, also called credal predictors, and address the question of how to evaluate them: What does it mean that a credal predictor represents epistemic uncertainty in a faithful manner? To answer this question, we refer to the notion of calibration of probabilistic predictors and extend it to credal predictors. Broadly speaking, we call a credal predictor calibrated if it returns sets that cover the true conditional probability distribution. To verify this property for the important case of ensemble-based credal predictors, we propose a novel nonparametric calibration test that generalizes an existing test for probabilistic predictors to the case of credal predictors. Making use of this test, we empirically show that credal predictors based on deep neural networks are often not well calibrated.
    Why GANs are overkill for NLP. (arXiv:2205.09838v1 [cs.LG])
    This work offers a novel theoretical perspective on why, despite numerous attempts, adversarial approaches to generative modeling (e.g., GANs) have not been as popular for certain generation tasks, particularly sequential tasks such as Natural Language Generation, as they have in others, such as Computer Vision. In particular, on sequential data such as text, maximum-likelihood approaches are significantly more utilized than GANs. We show that, while it may seem that maximizing likelihood is inherently different than minimizing distinguishability, this distinction is largely artificial and only holds for limited models. We argue that minimizing KL-divergence (i.e., maximizing likelihood) is a more efficient approach to effectively minimizing the same distinguishability criteria that adversarial models seek to optimize. Reductions show that minimizing distinguishability can be seen as simply boosting likelihood for certain families of models including n-gram models and neural networks with a softmax output layer. To achieve a full polynomial-time reduction, a novel next-token distinguishability model is considered.
    Conformal Prediction with Temporal Quantile Adjustments. (arXiv:2205.09940v1 [stat.ML])
    We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.
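The abstract above notes that CP methods typically query a pre-specified quantile of the nonconformity scores on a calibration set. A minimal sketch of that baseline step, in the usual split-conformal form, is given below; it is not TQA itself, which adjusts the queried quantile at each time $t$. The function names, the use of absolute residuals as nonconformity scores, and the example numbers are assumptions made here for illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; returns infinity
    when the calibration set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def prediction_interval(point_prediction, scores, alpha):
    """Symmetric interval, assuming scores are absolute residuals |y - y_hat|."""
    q = conformal_quantile(scores, alpha)
    return point_prediction - q, point_prediction + q

# Hypothetical absolute residuals from a held-out calibration set.
calibration_scores = [0.3, 1.2, 0.7, 0.1, 2.5, 0.9, 0.4, 1.8, 0.6, 1.1]
lower, upper = prediction_interval(5.0, calibration_scores, alpha=0.1)
```

In this baseline the level `alpha` is fixed in advance; a time-adaptive scheme in the spirit of the abstract would pass a different level to `conformal_quantile` at each time step.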
    Bounding the Effects of Continuous Treatments for Hidden Confounders. (arXiv:2204.11206v2 [stat.ME] UPDATED)
    Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with simulations and three real-world observational studies that our approach can give non-trivial bounds on causal effects from continuous treatments in the presence of hidden confounders.
    Almost exact recovery in noisy semi-supervised learning. (arXiv:2007.14717v3 [cs.LG] UPDATED)
    Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the Maximum A Posteriori (MAP) estimator for clustering a Degree Corrected Stochastic Block Model (DC-SBM) when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
    Bootstrapping the error of Oja's algorithm. (arXiv:2106.14857v2 [math.ST] UPDATED)
    We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution. By combining classical tools from the U-statistics literature with recent results on high-dimensional central limit theorems for quadratic forms of random vectors and concentration of matrix products, we establish a weighted $\chi^2$ approximation result for the $\sin^2$ error between the population eigenvector and the output of Oja's algorithm. Since estimating the covariance matrix associated with the approximating distribution requires knowledge of unknown model parameters, we propose a multiplier bootstrap algorithm that may be updated in an online manner. We establish conditions under which the bootstrap distribution is close to the corresponding sampling distribution with high probability, thereby establishing the bootstrap as a consistent inferential method in an appropriate asymptotic regime.
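For readers unfamiliar with the estimator whose error is being bootstrapped, the standard Oja update for the leading eigenvector and the $\sin^2$ error mentioned in the abstract can be sketched as follows. This shows only the streaming estimator, not the paper's multiplier bootstrap. The step-size schedule, the synthetic diagonal covariance, and the function name are assumptions made here.

```python
import numpy as np

def oja_leading_eigenvector(stream, dim, rng, step=lambda t: 1.0 / (t + 10)):
    """One pass of Oja's rule over `stream` (an iterable of length-`dim` vectors)."""
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream):
        w += step(t) * x * (x @ w)   # stochastic gradient step on the Rayleigh quotient
        w /= np.linalg.norm(w)       # project back to the unit sphere
    return w

# Synthetic IID data: covariance diag(10, 1, ..., 1), so the leading
# population eigenvector is the first coordinate axis.
rng = np.random.default_rng(0)
dim = 10
stds = np.ones(dim)
stds[0] = np.sqrt(10.0)
samples = rng.standard_normal((5000, dim)) * stds

w_hat = oja_leading_eigenvector(samples, dim, rng)
sin2_error = 1.0 - w_hat[0] ** 2     # sin^2 of the angle between w_hat and e_1
```

Each update touches one sample and needs only O(dim) memory, which is what makes an online bootstrap on top of it attractive.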
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v1 [stat.ML])
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state of the art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v1 [cs.LG])
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.
    A General Framework for quantifying Aleatoric and Epistemic uncertainty in Graph Neural Networks. (arXiv:2205.09968v1 [cs.LG])
    Graph Neural Networks (GNNs) provide a powerful framework that elegantly integrates graph theory with machine learning for modeling and analysis of networked data. We consider the problem of quantifying the uncertainty in predictions of GNNs stemming from modeling errors and measurement uncertainty. We consider aleatoric uncertainty in the form of probabilistic links and noise in the feature vectors of nodes, while epistemic uncertainty is incorporated via a probability distribution over the model parameters. We propose a unified approach to treat both sources of uncertainty in a Bayesian framework, where Assumed Density Filtering is used to quantify aleatoric uncertainty and Monte Carlo dropout captures uncertainty in model parameters. Finally, the two sources of uncertainty are aggregated to estimate the total uncertainty in predictions of a GNN. Results on real-world datasets demonstrate that the Bayesian model performs on par with a frequentist model and provides additional information about prediction uncertainty that is sensitive to uncertainties in the data and model.
    Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization. (arXiv:2205.10217v1 [stat.ML])
    The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $\Omega(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(N)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.
    Generating Semantic Adversarial Examples via Feature Manipulation. (arXiv:2001.02297v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g. thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v3 [stat.ML] UPDATED)
    Deep reinforcement learning (DRL) has attracted much attention as an approach to solve sequential decision making problems without mathematical models of systems or environments. In general, a constraint may be imposed on decision making. In this study, we consider optimal decision making problems with constraints to complete temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time-sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which is called a $\tau$-CMDP. We formulate the STL-constrained optimal decision making problem as the $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v3 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
    EXODUS: Stable and Efficient Training of Spiking Neural Networks. (arXiv:2205.10242v1 [cs.NE])
    Spiking Neural Networks (SNNs) are gaining significant traction in machine learning tasks where energy-efficiency is of utmost importance. Training such networks using the state-of-the-art back-propagation through time (BPTT) is, however, very time-consuming. Previous work by Shrestha and Orchard [2018] employs an efficient GPU-accelerated back-propagation algorithm called SLAYER, which speeds up training considerably. SLAYER, however, does not take into account the neuron reset mechanism while computing the gradients, which we argue to be the source of numerical instability. To counteract this, SLAYER introduces a gradient scale hyperparameter across layers, which needs manual tuning. In this paper, (i) we modify SLAYER and design an algorithm called EXODUS, that accounts for the neuron reset mechanism and applies the Implicit Function Theorem (IFT) to calculate the correct gradients (equivalent to those computed by BPTT), (ii) we eliminate the need for ad-hoc scaling of gradients, thus, reducing the training complexity tremendously, (iii) we demonstrate, via computer simulations, that EXODUS is numerically stable and achieves a comparable or better performance than SLAYER especially in various tasks with SNNs that rely on temporal features. Our code is available at https://github.com/synsense/sinabs-exodus.
    B-cos Networks: Alignment is All We Need for Interpretability. (arXiv:2205.10268v1 [cs.CV])
    We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transforms in DNNs by our B-cos transform. As we show, a sequence (network) of such transforms induces a single linear transform that faithfully summarises the full model computations. Moreover, the B-cos transform introduces alignment pressure on the weights during optimisation. As a result, those induced linear transforms become highly interpretable and align with task-relevant features. Importantly, the B-cos transform is designed to be compatible with existing architectures and we show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets, whilst maintaining similar performance on ImageNet. The resulting explanations are of high visual quality and perform well under quantitative metrics for interpretability. Code available at https://www.github.com/moboehle/B-cos.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v3 [cs.LG] UPDATED)
    A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.
    Sparse Infinite Random Feature Latent Variable Modeling. (arXiv:2205.09909v1 [stat.ML])
    We propose a non-linear, Bayesian non-parametric latent variable model where the latent space is assumed to be sparse and infinite dimensional a priori using an Indian buffet process prior. A posteriori, the number of instantiated dimensions in the latent space is guaranteed to be finite. The purpose of placing the Indian buffet process on the latent variables is to: 1.) Automatically and probabilistically select the number of latent dimensions. 2.) Impose sparsity in the latent space, where the Indian buffet process will select which elements are exactly zero. Our proposed model allows for sparse, non-linear latent variable modeling where the number of latent dimensions is selected automatically. Inference is made tractable using the random Fourier approximation and we can easily implement posterior inference through Markov chain Monte Carlo sampling. This approach is amenable to many observation models beyond the Gaussian setting. We demonstrate the utility of our method on a variety of synthetic, biological and text datasets and show that we can obtain superior test set performance compared to previous latent variable models.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v3 [cs.LG] UPDATED)
    In lifelong learning systems based on artificial neural networks, one of the biggest obstacles is the inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this paper, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike networks of today, does not learn via the popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts synapses in a biologically-plausible fashion while another neural system learns to direct and control this cortex-like structure by mimicking some of the task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting compared to standard neural models, outperforming a swath of previously proposed methods, including rehearsal/data buffer-based methods, on both standard (SplitMNIST, Split Fashion MNIST, etc.) and custom benchmarks even though it is trained in a stream-like fashion. Our work offers evidence that emulating mechanisms in real neuronal systems, e.g., local learning, lateral competition, can yield new directions for tackling the grand challenge of lifelong machine learning.
    On Algorithmic Stability in Unsupervised Representation Learning. (arXiv:2106.05238v3 [cs.LG] UPDATED)
    In this paper, we investigate the algorithmic stability of unsupervised representation learning with deep generative models, as a function of repeated re-training on the same input data. Algorithms for learning low dimensional linear representations -- for example principal components analysis (PCA), or linear independent components analysis (ICA) -- come with guarantees that they will always reveal the same latent representations (perhaps up to an arbitrary rotation or permutation). Unfortunately, for non-linear representation learning, such as in a variational auto-encoder (VAE) model trained by stochastic gradient descent, we have no such guarantees. Recent work on identifiability in non-linear ICA has introduced a family of deep generative models that have identifiable latent representations, achieved by conditioning on side information (e.g. informative labels). We empirically evaluate the stability of these models under repeated re-estimation of parameters, and compare them to both standard VAEs and deep generative models which learn to cluster in their latent space. Surprisingly, we discover side information is not necessary for algorithmic stability: using standard quantitative measures of identifiability, we find deep generative models with latent clusterings are empirically identifiable to the same degree as models which rely on auxiliary labels. We relate these results to the possibility of identifiable non-linear ICA.
    Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores. (arXiv:2202.01748v2 [stat.ME] UPDATED)
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is in highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
    The Fairness of Credit Scoring Models. (arXiv:2205.10200v1 [stat.ML])
    In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, when doing so, they also often discriminate between individuals sharing a protected attribute (e.g. gender, age, racial origin) and the rest of the population. In this paper, we show how (1) to test whether there exists a statistically significant difference between protected and unprotected groups, which we call lack of fairness and (2) to identify the variables that cause the lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups.
    Counterfactual Temporal Point Processes. (arXiv:2111.07603v2 [cs.LG] UPDATED)
    Machine learning models based on temporal point processes are the state of the art in a wide variety of applications involving discrete events in continuous time. However, these models lack the ability to answer counterfactual questions, which are increasingly relevant as these models are being used to inform targeted interventions. In this work, our goal is to fill this gap. To this end, we first develop a causal model of thinning for temporal point processes that builds upon the Gumbel-Max structural causal model. This model satisfies a desirable counterfactual monotonicity condition, which is sufficient to identify counterfactual dynamics in the process of thinning. Then, given an observed realization of a temporal point process with a given intensity function, we develop a sampling algorithm that uses the above causal model of thinning and the superposition theorem to simulate counterfactual realizations of the temporal point process under a given alternative intensity function. Simulation experiments using synthetic and real epidemiological data show that the counterfactual realizations provided by our algorithm may give valuable insights to enhance targeted interventions.
    The Bayesian Context Trees State Space Model: Interpretable mixture models for time series. (arXiv:2106.03023v3 [stat.ME] UPDATED)
    A general hierarchical Bayesian framework is introduced for mixture modelling of real-valued time series, including a collection of effective tools for learning and inference. At the top level, a discrete context (or `state') is extracted for each sample, consisting of a discretised version of some of the most recent observations preceding it. The set of all relevant contexts is represented as a discrete context tree. At the bottom level, a different real-valued time series model is associated with each context (i.e., with each state). This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. We introduce algorithms that allow for efficient, exact Bayesian inference; in particular, the maximum a posteriori probability (MAP) model, including the relevant MAP context tree, can be identified exactly. These algorithms can be updated sequentially, facilitating efficient online forecasting. The utility of the general framework is illustrated in detail when autoregressive (AR) models are used at the bottom level, resulting in a nonlinear AR mixture model. Our methods are found to outperform several state-of-the-art techniques on both simulated and real-world data from economics and finance, both in terms of forecasting accuracy and computational requirements.
    Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome. (arXiv:2205.09906v1 [stat.ML])
    Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v1 [cs.LG])
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime which is only restricted locally in the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{mei2018mean, chizat2018global}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated nonlinearity of the training dynamics. This work establishes a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel logarithmic Sobolev inequality for two-layer neural networks, and uniform upper bounds on the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.
    Graph Neural Networks Are More Powerful Than we Think. (arXiv:2205.09801v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.
    What's the Harm? Sharp Bounds on the Fraction Negatively Affected by Treatment. (arXiv:2205.10327v1 [stat.ME])
    The fundamental problem of causal inference -- that we never observe counterfactuals -- prevents us from identifying how many might be negatively affected by a proposed intervention. If, in an A/B test, half of users click (or buy, or watch, or renew, etc.), whether exposed to the standard experience A or a new one B, hypothetically it could be because the change affects no one, because the change positively affects half the user population to go from no-click to click while negatively affecting the other half, or something in between. While unknowable, this impact is clearly of material importance to the decision to implement a change or not, whether due to fairness, long-term, systemic, or operational considerations. We therefore derive the tightest-possible (i.e., sharp) bounds on the fraction negatively affected (and other related estimands) given data with only factual observations, whether experimental or observational. Naturally, the more we can stratify individuals by observable covariates, the tighter the sharp bounds. Since these bounds involve unknown functions that must be learned from data, we develop a robust inference algorithm that is efficient almost regardless of how and how fast these functions are learned, remains consistent when some are mislearned, and still gives valid conservative bounds when most are mislearned. Our methodology altogether therefore strongly supports credible conclusions: it avoids spuriously point-identifying this unknowable impact, focusing on the best bounds instead, and it permits exceedingly robust inference on these. We demonstrate our method in simulation studies and in a case study of career counseling for the unemployed.
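    The A/B example in this abstract can be checked with a small calculation. The sketch below is not the paper's estimator: it assumes binary outcomes and no covariates, and uses the classical Fréchet-Hoeffding bounds on the joint probability of "would respond under A but not under B". The function name and argument names are hypothetical.

```python
def harm_fraction_bounds(p_control: float, p_treated: float) -> tuple[float, float]:
    """Bounds on P(Y(A)=1, Y(B)=0) given only the two marginal response rates.

    Assumption: binary outcomes, no covariate stratification. These are the
    classical Frechet-Hoeffding bounds, not the covariate-adjusted bounds
    developed in the paper.
    """
    for p in (p_control, p_treated):
        if not 0.0 <= p <= 1.0:
            raise ValueError("response rates must lie in [0, 1]")
    # At least this many responders under A must be lost under B.
    lower = max(0.0, p_control - p_treated)
    # At most: every A-responder is harmed, limited by non-responders under B.
    upper = min(p_control, 1.0 - p_treated)
    return lower, upper


# Half of users click under both A and B, as in the abstract's example.
print(harm_fraction_bounds(0.5, 0.5))
```

    With equal 50% response rates the bounds span from 0 (the change affects no one) to 0.5 (half helped, half harmed), which is the range of scenarios the abstract describes.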
    Causal Discovery and Injection for Feed-Forward Neural Networks. (arXiv:2205.09787v1 [cs.LG])
    Neural networks have proven to be effective at solving a wide range of problems but it is often unclear whether they learn any meaningful causal relationship: this poses a problem for the robustness of neural network models and their use for high-stakes decisions. We propose a novel method overcoming this issue by injecting knowledge in the form of (possibly partial) causal graphs into feed-forward neural networks, so that the learnt model is guaranteed to conform to the graph, hence adhering to expert knowledge. This knowledge may be given up-front or during the learning process, to improve the model through human-AI collaboration. We apply our method to synthetic and real (tabular) data showing that it is robust against noise and can improve causal discovery and prediction performance in low data regimes.
    Breaking the $\sqrt{T}$ Barrier: Instance-Independent Logarithmic Regret in Stochastic Contextual Linear Bandits. (arXiv:2205.09899v1 [stat.ML])
    We prove an instance-independent (poly)logarithmic regret bound for stochastic contextual bandits with linear payoff. Previously, in \cite{chu2011contextual}, a lower bound of $\mathcal{O}(\sqrt{T})$ is shown for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\mathrm{polylog}(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimation of the previous epoch is used to reduce the regret of the current epoch. The (poly)logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We have also shown experimentally that stochastic contexts indeed incur a regret that scales with $\mathrm{polylog}(T)$.

  • Open

    What do you think of this approach to help avoid overfitting? (Playing with episode length, start and end)
    Hi all, I'm working on applying RL to a time series environment. I have a limited (and relatively small, only ~30K rows) dataset for the agent to work through, and the data cannot be simulated. So basically, the agent takes a step, and the new observation/state is determined by the next row of the tabular dataset. I have reached a point of overfitting: in-sample performance of the RL agent continues increasing, but OOS performance is decreasing. I assume a big part of the reason for this is that the agent is somehow "memorizing" its training data. Here's the thing. At the moment, I am using ALL of the training data as one episode (so once it completes one episode having gone through ALL of the data, it restarts). What if I made an episode length equal to, say, 1/10th of my data and then started each episode at a random point in the time series? I am thinking that this way, I can have some kind of "randomness" in my otherwise deterministic environment, and perhaps this way I can force the agent to not "memorize" the training data? I am new to RL, so any feedback on this general idea would be greatly appreciated! EDIT: I have just tried this out, and while it's hard to tell (because of random fluctuations), using this technique DOES appear to have a positive effect on the out-of-sample performance! Again though, thoughts, ideas and feedback greatly appreciated. submitted by /u/VladimirB-98
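    The random-start idea in this post can be sketched in a few lines. This is only an illustration under stated assumptions: `sample_episode_window` is a hypothetical helper, the 30,000-row figure and the 1/10 episode length come from the post, and the environment itself (how rows become observations and rewards) is not shown.

```python
import random


def sample_episode_window(n_rows: int, episode_len: int, rng: random.Random) -> tuple[int, int]:
    """Pick a random contiguous slice [start, end) of the training rows.

    Hypothetical helper: an environment's reset() could call this and then
    step through rows start, start + 1, ..., end - 1.
    """
    if not 0 < episode_len <= n_rows:
        raise ValueError("episode_len must be between 1 and n_rows")
    # randint is inclusive on both ends, so the last valid start is n_rows - episode_len.
    start = rng.randint(0, n_rows - episode_len)
    return start, start + episode_len


rng = random.Random(0)        # seeded only to make the sketch reproducible
n_rows = 30_000               # approximate dataset size mentioned in the post
episode_len = n_rows // 10    # one tenth of the data per episode

start, end = sample_episode_window(n_rows, episode_len, rng)
print(start, end)
```

    Sampling should be restricted to the training rows, so that the out-of-sample segment never appears inside a training episode.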
    Need help with PettingZoo
    Hello everyone, I am working on a custom environment using PettingZoo for multi-agent reinforcement learning. I managed to finish the environment; however, now I don't know how to create the model. I also tried to create a model using stable_baselines3 PPO, but I get the error "AttributeError: 'functools._lru_cache_wrapper' object has no attribute 'shape'". I tried to follow the example from here, but I cannot install supersuit with pip on Windows 10 and I am not sure if it is needed. Any help, be it examples or advice, will be very much appreciated. submitted by /u/Iltavil
    Can reinforcement learning be used for unlabeled time series data clustering?
    submitted by /u/Affectionate_Worth43
    How should one interpret these PPO diagnostic training plots?
    https://preview.redd.it/9in8206mu0191.png?width=1453&format=png&auto=webp&s=e221eead27f4e781a13e34586bc3ad87d13e7810 So here I have four diagnostic plots for PPO training on a Gym CustomEnv. I have many questions regarding how to interpret them. That said, if there is anything you think is interesting or insightful about these graphs, I would love to hear it. Also, it might be useful to know that the training was indeed successful for this run, and the mean episode reward was (more or less) consistently improving. 1) What does an increasing clip fraction indicate? 2) What does an increasing KL divergence indicate? 3) Why does the policy gradient loss go above 0? Wouldn't this mean that the policy should be getting worse? In this case the policy continues to improve even after getting this positive loss. 4) Same as question 3 but for entropy loss. Any help whatsoever will be great. I'm quite at a loss. Thanks. submitted by /u/C_BearHill
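    For questions 1 and 2, it may help to see how these two diagnostics are commonly computed. The sketch below is an assumption about what the plots show, since the post does not name the library that produced them: it uses the usual definitions based on the probability ratio between the updated and the data-collecting policy. The function name, argument names, and the 0.2 clip range are illustrative.

```python
import numpy as np


def ppo_diagnostics(log_prob_new, log_prob_old, clip_range=0.2):
    """Return (clip_fraction, approx_kl) for one batch of sampled actions.

    Assumed definitions (common in PPO implementations, not confirmed by the post):
      clip_fraction: share of samples whose ratio left [1 - clip, 1 + clip]
      approx_kl:     mean((ratio - 1) - log(ratio)), a non-negative KL estimate
    """
    log_ratio = np.asarray(log_prob_new, dtype=float) - np.asarray(log_prob_old, dtype=float)
    ratio = np.exp(log_ratio)
    clip_fraction = float(np.mean(np.abs(ratio - 1.0) > clip_range))
    approx_kl = float(np.mean((ratio - 1.0) - log_ratio))
    return clip_fraction, approx_kl


# Toy batch: one of four actions became much more likely under the new policy.
old = np.log([0.25, 0.25, 0.25, 0.25])
new = old + np.array([0.5, 0.0, 0.0, 0.0])
print(ppo_diagnostics(new, old))
```

    Under these definitions both numbers grow when each update moves the policy further from the one that collected the data, so a rising clip fraction and a rising KL estimate describe the same thing: larger policy changes per update.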
  • Open

    [D] matrix profile distance measure characterization
    If there are various types of distance measures for time series, such as Euclidean, DTW, and shape-based ones, how can we characterize the matrix profile distance measure? As a profiling one? submitted by /u/jiii95
    How to create a basic version of DALL-E from scratch? [P]
    I've been searching online for a tutorial on how to create a text-to-image generator algorithm (a very basic version of DALL-E). I'd like to create the algorithm line by line rather than just copy some code from online and train that on my images. Basically, I'm trying to learn how these sorts of algorithms work by coding one from scratch. So, does anyone know of any video or online article tutorials for creating and training a basic text-to-image generator with Python? Thanks! submitted by /u/Special_Treacle4452
    [P] Gradio Demo for "Story and Video Generation" using GPT-J, Latent Diffusion, and FILM
    submitted by /u/Shikanomiya
    This is how you can turn yourself into an immortal machine. [N]
    submitted by /u/Defiant_Swann
    [r] How to train a neural network with a loss function based on cumulative AUC of multiple inputs
    Hello, I have almost no experience with neural networks, besides doing a few examples in books. Basically the problem is I have 1000 cases of which 100 are true and 900 are false. Each case has a score associated with it composed of multiple weighted sub-scores (Overall Score = W1 * Score 1 + W2 * Score 2 + ... + WN * Score N). The cases are then sorted by their overall score and a ROC curve is generated by tracking the running percent of positives identified vs. percent of negatives identified as you follow through the rank-ordered list of scores. I want to create a single set of input weights for the sub-scores which, when applied to all test cases, produces the maximal AUC. Any input is appreciated as I have no idea where to start. Thanks! submitted by /u/Fckcensorship25
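    The setup described in this post can be written down compactly. The sketch below is only an illustration: the data is synthetic, the names `overall_score` and `auc_from_scores` are hypothetical, and AUC is computed through its rank interpretation (the fraction of positive/negative pairs in which the positive case scores higher, with ties counted as half), which equals the area under the ROC curve built from the sorted list.

```python
import numpy as np


def overall_score(sub_scores, weights):
    """Overall Score = W1 * Score 1 + ... + WN * Score N, for every case at once."""
    return np.asarray(sub_scores, dtype=float) @ np.asarray(weights, dtype=float)


def auc_from_scores(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # every positive vs. every negative
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


# Synthetic stand-in for the post's data: 100 true cases, 900 false, 3 sub-scores.
rng = np.random.default_rng(0)
labels = np.r_[np.ones(100, dtype=bool), np.zeros(900, dtype=bool)]
sub_scores = rng.normal(size=(1000, 3))
sub_scores[labels, 0] += 1.0                    # only the first sub-score is informative

for weights in ([1.0, 0.0, 0.0], [1.0, 1.0, 1.0]):
    print(weights, auc_from_scores(overall_score(sub_scores, weights), labels))
```

    Two properties follow from the definition: AUC depends only on the ordering of the overall scores, so multiplying all weights by a positive constant leaves it unchanged, and it is piecewise constant in the weights, which is why gradient-based training usually replaces it with a smooth pairwise surrogate such as a logistic loss on score differences.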
    [D] Cheaper gpu cloud alternative to gradient (paperspace)
    Are there any cheaper subscription models like Gradient? Currently I pay 10 dollars a month for access to an RTX 5000 or P5000 for 6 hours at a time, after which you have to restart the instance, but the data is saved. Are there any better alternatives to this? Also, I'm a student, in case there is a student discount. submitted by /u/Knightron2525 [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 138
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 131-140 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 131 Week 2 Week 1…  ( 1 min )
    [P] PyTorch M1 GPU benchmark update including M1 Pro, M1 Max, and M1 Ultra after fixing the memory leak
    If someone is curious, I updated the benchmarks after the PyTorch team fixed the memory leak in the latest nightly release May 21->22. The results are quite improved: ​ https://preview.redd.it/5dkat9hoi3191.png?width=2637&format=png&auto=webp&s=dc42ee03167dd3aefbd0319061994bfc2ff24dab For a more detailed write-up please see https://sebastianraschka.com/blog/2022/pytorch-m1-gpu.html submitted by /u/seraschka [link] [comments]  ( 1 min )
    [D] Generating fake research papers with GPT-3
    Hi! I wanted to generate fake research papers with GPT-3. I'm not sure how to do it in a cost-efficient way: should I fine tune the model with custom examples? Since I am looking to generate the entire paper, I'm not sure how to maintain the context in different parts of the research paper (abstract, introduction, ....., conclusion). I was thinking of two paths: (1) Take an existing research paper. Feed it into GPT-3 to reword the entire research paper. (2) Take the title of an existing research paper. Make GPT-3 generate the abstract. Use the abstract to generate the introduction. Use the introduction to generate the description, and so on. I was planning to fine tune the GPT-3 model by breaking down existing research papers into (prompt: title, prompt: rest of the paper) Is there a better way to generate entire fake research papers? submitted by /u/mimeticaware [link] [comments]  ( 1 min )
    [D] Explainable AI
    Does anyone know papers, where experts are asked to interpret the results of explanation algorithms like SHAP and LIME? In the best case on timeseries data, where the explanation algorithm highlights which time points lead to decisions of models. submitted by /u/Bananymous97 [link] [comments]  ( 1 min )
    [R] HYDRA is the first spatial perception engine that builds a 3D scene graph (geometry and semantics) from sensor data in realtime
    submitted by /u/SpatialComputing [link] [comments]  ( 2 min )
    [D] EMNLP 2022 and ARR June 1
    Hi there! We are planning to submit a paper to the June 1 ARR deadline and commit to EMNLP 2022 later. I wonder if reviews will be out before July 24, the ARR-to-EMNLP commitment deadline. Are you submitting to EMNLP directly or via ARR this year? ​ Thanks in advance! Wishing you all the luck with paper submissions! submitted by /u/bunsenfeng [link] [comments]  ( 1 min )
    [D] which of these is a better ml strategy
    Using R here - set seed, divide data into train and test only once. Use repeated cross validation on the train data to select the best model out of all the algorithms you're interested in. Finally - use the best performing model on the test data and report the performance metrics. Divide the data into train and test say 10 times by changing the seed. Train model on each train data and report the performance metrics via predict function on each of the test data - now use the performance metrics of the test data to choose the model?? I am trying these strategies because I have very limited 2 sets of biological data with approx n × p ratio of 1 : 7 and 2:7. So the performance when changing the seeds on the test set changes drastically like R2 values varies from negative to 0.50 submitted by /u/triary95 [link] [comments]  ( 2 min )
    [D] GNN Architecture that inputs and outputs both edge and node features?
    I am looking for a GNN architecture/layer that takes in both the node and edge features and outputs both node and edge features too. The only one I could find is the "MetaLayer" in torch geometric from the paper "Relational inductive biases, deep learning, and graph networks" but that is from 2018. There must be something else and new out there. Any links to papers and/or implementations would be appreciated! submitted by /u/Hobo-Wizzard [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [R][P] A Python package for unsupervised mix data types clustering
    Problem statement: Clustering is an unsupervised machine learning approach for grouping unlabeled data. Unfortunately, most clustering methods can only operate on data that has either numeric or categorical properties. This is a significant issue since most real-world datasets contain multiple types of features. Solution: Today I am releasing the first version of the mixclu package for mixed data clustering. Please check it out and give it a star if you like it. Mixclu (Mix Clustering) is an open-source Python package for unsupervised clustering of mixed data types. This package puts together a variety of combination models, including kmeans-one-hot, Gower distance, umap, etc. The goal is to provide an easy-to-use implementation for each algorithm. The package is still in progress, and I plan to add deep learning-based methods such as autoencoders and other techniques. I also plan to implement a few papers based on mixed clustering problem statements. Package link: https://github.com/monk1337/Mixclu Actively looking for contributions to the following tasks: Implement paper: Affinity Learning for Mixed Data Clustering (I have Matlab code & data from the article's author. If anyone wants to contribute and convert the Matlab to Python code, please feel free to contact me.) Implement paper: A Multi-View Clustering for Mixed Data (I have Java code & data from the author of the article; please feel free to contact me to contribute.) More features and suggestions are welcome :) submitted by /u/aadityaura [link] [comments]  ( 1 min )
    [Discussion] Name of a (possibly non-existent) paper stating vision-language pretraining can improve performance on text-only tasks?
    I'm trying to remember the name of a paper I think I read, though there's a chance this paper doesn't exist and I'm just mixing it up with something else. In the off-chance I'm remembering it correctly, I think it trained a vision-language model and showed that it outperformed a unimodal text-only model on language tasks? So the idea of the paper was vision-language pretraining doesn't just help in vision-language tasks, but actually improves performance on text-only tasks. If this paper is real, does anyone remember what it was called or how I can find it? submitted by /u/BurnerAccount1100 [link] [comments]  ( 1 min )
    [D] machine learning on sequential data with gaps by design
    hi, in my research, we’re doing binary classification on sequential data. some of the sequences have gaps, or start and end at different points, and this is intentional, since the values are normalized flux at different wavelengths and many points are missing due to absorption features. it is important to know where the missingness occurs in each sequence. essentially, the gaps are intentional and this has created a lot of inconsistency and variance in the dataset, but the points can’t be interpolated. i’ve kind of hit a wall after attempting to do manual feature extraction (how scattered the non-missing data are at different wavelengths, whether there is a large gap in the middle, etc.). how should i handle this missing-by-design data? is it possible to pad the sequences or fill in the gaps with -99? will a standard RNN still work? thank you in advance. submitted by /u/queenofthekyriarchy [link] [comments]  ( 1 min )
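One commonly used option for the missing-by-design question above is to pair every value with a binary mask channel, so the model can see where the gaps are instead of inferring them from a sentinel such as -99. The sketch below assumes missing points are represented as `None` and that all sequences are padded to a common length; the function name `with_mask` and the fill value are illustrative assumptions, not something from the post.

```python
def with_mask(seq, length, fill=0.0):
    """Return a list of (value, mask) pairs of the given length.

    mask is 1.0 where a real measurement exists and 0.0 where the point is
    missing (None) or lies in the padding beyond the end of the sequence.
    """
    out = []
    for i in range(length):
        v = seq[i] if i < len(seq) else None
        if v is None:
            out.append((fill, 0.0))
        else:
            out.append((float(v), 1.0))
    return out
```

Each time step then has two input features, which a standard recurrent layer can consume without any interpolation of the gaps.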
    [D] Clustering high dimensional
    I have a dataset containing multiple cancer cell lines (rows). The dataset features/columns are different genes, where each gene has a value based on their PDUI (polyadenylation site usage index) score (between 0 and 1). The dataset has 53 rows (cell lines) and 12,500 columns (genes). For each cell line, I also have an IC50 value (taken from a different dataset) which shows resistance/sensitivity to an anti-cancer drug being developed. I would like to cluster the cell lines considering all their genes' PDUI scores and then create a plot against IC50, to see if there are any relationships. A problem is that there are some genes where the majority of the cancer cell lines (some cases 99%) have missing-N/A values (as they do not have that gene). This leads to a large portion of the dataset being filled with N/A values. To deal with that I drop every column where more than 20% of the samples do not have a value. For the remaining columns (20% or less N/A values), I use a KNN imputer to fill out their missing values based on their 5 nearest neighbours. This results in approximately 6,000 columns/features remaining with no missing values. What algorithm do you think will perform best for clustering such a high-dimensional dataset? How would you approach such a task? I have of course tried to reduce the dimensions using PCA, reducing the dataset to just 50 features (whilst keeping a 98.8% explained variance ratio). I then tried using HDBSCAN but the algorithm failed to find any clusters. I am open to any suggestions and would greatly appreciate any discussion! submitted by /u/Rafaelkoll [link] [comments]  ( 5 min )
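The column-filtering step described in the post above (drop every gene where more than 20% of the cell lines have no value) can be sketched in plain Python as follows. It assumes the data is a list of rows with `None` marking N/A; the imputation and PCA steps are left to whichever library is already in use.

```python
def drop_sparse_columns(rows, max_missing=0.2):
    """Keep only columns whose fraction of missing (None) values is <= max_missing.

    Returns the kept column indices and the filtered rows.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    keep = []
    for j in range(n_cols):
        missing = sum(1 for r in rows if r[j] is None)
        if missing / n_rows <= max_missing:
            keep.append(j)
    return keep, [[r[j] for j in keep] for r in rows]
```

Returning the kept indices alongside the data makes it possible to map clusters back to gene names afterwards.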
    [P] GPT-NeoX Playground
    We deployed 20B parameter GPT-NeoX model by #EleutherAI for anyone who wants to try it out. (some features are disabled on mobile). GPT-NeoX is 20 Billion parameter language model, which is open-sourced by EleutherAI. Link: https://neox.labml.ai/main Features that I like to highlight here, You can pick random sampling, nucleus sampling, greedy sampling or a beam search. Hover over tokens to see alternatives and the predicted probabilities. Outputs are shown as it receives in the UI. You can find our simple PyTorch annotated re-implementation of GPT-NeoX here. We love to hear your feedback and suggestions. Thank you all, and I appreciate the support. submitted by /u/hnipun [link] [comments]  ( 1 min )
    [P] TF\Keras + GA: is there a "best" one?
    I use TF\Keras + "KerasGA (PyGad)" for 10-30 parameter (climate) models, so far without much success. So I want to try other (Python-based) GA packages, like "Evolutionary Keras" and "Keras+DEAP"... My questions: Are there other powerful GA packages for Keras, which I should try? Is there something like a "best" package for complex, "difficult" problems? submitted by /u/rudel_s [link] [comments]  ( 1 min )
    [D] Why is the skip-gram model used in DeepWalk and Node2vec rather than CBOW?
    I am curious to understand the reason behind the decision to use Skip-gram rather than CBOW for these two models. According to the original Word2vec paper, CBOW is faster to train and captures syntactic similarities better whereas the skip-gram is slower at training but captures more robust semantic similarities and is also better at handling infrequent words. How does this apply to graph theory and what motivated this decision? DeepWalk: https://dl.acm.org/doi/abs/10.1145/2623330.2623732 Node2vec: https://dl.acm.org/doi/abs/10.1145/2939672.2939754 submitted by /u/LanverYT [link] [comments]  ( 1 min )
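To make the comparison in the question above concrete, here is a small sketch of how the two objectives turn the same random walk into training examples. It does not explain the authors' motivation; it only shows the structural difference. The walk is assumed to be a list of node identifiers, and both function names are illustrative.

```python
def skipgram_pairs(walk, window):
    """Skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


def cbow_examples(walk, window):
    """CBOW: one (context list, center) example per position in the walk."""
    examples = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        context = [walk[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples
```

Skip-gram produces one update per (node, neighbor) pair, so a rarely visited node still gets several updates of its own, whereas CBOW pools the context into a single example per position.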
    [R] [P] Train dataset with similar objects too close. Do I have to keep only one to detect with bounding boxes?
    Hi guys, I have training data with many similar objects close together in the scene. The objects are very wide, so if I remove the others from a training image and keep only one, the cropped image will be too wide. I'm asking because I will use bounding boxes to recognize the objects for counting purposes. Can I keep only one object and fill the rest of the image with black to maintain a square aspect ratio? I will probably have to do some data augmentation too (rotation). Thanks! submitted by /u/MrWrodgy [link] [comments]  ( 2 min )
    [R] Happy to share our latest Research paper : MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
    I am pleased to share the news with the ML community that the team I work for recently released a new benchmark: MedMCQA, a new large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. Our paper was accepted at the Conference on Health, Inference, and Learning (CHIL) 2022 and published in Proceedings of Machine Learning Research (PMLR). MedMCQA sample questions The main contributions: MedMCQA has more than 194k high-quality medical entrance exam MCQs. The dataset requires deeper domain and language understanding, as it tests 10+ reasoning abilities of a model across a wide range of medical subjects & topics. It covers 2.4k healthcare topics and 21 medical subjects with an average token length of 12.77, the …  ( 2 min )
  • Open

    Apple’s former ML director reportedly joins Google DeepMind
    submitted by /u/KendraMontgomery [link] [comments]
    Hack3: The Leading Online Hackathon for High Schoolers!
    ​ https://preview.redd.it/lmy70wdji3191.png?width=1270&format=png&auto=webp&s=a7737c6828ea94682d0dcef58f7afb391bcba0d0 Attention to curious high schoolers! Hack3 is hosting an online hackathon for high schoolers for 24 hours on June 25-26. In 2020, we connected nearly 300 students of all skill levels, to learn to build innovative projects that positively impacted the world. Over 100 attended our free classes led by industry professionals to learn new skills . Over twenty mentors were in our help desk to help participants when they needed help. Last year, our judges, mentors, and workshop instructors were affiliated with the likes of Stanford, Harvard, Amazon, NetApp, and Wikipedia. In 2021, we connected over 350 students of all skill levels, to learn to build innovative projects that positively impacted the world. Over 150 attended our free classes led by industry professionals to learn new skills. Over 30 mentors were in our help desk to help participants when they needed help. Last year, our judges, mentors, and workshop instructors were affiliated with the likes of Amazon, NetApp, Balsamiq, Nexus Bytes, Replit, Postman, and Wolfram Language. This year, with the lessons learned from 2021, we aim to host a competition consisting of over 500 participants, while targeting the underprivileged communities around the world. To help achieve our goal of providing a learning opportunity for everyone, we will be sponsoring internet access to those who need it to truly level the playing field for all. Are you down? Register on DevPost here. submitted by /u/thegreatestgemini [link] [comments]  ( 1 min )
    General AI through scaling? Meta's AI chief Yann LeCun speaks out: "We have a number of obstacles to clear, and we don’t know how."
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    AI book reading tool
    Greetings Folks, In the past year, we released a book reading AI tool to search for content within files using natural search, and we received constructive feedback from the community. Today we are releasing a new updated version with a fresh UI overhaul (desktop support). https://rastero.io/intro Glad to share it with you all. Here's an account to try it out! username: reddit password: reddit2022 https://reddit.com/link/uvhgx5/video/7b49kj1jq2191/player submitted by /u/deep_ak [link] [comments]  ( 1 min )
    HYDRA is the first spatial perception engine that builds a 3D scene graph (geometry and semantics) from sensor data in realtime
    submitted by /u/SpatialComputing [link] [comments]  ( 1 min )
    10 Ways AI Will Change The World By 2050
    By 2050, AI will reach remarkable advancements that will be beyond many people's wildest dreams. Robots will not only be able to attain, but also generate, that task in a cost-effective, timely, and meticulous manner, hence increasing efficiency. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Should AI Be Democratized?
    AI is considered a “general-purpose technology” due to its pervasive role in various industries. Many researchers argue that AI should be employed just in highly-impact domains, while others advocate the consensus to democratize AI so that the benefits are not restricted to a small group of people. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    AI Dream 53 - Cosmic Creation | MASTERPIECE TEASER
    submitted by /u/LordPewPew777 [link] [comments]
    Is there an AI which is able to combine 2 or more images?
    Or is there an AI I can use to combine 2 images, like a dragon from one image and a background from another image? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    How to Control an AGI via Motivation Selection
    My Dear AI Fellows, Please check out my latest video about how to control an AGI via Motivation Selection: https://youtu.be/rLB4xkwgEAw I also have a lot of great content on the channel regarding life 3.0, building an AGI, AGI Safety, etc. Please check them out and subscribe to my channel! submitted by /u/billgggggg [link] [comments]
    Greg Brockman on Twitter - I also really like this one, created by the prompt "DALL-E dreaming of becoming an AGI"
    submitted by /u/mofosyne [link] [comments]  ( 1 min )
    Building an ACOG (part 1) (artificial cognitive entity)
    submitted by /u/DavidKShapiro [link] [comments]
    2029 AGI really tho?
    Tell me, is that far fetched? I had an argument with a family member about when AGI arrives and if it even arrives at all. Is 2029 a reasonable date? I think so, but I need some backup submitted by /u/Ashamed-Asparagus-93 [link] [comments]  ( 2 min )
    For games - A lot of Artificial Intelligences are just really deep exploration trees. Not learning anything intuitive like a human
    This has been bugging me for a bit. When it comes to Chess - Stockfish is powerful because it can search incredibly deep. It doesn't 'look' at a position and 'think' a move is good. Maybe Alphazero is slightly different - as that has the deep search plus the evaluation? I can't remember exactly tbh. Poker AIs are just very deep brute force approaches too. Getting experience through tonnes of simulations. AlphaStar has to be different - this does learn general principles as the games are far too different each time. AlphaGo - I don't think I have the knowledge to comment on this one, but from the Chess and Poker ones, I would assume it's just a very deep search, but not 100% sure if anyone can clarify. Been a while since I watched the doc / read about it. I also experimented with a NN that 'solves' tic tac toe. It was a hello world level tutorial. But it didn't do anything special - the neural network was just memorizing the optimal solution for every position. When I expanded the game from TTT to connect 4, I realised it wasn't 'learning' anything from new positions. It was just memorizing every position - not impressive. Is this basically where we are at with AIs? They just have to play / simulate millions - trillions of times to outperform humans? submitted by /u/Cwlrs [link] [comments]  ( 1 min )
  • Open

    What strategies do you use to normalize data like distances to objectives?
    Hey, I am exploring neural networks and wondering how people handle unknown data ranges like distance. I'm not sure if I want to decide on an arbitrary maximum distance and normalize for that or if there's a standard way to deal with these kinds of things. One idea I had is to not use the actual distance but instead give a normalized value of the significance. If something is beyond a certain distance then it has an influence of 0 and if it's something like the creature is about to walk into a fire then the significance would be close to 1. Does that make sense? I've been seeing interesting evolution simulators (no training, occasional random changes and connections between neurons) and wanted to try my own. I made a creature with an x coordinate and known distance to a trap. I wasn't sure how to normalize the x coordinate or distance at large values so I thought of my 2nd idea there instead. Maybe that makes more sense for a sensory input anyway? Open to ideas or lexicon that will help me find an answer, thanks! submitted by /u/Tomnnn [link] [comments]  ( 1 min )
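The "significance" idea from the post above can be sketched in two ways. Both are illustrative assumptions rather than a standard recipe: a hard cutoff that maps distance linearly onto [0, 1], and a smooth exponential squash that needs no maximum distance at all. The parameter names `max_range` and `scale` are hypothetical tuning knobs.

```python
import math


def significance(distance, max_range):
    """1.0 when touching the object, falling linearly to 0.0 at max_range and beyond."""
    if distance >= max_range:
        return 0.0
    return 1.0 - distance / max_range


def squash(distance, scale):
    """Smooth alternative: exp(-d / scale) lies in (0, 1] for any non-negative distance."""
    return math.exp(-distance / scale)
```

The exponential form keeps the input bounded even for very large distances, at the cost of compressing differences between far-away objects.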
    Overfitting in first epochs
    Hey everyone, I am training a CNN on the DVS gesture dataset using PyTorch. However, training is not progressing smoothly: the training and validation accuracies both fluctuate a lot. They are both improving, but there is a big gap between them (5~6% up to 10%), as if there is overfitting in 3/4 epoch. I have tried L2 regularization as well as dropout with high values; the gap disappears in the first iterations but reappears strongly afterward, and I am sure that the datasets are properly merged and split randomly. PS: Could this be underfitting, and how do I identify underfitting? Thanks in advance! submitted by /u/StartFinancial5917 [link] [comments]  ( 2 min )

  • Open

    Can I somehow enhance the quality of this image
    I don't know if this is the right subreddit to post in, but I want to enhance this image https://preview.redd.it/pq435gmd3x091.jpg?width=626&format=pjpg&auto=webp&s=5a01c6de7402dcb8d83e069cc5410a6c9c9d9017 submitted by /u/AdnanZXgamer [link] [comments]
    Without fail 😁🔥
    submitted by /u/p0goniphaft111 [link] [comments]
    who is the Godfather of AI?
    How come Krishnamurti said in the year 1981 that 'the computer can do ANYTHING a human being can do'? .. at a time when personal computers were very slow compared to today. This might be a longer rant about these questions, and it offers a perspective where I suggest answers, as well as regaining control of a territory once lost: the mind. Let's start with the almost three-hour-long documentary by Adam Curtis called Hypernormalisation, in which he describes how governments no longer attempt to deal with the complexity of the real world, but only with the models created. What should have been a map describing the world assumed a primary role for actions and decisions. Reality becomes 'something you handle' and perception management of the masses became important. The process described by Ada…  ( 2 min )
    Can we train AI to capture the process of scientific evolution, and then fast-forward it to obtain future technology?
    Partly inspired by this article: https://www.quantamagazine.org/machine-scientists-distill-the-laws-of-physics-from-raw-data-20220510/, which describes AI that discovers new biology/physics equations from raw data. My question is: humans have come a long way from throwing stones to having all the technologies today. This thousands of years of evolution is a process in which new knowledge is developed from existing knowledge–countless cycles of observation, experimentation, and conclusion. Is it possible then, to train an AI to capture this process of generating new knowledge from existing knowledge, and use this AI to fast-forward scientific evolution, thus quickly obtaining future technology that would otherwise take decades to develop? submitted by /u/Independent_Ant_2027 [link] [comments]  ( 1 min )
    Pretty roses.
    submitted by /u/cookingandcraft [link] [comments]
    GATO outperformed experts on 450 tasks out of 604. A bold step towards an AGI
    submitted by /u/imapurplemango [link] [comments]  ( 1 min )
    What website to make image like this just with prompt?
    submitted by /u/Due-Ad9795 [link] [comments]
    How Uber uses AI to serve you better
    submitted by /u/OnlyProggingForFun [link] [comments]
    Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit
    submitted by /u/bartturner [link] [comments]
    A Look At 3 Big Risks With AI
    https://www.youtube.com/watch?v=0kEqqP8PlUw submitted by /u/kbf_ [link] [comments]
    Best Natural Language Processing Books for Beginners to read in 2022
    submitted by /u/maneesh123456 [link] [comments]
    I'm looking for AI image generators that make art based on the image(s) you provide
    Hi, instead of text-to-image AI generators, I'm wondering if there are any alternatives to The Looking Glass AI. The Looking Glass AI asks you for an image, to generate images based on them and a theme input. I tried looking for alternatives myself but I was only able to come across text-to-image generators so far. submitted by /u/DTanya [link] [comments]  ( 1 min )
  • Open

    Research that may interest those working on the development of AGI (it identifies how consciousness/cognition evolved in living processes)
    A number of key challenges stand in the way of the development of HLAI and AGI. However, overcoming them has been handicapped by a lack of relevant progress in the human sciences: they have failed to produce clear understandings of how some key functions that are absent in current AI are enabled in humans. A paper just published in the journal BioSystems deals with some of these key issues: for example, it identifies at a cybernetic/systems level how consciousness, 'real-time' sensorimotor coordination, and mental modelling/cognition arose and developed in living organisms. These insights might prove useful in identifying how these functions could be instantiated and developed in AI. Titled 'The evolution and development of consciousness: the subject-object emergence hypothesis', the fu…  ( 2 min )
    What is the Alpha go training time measured in human time?
    I was listening to a talk where it was mentioned that the Alpha go training time in human time is 56 years. But there was no reference for that information. I'm wondering whether it's true, and how this was calculated. Any leads are appreciated, thanks! submitted by /u/exenson [link] [comments]  ( 1 min )
    Is there a way to sample a constant action for the whole episode?
    Hi, Usually actions are sampled for each step and applied to the agent. However, is there a way to sample an action that remains constant until the episode ends? I am using Isaac Gym. submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
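One simple way to get the behavior asked about above is to sample the action once, right after the reset, and reuse it for every step until the episode ends. The sketch below uses a hypothetical `ToyEnv` with a minimal `reset`/`step` interface purely for illustration; it is not the Isaac Gym API, and the same idea would need to be adapted to that framework's batched tensors.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # Observation is the step count; reward is simply the action value.
        return float(self.t), action, done


def run_constant_action_episode(env, sample_action):
    """Sample one action after reset and apply it unchanged at every step."""
    env.reset()
    action = sample_action()  # sampled exactly once per episode
    taken, total, done = [], 0.0, False
    while not done:
        _, reward, done = env.step(action)
        taken.append(action)
        total += reward
    return taken, total
```

From the learner's point of view this turns each episode into a single decision, so the problem resembles a contextual bandit more than a multi-step control task.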
    PPO learns how to play browser-based game
    Hi everyone, From past years, I had a JavaScript game I wrote as a fun side-project. Now that I'm studying reinforcement learning, I wanted to see if I could code PPO into the game so that it could train *directly in the browser* without needing Python or any other dependencies... Well, I couldn't find any implementations of PPO in JavaScript, so I decided to code it into the game myself. I'm really happy with the end result - I've even included a pretrained agent that can autoplay the game live! (it beats the game ~85% of the time 😅) Source code: https://github.com/hmomin/ppo-winter-run Game: https://winter-run.com Hope you enjoy! I'm wondering if you all have any interest in a PPO implementation for JavaScript/TypeScript? If so, I may rework the source code a bit to play more easily with gym-like environments and release it as a separate project in a future post... submitted by /u/hmomin [link] [comments]  ( 1 min )
    Recommendation system with limited data?
    Hi! I’m attempting to build a recommendation system using reinforcement learning for a side project. I was wondering if there are algorithms that work well with limited data? Thanks in advance! Edit: also wondering if there are any systems that are trained by reaching a goal (beating an opponent in games) and are used in a different field (not gaming) submitted by /u/brioche789 [link] [comments]  ( 1 min )
    12 Best Courses to Learn Deep Learning
    submitted by /u/MlTut [link] [comments]
  • Open

    [R][P] Gradio Web demo for StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators (SIGGRAPH 2022)
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    Deepfakes - How Long Do They Take to Make? [D]
    How long do deepfakes take to make Example scenario I make a 1 minute video where I'm talking to the camera I want to swap out my face for someone else I want it to be realistic (So people wouldn't be able to know its not my face) Questions Best way to achieve this? How long does it take? Easiest way to achieve this? Is DeepfaceLab the best option? submitted by /u/AviatorPrints [link] [comments]  ( 1 min )
    [N] Surgical Tracking Challenge
    We, at the Hamlyn Centre - Imperial College London, will be hosting a first of its kind Tracking Challenge for Surgery at MICCAI 2022. If you are interested, please visit: https://surgt.grand-challenge.org submitted by /u/aweld20 [link] [comments]
    [D] Looking for examples of failures of correlation-based NLP
    I am looking for examples (eg screenshots of dialogues) of using recent NLP models and getting unintuitive results because the model infers a wrong causal relationship between two words because they occur together often. submitted by /u/dr_cosmicomical [link] [comments]
    [D] Character-level vs. word-level tokenization
    Hi all, I'm relatively new to the field of NLP and while reading a blog post from 2015 The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy, I was wondering about this part of the "Further Reading" section: Currently it seems that word-level models work better than character-level models, but this is surely a temporary thing. Aren't most state-of-the-art models these days using some kind of vocabulary, i.e. whole words or at least sub-words? Text in the wild can be full of typos, emojis or other unicode craziness, so wouldn't training all these LLMs on a character level make them more flexible and better applicable to real life problems? I'd love to hear your opinions about this and to be pointed towards good resources to learn more about different tokenization methods and their limitations and performance implications. Cheers. submitted by /u/CodeAllDay1337 [link] [comments]  ( 2 min )
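The trade-off raised in the question above can be seen in a toy example. The sketch assumes a fixed, hypothetical word vocabulary and whitespace splitting; real systems typically use sub-word schemes that sit between these two extremes.

```python
def word_tokenize(text, vocab, unk="<unk>"):
    """Word-level: anything outside the vocabulary collapses to a single unknown token."""
    return [w if w in vocab else unk for w in text.lower().split()]


def char_tokenize(text):
    """Character-level: no out-of-vocabulary problem, but much longer sequences."""
    return list(text.lower())
```

With the vocabulary `{"the", "cat", "sat"}`, the typo "caat" becomes `<unk>` at the word level and loses all of its information, while the character-level version keeps every letter at the price of a sequence several times longer for the model to process.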
    [Project] How to create visualizations for "complex" networks?
    For a presentation I need to make my own visualization of a graph neural network with encoders and other additional parts. I would like to get something like this: ​ https://preview.redd.it/yi1erwam1t091.png?width=868&format=png&auto=webp&s=adabbe066182b2065b835ea50e8fc83c3574910a Does anyone know a convenient way to do this? submitted by /u/mr_birrd [link] [comments]  ( 2 min )
    [N] Introducing PeerXiv - A modern platform for peer-review of preprints
    (Check out the Twitter thread) What would a peer review process look like if it was designed today? Peer review is one of the cornerstones of the research community, and yet while our community keeps advancing and growing, the reviewing process remains almost unchanged. We strongly believe that peer review can be so much better for both authors and reviewers and we are excited to share PeerXiv, our proposal to do just that. Check out the PeerXiv Mock: https://peerxiv.web.app/about ​ https://i.redd.it/g0nv9iz94s091.gif PeerXiv is a modern platform for the peer review of preprints. Authors can submit their preprints and get feedback directly from a set of anonymous PeerXiv reviewers who earn reputation points for their effort📈 ​ https://preview.redd.it/5rfck4kb4s091.jpg?width=1022&…  ( 2 min )
  • Open

    Decoding a grid square
    I saw a reference last night to the grid square EL29fx and wanted to figure out where that is. There are many programs that will do this for you, but I wanted to do it by hand. I wrote about how grid squares work a year ago, but I was rusty on the details, so […] Decoding a grid square first appeared on John D. Cook.  ( 2 min )
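    The summary above is truncated, so the following is only a minimal sketch of how such a decoding could go, assuming "grid square" means the Maidenhead locator system used in amateur radio. The function name `decode_grid_square` is hypothetical and not taken from the post.

```python
def decode_grid_square(locator):
    """Return (lat, lon, lat_size, lon_size) for a 4- or 6-character locator.

    (lat, lon) is the south-west corner of the cell in degrees;
    the sizes are the cell height and width in degrees.
    """
    loc = locator.strip()
    if len(loc) not in (4, 6):
        raise ValueError("expected a 4- or 6-character locator")

    # Field: letters A-R, each 20 degrees of longitude by 10 of latitude.
    lon = -180.0 + 20.0 * (ord(loc[0].upper()) - ord("A"))
    lat = -90.0 + 10.0 * (ord(loc[1].upper()) - ord("A"))

    # Square: digits 0-9, each 2 degrees of longitude by 1 of latitude.
    lon += 2.0 * int(loc[2])
    lat += 1.0 * int(loc[3])
    lon_size, lat_size = 2.0, 1.0

    if len(loc) == 6:
        # Subsquare: letters a-x, each 5 minutes of longitude by 2.5 of latitude.
        lon_size, lat_size = 2.0 / 24, 1.0 / 24
        lon += (ord(loc[4].lower()) - ord("a")) * lon_size
        lat += (ord(loc[5].lower()) - ord("a")) * lat_size

    return lat, lon, lat_size, lon_size


lat, lon, lat_size, lon_size = decode_grid_square("EL29fx")
print(f"SW corner: {lat:.4f}, {lon:.4f}")
```

    Adding half of each cell size to the corner gives the center of the cell rather than its south-west corner.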
    Exponential of a line
    Draw a line in the complex plane. What is the image of that line when you apply the exponential function? A line through w with direction z is the set of points w + tz where w and z are complex and t ranges over the real numbers. The image of this line is exp(w+ […] Exponential of a line first appeared on John D. Cook.  ( 1 min )
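    As a small numerical check of the setup above, and not a reconstruction of the truncated post: since exp(w + tz) = exp(w)·exp(tz), the modulus of the image point is exp(Re w + t·Re z) and its argument is Im w + t·Im z. The helper names and sample values below are hypothetical.

```python
import cmath
import math


def image_of_line_point(w, z, t):
    """Image under exp of the point w + t*z on the line."""
    return cmath.exp(w + t * z)


def polar_form(w, z, t):
    """The same point rebuilt from its modulus and argument."""
    radius = math.exp(w.real + t * z.real)
    angle = w.imag + t * z.imag
    return cmath.rect(radius, angle)


w, z = 0.5 + 1.0j, 0.2 + 1.0j  # arbitrary sample values
for t in (-1.0, 0.0, 2.5):
    direct = image_of_line_point(w, z, t)
    rebuilt = polar_form(w, z, t)
    print(t, cmath.isclose(direct, rebuilt, rel_tol=1e-9))
```

    If Re z = 0 the modulus is constant and the image stays on a circle; if Im z = 0 the argument is constant and the image stays on a ray; otherwise both change with t and the image spirals.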
    Discrete derivatives
    If you’ve taken calculus, and someone asks you what the derivative of x^5 is, you can say without hesitation that it’s 5x^4. Now suppose they come back and say, “I’m sorry. I forgot to give you any context. Here x^5 is a polynomial in the field of 343 elements.” It turns out that this additional […] Discrete derivatives first appeared on John D. Cook.  ( 2 min )
  • Open

    Monkey Patching Python Code
    Python is a dynamic scripting language. Not only does it have a dynamic type system where a variable can be assigned to one type first and changed later, but its object model is also dynamic. This allows us to modify its behavior at run time. A consequence of this is the possibility of monkey patching. […] The post Monkey Patching Python Code appeared first on Machine Learning Mastery.  ( 13 min )
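    A minimal illustration of the idea described above, not taken from the linked article: a method on a class is replaced at run time and then restored. `Greeter` and `shout` are hypothetical names used only for this sketch.

```python
class Greeter:
    def greet(self, name):
        return f"Hello, {name}"


def shout(self, name):
    # Replacement with the same signature as Greeter.greet.
    return f"HELLO, {name.upper()}!"


g = Greeter()
original_greet = Greeter.greet  # keep a reference so the patch can be undone

Greeter.greet = shout           # monkey patch: existing instances see it too
patched = g.greet("ada")

Greeter.greet = original_greet  # restore the original behavior
restored = g.greet("ada")

print(patched)
print(restored)
```

    Keeping a reference to the original attribute and restoring it afterwards limits the patch to the code that needs it, which matters most in tests.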
  • Open

    Unleash Your Dragon (Remastered 8K 60FPS)
    submitted by /u/stepanmetior [link] [comments]
  • Open

    Keeping Your Company’s Data Model IP
    Robotic process automation (RPA) when it first gained attention five or so years ago disappointed me. Here was yet another stopgap measure, a digital form of tape and baling wire to temporarily join scripts of oft-repeated tasks in a specified workflow sequence across commonly used applications. RPA seemed quite brittle–wouldn’t you have to regenerate the… Read More »Keeping Your Company’s Data Model IP The post Keeping Your Company’s Data Model IP appeared first on Data Science Central.  ( 5 min )

  • Open

    [R] A paper that contributes to the development of AGI by identifying how consciousness/cognition evolved in living processes
    A number of key challenges stand in the way of the development of HLAI and AGI. However, overcoming them has been handicapped by a lack of relevant progress in the human sciences: they have failed to produce clear understandings of how some key functions that are absent in current AI are enabled in humans. A paper just published in the journal BioSystems deals with some of these key issues: for example, it identifies at a cybernetic/systems level how consciousness, 'real-time' sensorimotor coordination, and mental modelling/cognition arose and developed in living organisms. These insights might prove useful in identifying how these functions could be instantiated and developed in AI. Titled 'The evolution and development of consciousness: the subject-object emergence hypothesis', the fu…  ( 2 min )
    [N] Introducing NGC-Learn: Predictive Coding and Neurobiologically-Motivated Learning in Python
    Interested in doing research in neurobiologically-inspired artificial neural networks? Need an open-source, actively maintained tool for reproducing the latest paper on predictive coding or building your own more biologically-faithful neural system? ngc-learn is a recently-released Python library designed in response to these questions. The ngc-learn dynamics simulator is specifically meant for building, simulating, and analyzing arbitrary predictive coding models based on the neural generative coding (NGC) computational framework and theoretically guided by the free energy principle. This toolkit, distributed under the 3-Clause BSD license, is built on top of Tensorflow 2. Notably, ngc-learn's extensible nodes-and-cables system is general and can even be used to build non-predictive cod…  ( 1 min )
    [D] How would one even maintain a generalist agent like Gato?
    DeepMind just published a paper on Gato, which is a generalist agent that can perform more than 600 tasks. I get that there's still a huge debate/discussion about AGI, HLAI, ethical concerns, etc., so such an agent is unlikely to be deployed in production soon, but let's just entertain that idea for a second: how would one even maintain a model like that? If it performs well on all tasks except for one, how would you retrain it only for the one task on which it underperformed? The model uses the same weights and biases for all the tasks, so even if you chose to freeze certain nodes, how would you even go about doing that? submitted by /u/Lexayne [link] [comments]  ( 1 min )
    "[Project]" Brainchop: In-browser deep learning framework for volumetric Segmentation
    https://preview.redd.it/udqgrwbzlo091.png?width=570&format=png&auto=webp&s=d856ab9334e6a0c5ae052c762fdfb8d6e22cfb61 Live Demo: brainchop.org Brainchop is a client-side web application for automatic segmentation of MRI volumes that brings automatic volumetric segmentation capability to neuroimaging by running a robustly pre-trained deep learning model. The app does not require technical sophistication from the user and is designed for locally and privately segmenting the user’s T1 volumes. Results of the segmentation may be easily saved locally after the computation. An intuitive interactive interface that requires neither special training nor specific instructions to run enables access to state-of-the-art deep learning brain segmentation for anyone with a modern browser (e.g. Firefox, Chrome etc.) and commonly available hardware. Additionally, we make the implementation of brainchop freely available, releasing its pure JavaScript code as open source. https://preview.redd.it/q9q0zq3wlo091.png?width=3333&format=png&auto=webp&s=0110de38b24890373f4640166c53de508969ded0 submitted by /u/Character-Rip-5824 [link] [comments]  ( 1 min )
    What libs/boiler plate/platforms do you use to abstract and optimize your workflow when starting a new project? [D]
    E.g. a new data project where I'm currently investigating the data. I'm joining multiple data sources to create a standard data object represented by X number of features. Now I want to get a sense of the distribution and skew of each feature (basic stat analysis), do some elementary clustering, and be able to repeat the process as I add, change or remove features. I can write abstractions in code to do these things, such as a function that plots the frequency of values for a specific feature and gives mean, mode, median etc., but I'm wondering if there are libraries or platforms you all use to avoid this tediousness. I understand that may sacrifice the customizable nature of writing your own code; however, I'm interested in such things to get a quick sense of the metadata. submitted by /u/gravbeamemitter [link] [comments]  ( 1 min )
    [P] SSO (Single Sign-On) for CVAT, the computer vision annotation tool
    For those who are interested in using CVAT with SSO, previously I made a proof-of-concept video to demonstrate my SSO implementation for CVAT: https://www.youtube.com/watch?v=R7hBBLG5Fdc Now I'm happy to announce that I have submitted my code changes: https://github.com/AlexGaoDW/cvat/tree/feature/datawiza-sso And I've created a PR to get it into the official repo. You can try it out by yourself following the document here: https://docs.datawiza.com/guides/cvat.html I also set up an instance using Google as the identity provider such that you can try SSO functionality with your Google account: https://cvat-sso.datawiza.net/ Enjoy! submitted by /u/alexcgg1 [link] [comments]  ( 1 min )
    [R] Proof Of Useful Work For AI Training On The Blockchain
    submitted by /u/EducationalCicada [link] [comments]  ( 2 min )
    [D] IEEE Access Article accepted as the first author, what's next?
    Dear members, This is my very first experience in submitting to IEEE Access as a first author. My article was accepted after submitting minor edits with a rebuttal. However, in my acceptance email, the reviewer asks me to incorporate his/her comments including revamping the sections, adding elements in figures, more explanations, and adding references. The same email also states that "you are not permitted to add or remove authors or references post-acceptance" and "to take this opportunity to improve the English grammar and check spelling, as the article is only lightly edited before publication". This has become a very confusing situation for me. Anyone having experience in publishing in this journal is requested for guidance, please. submitted by /u/HQ2020 [link] [comments]  ( 2 min )
    [P] Add conditional node to random forest?
    Hi all, I should say up front that I am pretty new to ML, so pardon me if I say anything wrong. I am wondering if it would be possible to do the following: a modified semi-supervised RandomForest, with a conditional node added as a positive branch to all decision trees which controls for certain features? Basically I would like the algorithm to check the quantitative level of a feature (dependent variable) in samples positive for the said feature (independent variable). The algorithm has to set a cutoff for the quantitative variable only if the categorical variable is TRUE (such is the case when the quantitative variable is ≠ 0). Does it make sense? How can I add such a conditional statement? I mainly work in R using RandomForest but I am also a Python user, so I'm open to trying scikit-learn. Thanks submitted by /u/ElMochiKris [link] [comments]  ( 1 min )
    [D] Mining restaurant reviews for vibe
    tl;dr: I recently procured 1m+ restaurant reviews. Now I'd like to see if I can mine any hidden and interesting information from them to build a search system. Looking for advice. If you were doing this project - "determine the vibe of a restaurant based on the reviews. Allow a user to search based on expected loudness at venue" - how would you approach it? Boring text about what I've tried: So far, I've tried using top2vec at different granularities of the reviews. For example: each document is all reviews for a restaurant concatenated together; each document is a complete review; each document is a paragraph; each document is a single sentence. If I search for topics, I can certainly find some interesting ones. Here are the top 10 from the "complete review" model, for example: …  ( 3 min )
    [P] Extract relevant text from error logs
    I'm trying to extract relevant text from error logs (example below) and then classify it by error reason. I've been doing it (but mostly failing) using the following process: (1) use regex to filter out text between brackets and remove dates, times, IPs, filepaths, etc.; (2) remove stop words and keep only real words; (3) put the resulting string of words through a sentence transformer (all-mpnet-base-v2 in this case) to get embeddings; (4) reduce dimensionality and cluster; (5) within clusters, use the most frequent words (up to 5) as a label for the "error reason". This process is really brittle because of the regex and because there is so much variability in log structures. Furthermore, the "error reason" labels are usually nonsensical because the process doesn't properly extract the relevant text. For the example below, I get "failed_run_coverage_statements_panic", when I'd hope to get something relating to timing out. Does anyone have any ideas of how I might approach this? Example error log: Failed === RUN GoJobSubscriber {"level":"info","ts":1649273105.2566006,"caller":"rabbitmq/connection.go:20","msg":"client successfully connected to RMQ"} {"level":"info","ts":1649273105.2577667,"caller":"rabbitmq/connection.go:28","msg":"client successfully created channel"} {"level":"info","ts":1649273105.258319,"caller":"rabbitmq/connection.go:34","msg":"client successfully configured QOS"} {"level":"info","ts":1649273105.2584598,"caller":"rabbitmq/connection.go:45","msg":"client successfully assigned dependencies"} {"level":"info","ts":1649273105.2894936,"caller":"rabbitmq/connection.go:52","msg":"client successfully ensured topology"} {"level":"info","ts":1649273105.2906847,"caller":"rabbitmq/subscription.go:46","msg":"Starting event subscriber...."} coverage: 55.8% of statements panic: test timed out after 2m0s goroutine 11 [running]: testing.(*M).startAlarm.func1() submitted by /u/iblis3 [link] [comments]  ( 1 min )
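    One possible pre-filtering step for the pipeline described in the post, sketched under assumptions: each log record sits on its own line in the raw log (the post shows it flattened), structured records are JSON objects with "level" and "msg" keys as in the example, and the keyword list is a hypothetical starting point rather than a complete one. This is not the poster's method, just an illustration of separating structured info lines from free-text failure lines before any embedding step.

```python
import json
import re

# Hypothetical keyword list; extend it for other log formats.
KEYWORDS = re.compile(r"\b(panic|fatal|error|timed out|exception)\b", re.IGNORECASE)
BAD_LEVELS = {"error", "fatal", "panic"}


def extract_error_lines(log_text):
    """Return the lines most likely to state the failure reason."""
    relevant = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if isinstance(record, dict):
            # Structured record: keep its message only if the level signals a problem.
            if str(record.get("level", "")).lower() in BAD_LEVELS:
                relevant.append(str(record.get("msg", "")))
            continue
        # Unstructured line: keep it only if it mentions a failure keyword.
        if KEYWORDS.search(line):
            relevant.append(line)
    return relevant


sample = """Failed
=== RUN GoJobSubscriber
{"level":"info","ts":1649273105.2566006,"caller":"rabbitmq/connection.go:20","msg":"client successfully connected to RMQ"}
coverage: 55.8% of statements
panic: test timed out after 2m0s
goroutine 11 [running]:
"""
print(extract_error_lines(sample))
```

    Only the surviving lines would then be embedded and clustered, which keeps routine info messages out of the cluster labels.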
    [D] Do you use any tools to automate the data cleaning process?
    Data cleaning is very tedious. Are there solutions that actually solve the data cleaning problem or is it that we have to go through it manually because that helps to learn about the data? submitted by /u/Speech_titan [link] [comments]  ( 1 min )
    [2205.08957] Meta-Learning Sparse Compression Networks
    submitted by /u/Forsaken_Scientist [link] [comments]
    [D] Thinking about buying a Jetson Nano and Coral Stick for ML
    Currently I use Google Colab and Kaggle kernels for several hours a day to train models. But given the limitations of the free tiers, I ask myself whether it is worth buying a Jetson Nano 4GB and using it together with the Coral Stick as an alternative, or whether buying Colab Pro is better. Most of the time I use TensorFlow, mostly for image/video classification and object detection. Is it worth buying, or has anyone had similar experiences or suggestions? submitted by /u/Stevy1981 [link] [comments]  ( 1 min )
    [D] Google Colab equivalent alternative to R / Rstudio
    I want to use a GPU for running some R code. I know that Google Colab is a good option if using Python and allows access to a GPU; is there an equivalent for RStudio? What about RStudio Server: what are the advantages of using RStudio Server when it uses the computer's CPU anyway? submitted by /u/triary95 [link] [comments]  ( 1 min )
    [P] Reviewing Marketing Mix Model - concerned about a few issues...
    Hi, everyone. I am reviewing a marketing mix model done by someone before I started the job. They have a model which has been claimed to give an accurate attribution across different media channels and other variables. It gives a pretty good result (R² = 0.95), I admit. But I was concerned by the way the long-term effect of advertising has been worked into the model. The current model uses the sum of brand investment with a 95% carryover effect to account for the 'long term effect', while also using the same brand investment figure to account for the 'short term effect'. I pointed out that this would raise multicollinearity issues, but was told that this isn't a problem as the "modeling was done with an RF". Is there a way to better explain my concern? submitted by /u/SuperSodori [link] [comments]  ( 1 min )
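    One way to make the concern concrete is to build both features from the same spend series and look at how strongly they are related. The sketch below assumes the "95% carryover" means a geometric adstock (each period keeps 95% of the previous accumulated value); that reading, the function names, and the spend figures are all assumptions for illustration, not details of the model under review.

```python
def adstock(spend, carryover):
    """Geometric adstock: accumulated[t] = spend[t] + carryover * accumulated[t-1]."""
    accumulated, carried = [], 0.0
    for x in spend:
        carried = x + carryover * carried
        accumulated.append(carried)
    return accumulated


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Made-up weekly brand spend, used as the 'short term' feature.
short_term = [10, 12, 0, 0, 15, 20, 5, 0, 0, 8, 14, 3]
# The same series with 95% carryover, used as the 'long term' feature.
long_term = adstock(short_term, 0.95)

print("correlation between the two features:", pearson(short_term, long_term))
```

    A random forest can still predict well from features that are deterministic transforms of each other, but the way importance or attribution is divided between them can shift from fit to fit, and attribution is what a marketing mix model is used for. Refitting on resampled data and comparing the short-term versus long-term split would show whether that is happening here.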
    [D] Current use of word2vec in 2022
    We can all agree that word2vec was revolutionary in the field of NLP. My question is what is the current state of this technique? Is it mainly used for educational purposes to teach students about the building blocks of word embeddings? - Is it used in other fields such as studying the evolution of languages through historical texts? - Is it used in specific tasks such as only syntactic similarities, or only semantic similarities? - ... submitted by /u/LanverYT [link] [comments]  ( 2 min )
    [D] Is Mixed Market Modelling full of crap?
    I have spent the last year working as an Analyst for a large media company and have been primarily occupied with MMM projects. I need to clarify that I had never previously worked in or studied economics, and I come from a natural sciences background with a focus on statistics. Since I have been in this company, I have been struggling so much for several different reasons. One of these reasons is that I feel like the work we're doing is complete and utter bullshit! I am trying to figure out if the problem is me (which I am sure that some of it is) or if this is a common phenomenon. Every time I start modelling I am super enthusiastic about it and determined that this time it'll all make sense, and every time I end up exasperated and ready to give up, quit my job and never come back. I feel like the models we are building are so full of crap, and simply aimed at justifying our client's expectations in order to make them happy - we always end up using mad coefficients and breakdowns of variables just so that they are what they should be. I consider myself to be very analytical and with good problem-solving skills - I have been praised by my managers and given really good feedback for exceeding expectations and all that jazz. HOWEVER, I constantly feel like an utter failure, because I spend so much time trying to make sense of these models to the point where I exhaust myself and give up and then just do what I feel everyone else does - manipulate the metrics so that it works our way, and this kills my soul every single time. Is this something that is widely happening in the industry and am I being too idealistic/perfectionist? Or am I seriously lacking some training (that I wouldn't say I got much of) and what can I do to improve? P.S. I am at the point where I have almost given up and I am close to leaving my job. I have run my mental health down so much to the point of burnout, so any advice would be so very much appreciated! submitted by /u/necplorer [link] [comments]  ( 6 min )
    [P] Training to read PDF documents. Any ideas?
    Currently I use regex to identify patterns, coupled with the OCR library (Tesseract), to import PDF data into dataframes. It's a bit of a hit-and-miss process. Every new exception has to be built into the regex. I was wondering if machine learning can take up the job? One way to go about it might be marking positions of data from the OCR to be read into respective columns, but there might be a better way. Has anybody done this? submitted by /u/card_chase [link] [comments]  ( 4 min )
    [R] AI Research Rankings 2022: Sputnik Moment for China?
    Hey Reddit! A heated debate is going on today on the state of the strategic race between the United States and China to dominate in AI. I decided to gather some facts and analyzed publications at ICML 2021 and NeurIPS 2021. Here are the findings -- would love to hear what you think! 🤝❤️🤖 https://thundermark.medium.com/ai-research-rankings-2022-sputnik-moment-for-china-64b693386a4 submitted by /u/chersonesus [link] [comments]  ( 2 min )
  • Open

    Sim-2-real problem regarding system delay
    If the goal is to train an agent for a robot control policy, the actions stand for current values which control the robot joints. In the real system, however, there are system delays and communication delays, so applying the actions to the robot does not directly result in motion, whereas it does in simulation (for instance in Isaac Gym, which I am using). As I have measured, the real system takes 250~300 ms to react to a given input and rotate its joints. Therefore, the control policy trained in the simulator, where the system delay is about 0~15 ms, is no longer usable. What approaches could overcome this sim-2-real problem without identifying a model of the system? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
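    One commonly used idea for this kind of mismatch is to inject a comparable action delay into the simulation, randomized per episode, so the policy is trained under the latency it will meet on hardware. The sketch below is simulator-agnostic and is not Isaac Gym API: `DelayedActionBuffer`, `delay_to_steps` and the 20 ms control step are hypothetical, and only the 250~300 ms range comes from the post.

```python
import random
from collections import deque


def delay_to_steps(delay_ms, step_ms):
    """Convert a latency in milliseconds into a whole number of control steps."""
    return max(0, round(delay_ms / step_ms))


class DelayedActionBuffer:
    """FIFO that releases each action `delay_steps` control steps after it was issued."""

    def __init__(self, delay_steps, default_action):
        # Pre-fill so the first `delay_steps` outputs are the default (e.g. zero current).
        self.queue = deque([default_action] * delay_steps)

    def push(self, action):
        """Store the policy's new action and return the one the robot applies now."""
        self.queue.append(action)
        return self.queue.popleft()


# Per-episode randomization over the measured range, assuming a 20 ms control step.
steps = delay_to_steps(random.uniform(250.0, 300.0), step_ms=20.0)
buffer = DelayedActionBuffer(steps, default_action=0.0)

policy_action = 1.0                    # stand-in for the policy output at this step
applied = buffer.push(policy_action)   # what the simulator would receive
print(steps, applied)
```

    Training with the delay usually works better if recent actions are also added to the observation, since otherwise the delayed system is no longer Markov from the policy's point of view.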
    Hierarchical Reinforcement Learning (HRL)
    Hello, I intend to apply HRL, more specifically the option-critic, where I have three options. However, I don't see much documentation on the internet. Do you have any implementation code I could draw inspiration from (to see how it works)? submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments
    submitted by /u/jkterry1 [link] [comments]
    Ideas to start writing papers
    Hey there! I'm eager to start writing a paper, I have a background in AI, but I can't think of ideas for a topic to write a paper about. How do you guys get ideas for papers? Is it a good idea to start writing surveys? submitted by /u/ZazieIsInTheHouse [link] [comments]  ( 1 min )
    Channel policy on RL-related job opportunities
    Hi all, I am aware of two job opportunities at my company in the area of experimentation and RL (mainly MAB and cMAB style research and application). I was curious what this channel's policy is on making people aware of these job openings, or if there is a better venue for that (Discord channels, etc.). Let me know! Thanks! submitted by /u/Helga-Helga [link] [comments]  ( 1 min )
    Let's build an Autonomous Taxi 🚖 using Q-Learning (Deep Reinforcement Learning Free Class by Hugging Face 🤗)
    Hey there! I’m happy to announce that we just published the second Unit of the Deep Reinforcement Learning Class 🥳 In this Unit, we're going to dive deeper into one of the Reinforcement Learning methods, value-based methods, and study our first RL algorithm: Q-Learning. We'll also implement our first RL agent from scratch, a Q-Learning agent, train it in two environments and share it with the community: Frozen-Lake-v1 ⛄ (non-slippery version), where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H); and an autonomous taxi 🚕 that will need to learn to navigate a city to transport its passengers from point A to point B. You’ll be able to compare the results of your Q-Learning agent using the leaderboard 🏆 1️⃣ The introduction to q-learning part 1 article 👉 https://huggingface.co/blog/deep-rl-q-part1 2️⃣ The introduction to q-learning part 2 article 👉 https://huggingface.co/blog/deep-rl-q-part2 3️⃣ The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb 4️⃣ The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard https://i.redd.it/2zk7rb9w9n091.gif If you have questions or feedback, I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 1 min )
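    For readers who want the core of the algorithm before opening the notebook, here is a minimal sketch of the standard tabular Q-Learning update. It is not code from the linked class; the function name, table layout and hyperparameter defaults are assumptions made for this example.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-Learning step:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q_table[next_state])      # greedy value of the next state
    td_target = reward + gamma * best_next    # bootstrapped target
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Tiny table: 2 states x 2 actions, stored as a list of lists.
q = [[0.0, 0.0],
     [0.0, 1.0]]
new_value = q_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9)
print(new_value)
```

    In a full agent this update runs inside the environment loop, with actions chosen by an exploration rule such as epsilon-greedy, and terminal transitions use the reward alone as the target.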
    Any recommended paper on using Deep RL on Drones
    Hi, I am a beginner in using Deep RL on robots (especially drones). I have tried RL on OpenAI Gym environments like Cartpole, Breakout and BipedalWalker, but I have never applied it to an actual robot. Now I want to apply DDPG to a drone (in simulation), and I have failed miserably several times. Is there any paper that explains it in detail? Any other resources are also welcome. submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    What is offline reinforcement learning?
    Offline Reinforcement Learning (RL), also known as Batch Reinforcement Learning, is a variant of RL that effectively leverages large, previously collected datasets for large-scale real-world applications. The use of static datasets means that during the training process of the agent, offline RL does not perform any form of online interaction and exploration, which is also the most significant difference from online reinforcement learning methods. For convenience, we refer to non-offline reinforcement learning, including both on-policy and off-policy RL, as online reinforcement learning (Online RL) in the following sections. Figure: Illustration of three classic online reinforcement learning modes. In the figure, (a) stands for On-policy RL, where the agent uses the current policy π_k to interact…  ( 8 min )
    Looking for project mates
    I'm trying to reimplement the paper High-Throughput Synchronous Deep RL using Ray. Is anyone interested in working on this together? submitted by /u/SirRantcelot [link] [comments]  ( 1 min )
  • Open

    Excuse the language, but why the fuck is there nothing accessible to the public right now that can generate genuinely comprehensible images like Dalle 2?
    Forgive me for sounding irritated, but is there absolutely nothing I can use right now, be it a website, program, mobile app, whatever, that I can just pump some cool stuff out of? Why are we all being shown this amazing technology only to be told "Yo this sick tech exists, but you ain't fuckin using it lol" except for a few select people (e.g. MKBHD's access to dalle 2 for a day).. I'm not talking about excuses such as nightcafe or wombodream.. I've used them to death and had nothing but quite frankly terrible results and want something that I can pop in terms such as, I don't know, "Drift car nissan silvia" that gives a picture of an actual car for album art etc, or something along those lines, if not just to admire how crazy AI is becoming. Look, if this stuff existed but was kept completely secret, then the whole 'what you don't know can't hurt you' idea would apply and I would not be pissed off. What is frustrating is that all this wizardry is being flaunted and dangled in our faces whilst also being kept out of reach. So my initial question still applies, is there anything remotely available right now that can generate images somewhat comparable? Thanks! submitted by /u/Michael_Goodwin [link] [comments]  ( 3 min )
    TPI - an open-source tool that is the first to train machine learning models on any cloud using Terraform solution
    Terraform Provider Iterative (TPI) is the first product that simplifies ML training on any cloud, helping infrastructure and ML team members save significant time and money in maintaining and configuring the training infrastructure: Iterative adoption of an open-source tool that is the first to train machine learning models on any cloud using Terraform. Terraform provides a flexible CLI service system for managing hundreds of cloud services, and TPI enables data scientists to delegate responsibilities without discovering software. submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Rephrasing an English paragraph using AI - Writing is all about culture
    submitted by /u/data-gig [link] [comments]
    OpenAI's DALL-E 2 is pretty compliant - but who is responsible anyway?
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    A Roadmap overview To A Career In AI
    According to the World Economic Forum, over 97 million more jobs may be created by 2025, all due to the application of AI in different fields. AI is one of the century’s most important and disruptive technology achievements. Here is A Roadmap overview To A Career In AI submitted by /u/artiba-AI [link] [comments]
    This article showcases a tutorial on fine-tuning layoutLM v2 for invoice recognition starting from annotation to training and inference.
    submitted by /u/UBIAI [link] [comments]
    Automated CV Pipelines | Box to Polygon
    The 5th episode of the webinar series on Automated CV Pipelines is coming up! It will be covering automatic instance segmentation and methods to streamline the annotation process. If you're interested, you can register here! submitted by /u/WeekendClassic [link] [comments]
    Google AI Plans for 2030
    When someone searches a query on Google, it leaves a carbon dioxide footprint in the atmosphere. Google AI plans for 2030 have been revealed, discussing what goals Google has to lower its carbon footprint and allowing other partners to play their part in guaranteeing environmental sustainability. Read more submitted by /u/ridamughal110 [link] [comments]
    Top AI Investment Opportunities In 2021
    There are multiple reasons companies are finding investment opportunities in AI that are going to be beneficial for them in the year 2021. Investment in AI will help a wide range of organizations go through the economic crisis as they emerge from the pandemic. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    How Air Traffic Can Be Optimized Using Artificial Intelligence
    With the fast-growing and high-density global air traffic, ensuring efficiency and air transportation safety becomes a critical challenge. AI is already revolutionizing the way air traffic management systems are manufactured and hence is believed to play a key role in optimizing air traffic flow. Read more submitted by /u/ridamughal110 [link] [comments]
    Is the advancement of AI in accordance with the best long-term interests of Humanity?
    AI advancements in digital technology are growing, and today we have far more technical capability than we had in the 90s, with the potential to expand even quicker in the future. Is, however, the continuance of AI growth in the best interests of humanity? Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Is artificial intelligence the most powerful thing that humans have ever created?
    Artificial Intelligence is one of the most powerful things humans have been working on for decades, and its limitless magical spells are altering our lives. Read more submitted by /u/ridamughal110 [link] [comments]  ( 1 min )
    Build ML with zero data, training or setup with Humingbird!
    submitted by /u/holamyeung [link] [comments]  ( 2 min )
    Where can I best get OPT 175B to run?
    I know I sound like a douche. I got access to the OPT 175B model for my research, but my university's GPU capabilities aren’t sufficient. Usually, I train my LLMs on two local 50GB GPUs, but that doesn’t seem to work now, so what would you recommend? submitted by /u/Trick_Brain [link] [comments]  ( 1 min )
    Reinforcement learning companies
    Are there any companies (startups are fine too) doing work in Reinforcement Learning - more specifically in game NPCs/bots that you guys know? submitted by /u/Nice_Working [link] [comments]
    Please help me out in my research.
    I used the glob function, but instead of getting 33 categories I only get one, hence the output layer of the model is just 1, how could I fix it? Thank you guys in advance and all the help will be appreciated https://colab.research.google.com/drive/1NdpWmOVbqV2EUJ50vLj7nhWdWWHtbY_8?usp=sharing submitted by /u/Vector-Desperandum24 [link] [comments]  ( 1 min )
  • Open

    5 Ways to Optimize Database Performance
    Database performance allows developers or database administrators to enhance the system resources for lasting performance improvements. Databases are like the central nervous system of an application. They are responsible for the organization and function of critical processes. Even minor database-related performance issues can impact the entire operation. Locating issues in the databases… Read More »5 Ways to Optimize Database Performance The post 5 Ways to Optimize Database Performance appeared first on Data Science Central.  ( 6 min )
  • Open

    AI Predicts Volcano Eruptions + Rapidly Detect Earthquakes | AI Robot Autonomously Reduces Jellyfish Overpopulation | AI Detects Pancreatic Cancer
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    Advice
    Hello, I need help with finding weights on a couple of outputs but I don't quite understand how it works fully. Could anyone help me? It's for educational purposes; I just want advice and help, not to have the work done for me. submitted by /u/106steve [link] [comments]
  • Open

    What is Extended Reality?
    Advances in extended reality have already changed the way we work, live and play, and it’s just getting started. Extended reality, or XR, is an umbrella category that covers a spectrum of newer, immersive technologies, including virtual reality, augmented reality and mixed reality. From gaming to virtual production to product design, XR has enabled people Read article > The post What is Extended Reality? appeared first on NVIDIA Blog.  ( 5 min )
    From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX
    Building next-generation intelligent vehicles requires an AI infrastructure that pushes the cutting edge. Electric vehicle maker NIO is using NVIDIA HGX to build a comprehensive data center infrastructure for developing AI-powered, software-defined vehicles. With high-performance compute, the automaker can continuously iterate on sophisticated deep learning models, creating robust autonomous driving algorithms in a closed-loop environment. Read article > The post From Cloud to Car: How NIO Develops Intelligent Vehicles on NVIDIA HGX appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    ML for Algorithmic Trading, with Stefan Jansen
    Listen to this episode on Anchor FM. In this episode of the DATAcated podcast, host Kate Strachnyi talks with Stefan Jansen about machine… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    Data Science Project Vs Machine Learning Project: Python Data Science Complete Course — Part2
    Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
  • Open

    The Berkeley Crossword Solver
    We recently built the Berkeley Crossword Solver (BCS), the first computer program to beat every human competitor in the world’s top crossword tournament. The BCS combines neural question answering and probabilistic inference to achieve near-perfect performance on most American-style crossword puzzles, like the one shown below: Crosswords are challenging for humans and computers alike. Many clues are vague or underspecified and can’t be answered until crossing constraints are taken into account. While some clues are similar to factoid question answering, others require relational reasoning or understanding difficult wordplay. Here are a handful of example clues from our dataset (answers at the bottom of this post): They’re given out at Berkeley’s HAAS School (4) Winter hrs. in Berkeley (3) …  ( 3 min )
  • Open

    Artificial intelligence predicts patients’ race from their medical images
    Study shows AI can identify self-reported race from medical images that contain no indications of race detectable by human experts.  ( 7 min )
  • Open

    Design choice and machine learning model performances. (arXiv:2201.10239v2 [stat.ML] UPDATED)
    An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and model for data analysis is often not driven by statistical or algorithmic advantages, thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This article discusses the choice of design in relation to the ML model performances. A study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    Class-Aware Generative Adversarial Transformers for Medical Image Segmentation. (arXiv:2201.10737v3 [cs.CV] UPDATED)
    Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of generative adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.
    Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v4 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. Our first result is a new universal lower bound on the number of single-node interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of single-node interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better. To prove our lower bound, we develop the notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. We also use the techniques developed here to extend our results to the setting of multi-node interventions.
    Unsupervised Learning of Rydberg Atom Array Phase Diagram with Siamese Neural Networks. (arXiv:2205.04051v2 [physics.comp-ph] UPDATED)
    We introduce an unsupervised machine learning method based on Siamese Neural Networks (SNN) to detect phase boundaries. This method is applied to Monte-Carlo simulations of Ising-type systems and Rydberg atom arrays. In both cases the SNN reveals phase boundaries consistent with prior research. The combination of leveraging the power of feed-forward neural networks, unsupervised learning and the ability to learn about multiple phases without knowing about their existence provides a powerful method to explore new and unknown phases of matter.
    SepTr: Separable Transformer for Audio Spectrogram Processing. (arXiv:2203.09581v2 [cs.CV] UPDATED)
    Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
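    The two-stage attention described in the SepTr abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation (see their repository for that): it assumes plain dot-product self-attention without learned projections, and the names `self_attention` and `separable_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Dot-product self-attention over the second-to-last axis.

    tokens: (..., n_tokens, dim). Learned query/key/value projections
    are omitted here as a simplifying assumption.
    """
    dim = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(dim)
    return softmax(scores, axis=-1) @ tokens

def separable_attention(x):
    """x: (freq_bins, time_steps, dim) grid of spectrogram tokens."""
    # Stage 1: tokens in the same frequency bin attend to each other (across time).
    x = self_attention(x)
    # Stage 2: tokens in the same time interval attend to each other (across frequency).
    x = np.swapaxes(x, 0, 1)            # (time_steps, freq_bins, dim)
    x = self_attention(x)
    return np.swapaxes(x, 0, 1)         # back to (freq_bins, time_steps, dim)

spec_tokens = np.random.default_rng(0).normal(size=(8, 16, 4))
out = separable_attention(spec_tokens)
```

    Each stage only builds attention matrices along one axis (16x16 per frequency bin, then 8x8 per time step) instead of one 128x128 matrix over all tokens.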
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v2 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists of a family of point clouds (one per snapshot) coupled via Schrödinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence at an exponential rate to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.
    Personalized Interventions for Online Moderation. (arXiv:2205.09462v1 [cs.SI])
    Current online moderation follows a one-size-fits-all approach, where each intervention is applied in the same way to all users. This naive approach is challenged by established socio-behavioral theories and by recent empirical results that showed the limited effectiveness of such interventions. We propose a paradigm-shift in online moderation by moving towards a personalized and user-centered approach. Our multidisciplinary vision combines state-of-the-art theories and practices in diverse fields such as computer science, sociology and psychology, to design personalized moderation interventions (PMIs). In outlining the path leading to the next-generation of moderation interventions, we also discuss the most prominent challenges introduced by such a disruptive change.
    Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns. (arXiv:2205.09390v1 [stat.ML])
    Rapid advances in sensor, wireless communication, cloud computing and data science have brought an unprecedented amount of data to assist transportation engineers and researchers in making better decisions. However, traffic data in reality often has corrupted or incomplete values due to detector and communication malfunctions. Data imputation is thus required to ensure the effectiveness of downstream data-driven applications. To this end, numerous tensor-based methods treating the imputation problem as low-rank tensor completion (LRTC) have been attempted in previous works. To tackle rank minimization, which is at the core of LRTC, most of the aforementioned methods utilize the tensor nuclear norm (NN) as a convex surrogate for the minimization. However, the over-relaxation issue in NN keeps it from achieving desirable performance in practice. In this paper, we define an innovative nonconvex truncated Schatten p-norm for tensors (TSpN) to approximate tensor rank and impute missing spatiotemporal traffic data under the LRTC framework. We model traffic data as a third-order tensor of (time intervals, locations (sensors), days) and introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the tensor mode-n fibers. Despite nonconvexity of the objective function in our model, we derive the global optimal solutions by integrating the alternating direction method of multipliers (ADMM) with generalized soft-thresholding (GST). In addition, we design a truncation rate decay strategy to deal with varying missing rate scenarios. Comprehensive experiments are finally conducted using real-world spatiotemporal datasets, which demonstrate that the proposed LRTC-TSpN method performs well under various missing cases, while outperforming other SOTA tensor-based imputation models in almost all scenarios.
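    For context on the nuclear-norm baseline this abstract contrasts itself with, here is a minimal sketch of singular value soft-thresholding, the standard proximal step for the nuclear norm in the matrix case. It is not the paper's truncated Schatten p-norm or its generalized soft-thresholding; the function name `singular_value_threshold` and the toy data are assumptions for illustration.

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * (nuclear norm): shrink every singular value by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # small singular values become exactly zero
    return (U * s_shrunk) @ Vt

# Toy example: a rank-2 matrix plus small noise (synthetic, for illustration only).
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
noisy = low_rank + 0.01 * rng.normal(size=low_rank.shape)

denoised = singular_value_threshold(noisy, tau=0.5)
rank_after = np.linalg.matrix_rank(denoised, tol=1e-8)
```

    Because every singular value is shrunk by the same amount, the large ones that carry the signal are biased too, which is one way to picture the over-relaxation issue the abstract refers to.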
    Evolutionary latent space search for driving human portrait generation. (arXiv:2204.11887v2 [cs.CV] UPDATED)
    This article presents an evolutionary approach for synthetic human portrait generation based on the latent space exploration of a generative adversarial network. The idea is to produce different human face images very similar to a given target portrait. The approach applies StyleGAN2 for portrait generation and FaceNet for face similarity evaluation. The evolutionary search is based on exploring the real-coded latent space of StyleGAN2. The main results over both synthetic and real images indicate that the proposed approach generates accurate and diverse solutions, which represent realistic human portraits. The proposed research can contribute to improving the security of face recognition systems.
    Approximating Persistent Homology for Large Datasets. (arXiv:2204.09155v2 [stat.ML] UPDATED)
    Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying smaller multiple subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
    Causal Inference from Small High-dimensional Datasets. (arXiv:2205.09281v1 [cs.LG])
    Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.
    Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation. (arXiv:2205.04183v2 [cs.CV] UPDATED)
    We propose a simple but effective source-free domain adaptation (SFDA) method. Treating SFDA as an unsupervised clustering problem and following the intuition that local neighbors in feature space should have more similar predictions than other features, we propose to optimize an objective of prediction consistency. This objective encourages local neighborhood features in feature space to have similar predictions while features farther away in feature space have dissimilar predictions, leading to efficient feature clustering and cluster assignment simultaneously. For efficient training, we seek to optimize an upper bound of the objective, resulting in two simple terms. Furthermore, we relate popular existing methods in domain adaptation, source-free domain adaptation and contrastive learning via the perspective of discriminability and diversity. The experimental results prove the superiority of our method, and our method can be adopted as a simple but strong baseline for future research in SFDA. Our method can also be adapted to source-free open-set and partial-set DA, which further shows the generalization ability of our method.
    High-resolution landscape-scale biomass mapping using a spatiotemporal patchwork of LiDAR coverages. (arXiv:2205.08530v1 [stat.AP] CROSS LISTED)
    Estimating forest aboveground biomass at fine spatial scales has become increasingly important for greenhouse gas estimation, monitoring, and verification efforts to mitigate climate change. Airborne LiDAR continues to be a valuable source of remote sensing data for estimating aboveground biomass. However, airborne LiDAR collections may take place at local or regional scales covering irregular, non-contiguous footprints, resulting in a 'patchwork' of different landscape segments at different points in time. Here we addressed common obstacles including selection of training data, the investigation of regional or coverage specific patterns in bias and error, map agreement, and model-based precision assessments at multiple scales. Three machine learning algorithms and an ensemble model were trained using field inventory data (FIA), airborne LiDAR, and topographic, climatic and cadastral geodata. Using strict selection criteria, 801 FIA plots were selected with co-located point clouds drawn from a patchwork of 17 leaf-off LiDAR coverages (2014-2019). Our ensemble model created 30m AGB prediction surfaces within a predictor-defined area of applicability (98% of LiDAR coverage) and resulting AGB predictions were compared with FIA plot-level and areal estimates at multiple scales of aggregation. Our model was overall accurate (% RMSE 13-33%), had very low bias (MBE $\leq$ $\pm$5 Mg ha$^{-1}$), explained most field-observed variation (R$^2$ 0.74-0.93), produced estimates that were both largely consistent with FIA's aggregate summaries (86% of estimates within 95% CI), as well as precise when aggregated to arbitrary small-areas (mean bootstrap standard error 0.37 Mg ha$^{-1}$). We share practical solutions to challenges faced when using spatiotemporal patchworks of LiDAR to meet growing needs for biomass prediction and mapping, and applications in carbon accounting and ecosystem stewardship.
    Multi-DNN Accelerators for Next-Generation AI Systems. (arXiv:2205.09376v1 [cs.AR])
    As the use of AI-powered applications widens across multiple domains, so do the computational demands. The primary drivers of AI technology are deep neural networks (DNNs). Whether focusing on cloud-based systems that serve multiple AI queries from different users, each with their own DNN model, or on mobile robots and smartphones employing pipelines of various models or parallel DNNs for the concurrent processing of multi-modal data, the next generation of AI systems will have multi-DNN workloads at their core. Large-scale deployment of AI services and integration across mobile and embedded systems require additional breakthroughs on the computer architecture front, with processors that can maintain high performance as the number of DNNs increases while meeting the quality-of-service requirements, giving rise to the topic of multi-DNN accelerator design.
    Simplifying Node Classification on Heterophilous Graphs with Compatible Label Propagation. (arXiv:2205.09389v1 [cs.LG])
    Graph Neural Networks (GNNs) have been predominant for graph learning tasks; however, recent studies showed that a well-known graph algorithm, Label Propagation (LP), combined with a shallow neural network can achieve comparable performance to GNNs in semi-supervised node classification on graphs with high homophily. In this paper, we show that this approach falls short on graphs with low homophily, where nodes often connect to nodes of the opposite classes. To overcome this, we carefully design a combination of a base predictor with the LP algorithm that enjoys a closed-form solution as well as convergence guarantees. Our algorithm first learns the class compatibility matrix and then aggregates label predictions using the LP algorithm weighted by class compatibilities. On a wide variety of benchmarks, we show that our approach achieves the leading performance on graphs with various levels of homophily. Meanwhile, it has orders of magnitude fewer parameters and requires less execution time. Empirical evaluations demonstrate that simple adaptations of LP can be competitive in semi-supervised node classification in both homophily and heterophily regimes.
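    A rough sketch of why a class compatibility matrix helps label propagation on a heterophilous graph. The update rule F <- (1 - alpha) * B + alpha * A_norm @ F @ H is an assumed, simplified form chosen for illustration, not the paper's exact algorithm; the paper learns the compatibility matrix, whereas here `compat` is set by hand, and the function name is hypothetical.

```python
import numpy as np

def compatible_label_propagation(adj, base_pred, compat, alpha=0.8, n_iter=50):
    """Iterate F <- (1 - alpha) * B + alpha * A_norm @ F @ H.

    adj:       (n, n) adjacency matrix
    base_pred: (n, c) class probabilities from a base predictor (B)
    compat:    (c, c) class compatibility matrix (H); identity gives plain LP
    """
    deg = adj.sum(axis=1, keepdims=True)
    a_norm = adj / np.maximum(deg, 1.0)        # row-normalized adjacency
    f = base_pred.copy()
    for _ in range(n_iter):
        f = (1 - alpha) * base_pred + alpha * a_norm @ f @ compat
    return f

# Path graph 0-1-2-3 whose true classes alternate: 0, 1, 0, 1 (fully heterophilous).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
# Only node 0 has an informative base prediction; the rest are uniform.
base_pred = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

plain = compatible_label_propagation(adj, base_pred, np.eye(2))
hetero = compatible_label_propagation(adj, base_pred, np.array([[0.0, 1.0], [1.0, 0.0]]))

plain_labels = plain.argmax(axis=1)    # plain LP copies class 0 to every node
hetero_labels = hetero.argmax(axis=1)  # compatibility-weighted LP alternates classes
```

    With the identity matrix the update reduces to ordinary label propagation, which spreads node 0's label to all of its neighbors; the swap matrix encodes "neighbors tend to have the opposite class" and recovers the alternating pattern.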
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v2 [stat.ML] UPDATED)
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show on two public benchmark datasets that the resulting model improves forecasting performance.
    Towards Applicable Reinforcement Learning: Improving the Generalization and Sample Efficiency with Policy Ensemble. (arXiv:2205.09284v1 [cs.LG])
    It is challenging for reinforcement learning (RL) algorithms to succeed in real-world applications like financial trading and logistics systems due to noisy observations and environment shift between training and evaluation. Thus, resolving real-world tasks requires both high sample efficiency and generalization. However, directly applying typical RL algorithms can lead to poor performance in such scenarios. Considering the great performance of ensemble methods on both accuracy and generalization in supervised learning (SL), we design a robust and applicable method named Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner. Notably, EPPO combines each policy and the policy ensemble organically and optimizes both simultaneously. In addition, EPPO adopts a diversity enhancement regularization over the policy space which helps to generalize to unseen states and promotes exploration. We theoretically prove EPPO increases exploration efficacy, and through comprehensive experimental evaluations on various tasks, we demonstrate that EPPO achieves higher efficiency and is robust for real-world applications compared with vanilla policy optimization algorithms and other ensemble methods. Code and supplemental materials are available at https://seqml.github.io/eppo.
    FlowPool: Pooling Graph Representations with Wasserstein Gradient Flows. (arXiv:2112.09990v2 [cs.LG] UPDATED)
    In several machine learning tasks for graph structured data, the graphs under consideration may be composed of a varying number of nodes. Therefore, it is necessary to design pooling methods that aggregate the graph representations of varying size to representations of fixed size which can be used in downstream tasks, such as graph classification. Existing graph pooling methods offer no guarantee with regards to the similarity of a graph representation and its pooled version. In this work, we address this limitation by proposing FlowPool, a pooling method that optimally preserves the statistics of a graph representation to its pooled counterpart by minimising their Wasserstein distance. This is achieved by performing a Wasserstein gradient flow with respect to the pooled graph representation. Our method relies on a versatile implementation which can take into account the geometry of the representation space through any ground cost and computes the gradient of the Wasserstein distance with automatic differentiation. We propose the differentiation of the Wasserstein flow layer using an implicit differentiation scheme. Therefore, our pooling method is amenable to automatic differentiation and can be integrated in end-to-end deep learning architectures. Further, FlowPool is invariant to permutations and can therefore be combined with permutation equivariant feature extraction layers in GNNs in order to obtain predictions that are independent of the ordering of the nodes. Experimental results demonstrate that our method leads to an increase in performance compared to existing pooling methods when evaluated on graph classification.
    Time Series Generation with Masked Autoencoder. (arXiv:2201.07006v3 [cs.LG] UPDATED)
    This paper shows that a masked autoencoder with extrapolator (ExtraMAE) is a scalable self-supervised model for time series generation. ExtraMAE randomly masks some patches of the original time series and learns temporal dynamics by recovering the masked patches. Our approach has two core designs. First, ExtraMAE is self-supervised. Self-supervision allows ExtraMAE to effectively and efficiently capture the temporal dynamics of the original time series. Second, ExtraMAE proposes an extrapolator to disentangle two jobs of the decoder: recovering latent representations and mapping them back into the feature space. These unique designs enable ExtraMAE to consistently and significantly outperform state-of-the-art (SoTA) benchmarks in time series generation. The lightweight architecture also makes ExtraMAE fast and scalable. ExtraMAE shows outstanding behavior in various downstream tasks such as time series classification, prediction, and imputation. As a self-supervised generative model, ExtraMAE allows explicit management of the synthetic data. We hope this paper will usher in a new era of time series generation with self-supervised models.
    Speeding Up Entmax. (arXiv:2111.06832v3 [cs.CL] UPDATED)
    Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, because it produces a dense probability distribution, each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. $\alpha$-entmax of Peters et al. (2019, arXiv:1905.05702) solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $\alpha$-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on par or better performance on a machine translation task.
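    To make the dense-versus-sparse contrast concrete, the sketch below compares softmax with sparsemax, which is the $\alpha = 2$ member of the $\alpha$-entmax family. It is only an illustration of the sparsity property using the standard sort-based projection; it is not the faster alternative this paper proposes, and the example logits are made up.

```python
import numpy as np

def softmax(z):
    # Dense: every entry of the output is strictly positive.
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (alpha-entmax, alpha = 2)."""
    z_sorted = np.sort(z)[::-1]                 # logits in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum         # True for the entries that stay nonzero
    k_max = k[support][-1]                      # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold so the output sums to 1
    return np.maximum(z - tau, 0.0)

logits = np.array([1.5, 1.0, 0.1, -1.0])
dense = softmax(logits)      # all four tokens keep nonzero probability
sparse = sparsemax(logits)   # → [0.75, 0.25, 0.0, 0.0]
```

    The sort in `sparsemax` is one reason sparse normalizers tend to cost more than softmax, which needs only an exponential and a sum.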
    Theory of Acceleration of Decision Making by Correlated Time Sequences. (arXiv:2203.16004v3 [cs.LG] UPDATED)
    Photonic accelerators have been intensively studied to provide enhanced information processing capability to benefit from the unique attributes of physical processes. Recently, it has been reported that chaotically oscillating ultrafast time series from a laser, called laser chaos, provides the ability to solve multi-armed bandit (MAB) problems or decision-making problems at GHz order. Furthermore, it has been confirmed that the negatively correlated time-domain structure of laser chaos contributes to the acceleration of decision-making. However, the underlying mechanism of why decision-making is accelerated by correlated time series is unknown. In this paper, we demonstrate a theoretical model to account for the acceleration of decision-making by correlated time sequences. We first confirm the effectiveness of the negative autocorrelation inherent in time series for solving two-armed bandit problems using Fourier transform surrogate methods. We propose a theoretical model that concerns the correlated time series subjected to the decision-making system and the internal status of the system therein in a unified manner, inspired by correlated random walks. We demonstrate that the performance derived analytically by the theory agrees well with the numerical simulations, which confirms the validity of the proposed model and leads to optimal system design. The present study opens a new path toward exploiting correlated time series for decision-making, impacting artificial intelligence and other applications.
    Overcoming challenges in leveraging GANs for few-shot data augmentation. (arXiv:2203.16662v2 [stat.ML] UPDATED)
    In this paper, we explore the use of GAN-based few-shot data augmentation as a method to improve few-shot classification performance. We perform an exploration into how a GAN can be fine-tuned for such a task (one of which is in a class-incremental manner), as well as a rigorous empirical investigation into how well these models can perform to improve few-shot classification. We identify issues related to the difficulty of training such generative models under a purely supervised regime with very few examples, as well as issues regarding the evaluation protocols of existing works. We also find that in this regime, classification accuracy is highly sensitive to how the classes of the dataset are randomly split. Therefore, we propose a semi-supervised fine-tuning approach as a more pragmatic way forward to address these problems.
    Deep Dynamic Effective Connectivity Estimation from Multivariate Time Series. (arXiv:2202.02393v3 [cs.LG] UPDATED)
    Recently, methods that represent data as a graph, such as graph neural networks (GNNs), have been successfully used to learn data representations and structures to solve classification and link prediction problems. The applications of such methods are vast and diverse, but most of the current work relies on the assumption of a static graph. This assumption does not hold for many highly dynamic systems, where the underlying connectivity structure is non-stationary and is mostly unobserved. Using a static model in these situations may result in sub-optimal performance. In contrast, modeling changes in graph structure with time can provide information about the system whose applications go beyond classification. Most work of this type does not learn effective connectivity and focuses on cross-correlation between nodes to generate undirected graphs. An undirected graph is unable to capture the direction of an interaction, which is vital in many fields, including neuroscience. To bridge this gap, we developed dynamic effective connectivity estimation via neural network training (DECENNT), a novel model to learn an interpretable directed and dynamic graph induced by the downstream classification/prediction task. DECENNT outperforms state-of-the-art (SOTA) methods on five different tasks and infers interpretable task-specific dynamic graphs. The dynamic graphs inferred from functional neuroimaging data align well with the existing literature and provide additional information. Additionally, the temporal attention module of DECENNT identifies time intervals crucial for the predictive downstream task from multivariate time series data.
    Backdoor Detection in Reinforcement Learning. (arXiv:2202.03609v2 [cs.LG] UPDATED)
    While the real world application of reinforcement learning (RL) is becoming popular, the safety concern and the robustness of an RL system require more attention. A recent work reveals that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL Backdoor Detection, aiming to address this safety vulnerability. An interesting observation we drew from extensive empirical studies is a trigger smoothness property where normal actions similar to the backdoor trigger actions can also trigger low performance of the trojan agent. Inspired by this observation, we propose a reinforcement learning solution TrojanSeeker to find approximate trigger actions for the trojan agents, and further propose an efficient approach to mitigate the trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the trojan agents across various types of agents and environments.
    STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence. (arXiv:2201.09857v4 [cs.LG] UPDATED)
    It remains challenging to deploy existing risk-averse approaches to real-world applications. The reasons are multi-fold, including the lack of global optimality guarantee and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involving visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term trajectories. Short-term trajectories are more flexible to generate, and can avoid the danger of hazardous state visitations. By using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient, with effectiveness comparable to the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging Mujoco robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate a state-of-the-art level of STOPS' performance among existing risk-averse policy search methods.
    BR-NPA: A Non-Parametric High-Resolution Attention Model to improve the Interpretability of Attention. (arXiv:2106.02566v4 [cs.CV] UPDATED)
    The prevalence of employing attention mechanisms has brought along concerns on the interpretability of attention distributions. Although it provides insights about how a model is operating, utilizing attention as the explanation of model predictions is still highly dubious. The community is still seeking more interpretable strategies for better identifying local active regions that contribute the most to the final decision. To improve the interpretability of existing attention models, we propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures the task-relevant human-interpretable information. The target model is first distilled to have higher-resolution intermediate feature maps, from which representative features are then grouped based on local pairwise feature similarity to produce finer-grained, more precise attention maps highlighting task-relevant parts of the input. The obtained attention maps are ranked according to the activity level of the compound feature, which provides information regarding the importance level of the highlighted regions. The proposed model can be easily adapted to a wide variety of modern deep models where classification is involved. Extensive quantitative and qualitative experiments showcase more comprehensive and accurate visual explanations compared to state-of-the-art attention models and visualization methods across multiple tasks including fine-grained image classification, few-shot classification, and person re-identification, without compromising the classification accuracy. The proposed visualization model sheds imperative light on how neural networks 'pay their attention' differently in different tasks.
    k-strip: A novel segmentation algorithm in k-space for the application of skull stripping. (arXiv:2205.09706v1 [eess.IV])
    Objectives: Present a novel deep learning-based skull stripping algorithm for magnetic resonance imaging (MRI) that works directly in the information-rich k-space. Materials and Methods: Using two datasets from different institutions with a total of 36,900 MRI slices, we trained a deep learning-based model to work directly with the complex raw k-space data. Skull stripping performed by HD-BET (Brain Extraction Tool) in the image domain was used as the ground truth. Results: Predictions on both datasets were very similar to the ground truth (DICE scores of 92%-98% and Hausdorff distances of under 5.5 mm). Results on slices above the eye region reach DICE scores of up to 99%, while the accuracy drops in regions around the eyes and below, with partially blurred output. The output of k-strip often showed smoothed edges at the demarcation to the skull. Binary masks are created with an appropriate threshold. Conclusion: With this proof-of-concept study, we were able to show the feasibility of working in the k-space frequency domain, preserving phase information, with consistent results. Future research should be dedicated to discovering additional ways the k-space can be used for innovative image analysis and further workflows.
    Challenges in Deploying Machine Learning: a Survey of Case Studies. (arXiv:2011.09926v3 [cs.LG] UPDATED)
    In recent years, machine learning has transitioned from a field of academic research interest to a field capable of solving real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. By mapping found challenges to the steps of the machine learning deployment workflow we show that practitioners face issues at each stage of the deployment process. The goal of this paper is to lay out a research agenda to explore approaches addressing these challenges.
    Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition. (arXiv:2203.17066v2 [eess.SP] UPDATED)
    Gesture recognition is one of the most intuitive ways of interaction and has gathered particular attention for human-computer interaction. Radar sensors possess multiple intrinsic properties, such as their ability to work in low illumination and harsh weather conditions, and being low-cost and compact, making them highly preferable for a gesture recognition solution. However, most work in the literature focuses on solutions with a limited range of less than a meter. We propose a novel architecture for a long-range (1m - 2m) gesture recognition solution that leverages a point cloud-based cross-learning approach from camera point cloud to 60-GHz FMCW radar point cloud, which allows learning better representations while suppressing noise. We use a variant of Dynamic Graph CNN (DGCNN) for the cross-learning, enabling us to model relationships between the points at a local and global level, and we employ a Bi-LSTM network to model the temporal dynamics. In the experimental results section, we demonstrate our model's overall accuracy of 98.4% for five gestures and its generalization capability.
    Heterogeneous Multi-task Learning with Expert Diversity. (arXiv:2106.10595v2 [cs.LG] UPDATED)
    Predicting multiple heterogeneous biological and medical targets is a challenge for traditional deep learning models. In contrast to single-task learning, in which a separate model is trained for each target, multi-task learning (MTL) optimizes a single model to predict multiple related targets simultaneously. To address this challenge, we propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx). Our work aims to tackle the heterogeneous MTL setting, in which the same model optimizes multiple tasks with different characteristics. Such a scenario can overwhelm current MTL approaches due to the challenges in balancing shared and task-specific representations and the need to optimize tasks with competing optimization paths. Our method makes two key contributions: first, we introduce an approach to induce more diversity among experts, thus creating representations more suitable for highly imbalanced and heterogeneous MTL; second, we adopt a two-step optimization [6, 11] approach to balancing the tasks at the gradient level. We validate our method on three MTL benchmark datasets, including Medical Information Mart for Intensive Care (MIMIC-III) and PubChem BioAssay (PCBA).
    On a class of data-driven mixed-integer programming problems under uncertainty: a distributionally robust approach. (arXiv:2105.14139v3 [math.OC] UPDATED)
    In this study we analyze linear mixed-integer programming problems, in which the distribution of the cost vector is only observable through a finite training data set. In contrast to the related studies, we assume that the number of random observations for each component of the cost vector may vary. Then the goal is to find a prediction rule that converts the data set into an estimate of the expected value of the objective function and a prescription rule that provides an associated estimate of the optimal decision. We aim at finding the least conservative prediction and prescription rules, which satisfy some specified asymptotic guarantees as the sample size tends to infinity. We demonstrate that under some mild assumption the resulting vector optimization problems admit a Pareto optimal solution with some attractive theoretical properties. In particular, this solution can be obtained by solving a distributionally robust optimization (DRO) problem with respect to all probability distributions with given component-wise relative entropy distances from the empirical marginal distributions. It turns out that the outlined DRO problem can be solved rather effectively whenever there exists an effective algorithm for the respective deterministic problem. In addition, we perform numerical experiments where the out-of-sample performance of the proposed approach is analyzed.
    Scalable Multi-view Clustering with Graph Filtering. (arXiv:2205.09228v1 [cs.LG])
    With the explosive growth of multi-source data, multi-view clustering has attracted great attention in recent years. Most existing multi-view methods operate in raw feature space and heavily depend on the quality of the original feature representation. Moreover, they are often designed for feature data and ignore the rich topology structure information. Accordingly, in this paper, we propose a generic framework to cluster both attribute and graph data with heterogeneous features. It is capable of exploring the interplay between feature and structure. Specifically, we first adopt a graph filtering technique to eliminate high-frequency noise and achieve a clustering-friendly smooth representation. To handle the scalability challenge, we develop a novel sampling strategy to improve the quality of anchors. Extensive experiments on attribute and graph benchmarks demonstrate the superiority of our approach with respect to state-of-the-art approaches.
    Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification. (arXiv:2205.09619v1 [cs.LG])
    The reliability of neural networks is essential for their use in safety-critical applications. Existing approaches generally aim at improving the robustness of neural networks to either real-world distribution shifts (e.g., common corruptions and perturbations, spatial transformations, and natural adversarial examples) or worst-case distribution shifts (e.g., optimized adversarial examples). In this work, we propose the Decision Region Quantification (DRQ) algorithm to improve the robustness of any differentiable pre-trained model against both real-world and worst-case distribution shifts in the data. DRQ analyzes the robustness of local decision regions in the vicinity of a given data point to make more reliable predictions. We theoretically motivate the DRQ algorithm by showing that it effectively smooths spurious local extrema in the decision surface. Furthermore, we propose an implementation using targeted and untargeted adversarial attacks. An extensive empirical evaluation shows that DRQ increases the robustness of adversarially and non-adversarially trained models against real-world and worst-case distribution shifts on several computer vision benchmark datasets.
    Robust and Efficient Medical Imaging with Self-Supervision. (arXiv:2205.09723v1 [cs.CV])
    Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate systems for each clinical setting using site-specific data [1]. However, this quickly becomes impractical as medical data is time-consuming to acquire and expensive to annotate [2]. Thus, the problem of "data-efficient generalization" presents an ongoing difficulty for Medical AI development. Although progress in representation learning shows promise, its benefits have not been rigorously studied, specifically for out-of-distribution settings. To meet these challenges, we present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI. REMEDIS uses a generic combination of large-scale supervised transfer learning with self-supervised learning and requires little task-specific customization. We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data. REMEDIS exhibits significantly improved in-distribution performance with up to 11.5% relative improvement in diagnostic accuracy over a strong supervised baseline. More importantly, our strategy leads to strong data-efficient generalization of medical imaging AI, matching strong supervised baselines using between 1% and 33% of retraining data across tasks. These results suggest that REMEDIS can significantly accelerate the life-cycle of medical imaging AI development, thereby presenting an important step forward for medical imaging AI to deliver broad impact.
    Overcoming Language Disparity in Online Content Classification with Multimodal Learning. (arXiv:2205.09744v1 [cs.LG])
    Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, sidelining a majority of the languages spoken globally. While existing research has developed better multilingual and monolingual language models to bridge this language disparity between English and non-English languages, we explore the promise of incorporating the information contained in images via multimodal machine learning. Our comparative analyses on three detection tasks focusing on crisis information, fake news, and emotion recognition, as well as five high-resource non-English languages, demonstrate that: (a) detection frameworks based on pre-trained large language models like BERT and multilingual-BERT systematically perform better on the English language compared against non-English languages, and (b) including images via multimodal learning bridges this performance gap. We situate our findings with respect to existing work on the pitfalls of large language models, and discuss their theoretical and practical implications. Resources for this paper are available at https://multimodality-language-disparity.github.io/.
    A Causal Bandit Approach to Learning Good Atomic Interventions in Presence of Unobserved Confounders. (arXiv:2107.02772v2 [cs.LG] UPDATED)
    We study the problem of determining the best intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where the interventions correspond to the arms of the bandit instance. First, we propose a simple regret minimization algorithm that takes as input a semi-Markovian causal graph with atomic interventions and possibly unobservable variables, and achieves $\tilde{O}(\sqrt{M/T})$ expected simple regret, where $M$ is dependent on the input CBN and could be very small compared to the number of arms. We also show that this is almost optimal for CBNs described by causal graphs having an $n$-ary tree structure. Our simple regret minimization results, both upper and lower bound, subsume previous results in the literature, which assumed additional structural restrictions on the input causal graph. In particular, our results indicate that the simple regret guarantee of our proposed algorithm can only be improved by considering more nuanced structural restrictions on the causal graph. Next, we propose a cumulative regret minimization algorithm that takes as input a general causal graph with all observable nodes and atomic interventions and performs better than the optimal MAB algorithm that does not take causal side-information into account. We also experimentally compare both our algorithms with the best known algorithms in the literature. To the best of our knowledge, this work gives the first simple and cumulative regret minimization algorithms for CBNs with general causal graphs under atomic interventions and having unobserved confounders.
    Diverse Weight Averaging for Out-of-Distribution Generalization. (arXiv:2205.09739v1 [cs.CV])
    Standard neural networks struggle to generalize under distribution shifts. For out-of-distribution generalization in computer vision, the best current approach averages the weights along a training run. In this paper, we propose Diverse Weight Averaging (DiWA) that makes a simple change to this strategy: DiWA averages the weights obtained from several independent training runs rather than from a single run. Perhaps surprisingly, averaging these weights performs well under soft constraints despite the network's nonlinearities. The main motivation behind DiWA is to increase the functional diversity across averaged models. Indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between DiWA and standard functional ensembling. Moreover, this decomposition highlights that DiWA succeeds when the variance term dominates, which we show happens when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on the competitive DomainBed benchmark without inference overhead.
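    The averaging step described in this abstract can be illustrated with a minimal sketch. It assumes uniform averaging over runs that share an architecture, with each model stored as a plain dictionary of flattened parameter lists; the helper name `average_weights` and the toy values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical storage format: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def average_weights(runs: List[Weights]) -> Weights:
    """Uniformly average same-named parameters across independent runs."""
    if not runs:
        raise ValueError("need at least one run")
    averaged: Weights = {}
    for name in runs[0]:
        # Group the i-th entry of this parameter from every run together.
        columns = zip(*(run[name] for run in runs))
        averaged[name] = [sum(col) / len(runs) for col in columns]
    return averaged


# Three made-up runs of the same tiny architecture.
runs = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [0.5]},
    {"w": [2.0, 0.0], "b": [1.0]},
]
diwa = average_weights(runs)
print(diwa)
```

    Because the result is a single set of weights, inference costs the same as for one model, which is consistent with the "without inference overhead" claim in the abstract.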
    Bayesian Network Structure Learning using Digital Annealer. (arXiv:2006.06926v3 [cs.LG] UPDATED)
    Annealing processors, which solve a quadratic unconstrained binary optimization (QUBO), are a potential breakthrough in improving the accuracy of score-based Bayesian network structure learning. However, currently, the bit capacity of an annealing processor is very limited. To utilize the power of annealing processors, it is necessary to encode score-based learning problems into QUBO within the upper bound of bits. In this paper, we propose a novel approach with the decomposition of candidate parent sets. Experimental results on benchmark networks with $37$ to $223$ variables show that our approach requires fewer bits than the bit capacity of the fourth-generation Fujitsu Digital Annealer, a fully coupled annealing processor developed with semiconductor technology. Moreover, we demonstrate that the Digital Annealer with our conversion method outperforms existing algorithms on some benchmark networks. It is expected that our approach promotes the utility of annealing processors in learning the Bayesian network.
    Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. (arXiv:2205.09702v1 [cs.LG])
    Graph neural networks (GNNs) are among the most powerful tools in deep learning. They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy. However, both inference and training of GNNs are complex, and they uniquely combine the features of irregular graph processing with dense and regular computations. This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures. To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining. Then, we use this taxonomy to investigate the amount of parallelism in numerous GNN models, GNN-driven machine learning tasks, software frameworks, or hardware accelerators. We use the work-depth model, and we also assess communication volume and synchronization. We specifically focus on the sparsity/density of the associated tensors, in order to understand how to effectively apply techniques such as vectorization. We also formally analyze GNN pipelining, and we generalize the established Message-Passing class of GNN models to cover arbitrary pipeline depths, facilitating future optimizations. Finally, we investigate different forms of asynchronicity, navigating the path for future asynchronous parallel GNN pipelines. The outcomes of our analysis are synthesized in a set of insights that help to maximize GNN performance, and a comprehensive list of challenges and opportunities for further research into efficient GNN computations. Our work will help to advance the design of future GNNs.
    Flexible Modeling and Multitask Learning using Differentiable Tree Ensembles. (arXiv:2205.09717v1 [cs.LG])
    Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those by popular toolkits.
    Spherical Perspective on Learning with Normalization Layers. (arXiv:2006.13382v3 [cs.LG] UPDATED)
    Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, allows us to translate the optimization steps onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, the first effective learning rate expression of Adam is derived. Then, the framework demonstrates that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, this analysis outlines phenomena that previous variants of Adam act on, and their importance in the optimization process is experimentally validated.
    Provably Precise, Succinct and Efficient Explanations for Decision Trees. (arXiv:2205.09569v1 [cs.AI])
    Decision trees (DTs) embody interpretable classifiers. DTs have been advocated for deployment in high-risk applications, but also for explaining other complex classifiers. Nevertheless, recent work has demonstrated that predictions in DTs ought to be explained with rigorous approaches. Although rigorous explanations can be computed in polynomial time for DTs, their size may be beyond the cognitive limits of human decision makers. This paper investigates the computation of $\delta$-relevant sets for DTs. $\delta$-relevant sets denote explanations that are succinct and provably precise. These sets represent generalizations of rigorous explanations, which are precise with probability one, and so they enable trading off explanation size for precision. The paper proposes two logic encodings for computing smallest $\delta$-relevant sets for DTs. The paper further devises a polynomial-time algorithm for computing $\delta$-relevant sets which are not guaranteed to be subset-minimal, but which the experiments show to be most often subset-minimal in practice. The experimental results also demonstrate the practical efficiency of computing smallest $\delta$-relevant sets.
    Metrics of calibration for probabilistic predictions. (arXiv:2205.09680v1 [math.ST])
    Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose statistically significant discrepancies -- so-called "miscalibration" -- between the predictions and the outcomes. The canonical reliability diagrams histogram the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation is another common practice. But which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question, by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram as a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph which deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. In particular, metrics based on binning or kernel density estimation unavoidably must trade off statistical confidence for the ability to resolve variations as a function of the predicted probability or vice versa. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise.
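    The cumulative-difference idea from this abstract can be sketched in a few lines. The version below assumes binary outcomes, sorts by predicted probability, and summarizes the curve by its maximum absolute deviation from zero; that particular scalar summary and the function names are illustrative assumptions, not necessarily the exact metrics defined in the paper.

```python
from typing import List


def cumulative_differences(probs: List[float], outcomes: List[int]) -> List[float]:
    """Running sum of (outcome - predicted probability) / n,
    accumulated in order of increasing predicted probability."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    total = 0.0
    curve = []
    for i in order:
        total += (outcomes[i] - probs[i]) / n
        curve.append(total)
    return curve


def max_abs_deviation(curve: List[float]) -> float:
    """One possible scalar summary: largest excursion of the curve from zero."""
    return max(abs(c) for c in curve)


# Made-up predictions and binary outcomes, deliberately unsorted.
probs = [0.9, 0.1, 0.6, 0.4]
outcomes = [1, 0, 1, 0]
curve = cumulative_differences(probs, outcomes)
print(curve, max_abs_deviation(curve))
```

    No bin width or kernel bandwidth appears anywhere in the computation, which is the point the abstract makes about avoiding that choice.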
    Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks. (arXiv:2205.09683v1 [cs.RO])
    This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend this method, by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation, while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration due to increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and average orientation deviation from 142° to 76° during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
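    A reward that combines a goal-based sparse term with a distance term, as mentioned in this abstract, might look like the sketch below. The success threshold, the weighting, the sign convention (0 on success, -1 otherwise), and the name `shaped_reward` are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import Sequence


def shaped_reward(
    cube_pos: Sequence[float],
    goal_pos: Sequence[float],
    threshold: float = 0.02,       # assumed success radius in metres
    distance_weight: float = 1.0,  # assumed weight of the dense term
) -> float:
    """Sparse goal term (0 if within threshold, else -1) plus a
    dense penalty proportional to the Euclidean distance to the goal."""
    distance = math.dist(cube_pos, goal_pos)
    sparse = 0.0 if distance <= threshold else -1.0
    return sparse - distance_weight * distance


at_goal = shaped_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
far_away = shaped_reward((0.0, 0.0, 0.0), (0.3, 0.4, 0.0))
print(at_goal, far_away)
```

    The dense term gives the agent a gradient toward the goal even when the sparse term is constant, which is the usual motivation for mixing the two.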
    Neural network topological snake models for locating general phase diagrams. (arXiv:2205.09699v1 [cond-mat.stat-mech])
    Machine learning for locating phase diagrams has received intensive research interest in recent years. However, its application in automatically locating phase diagrams is limited to a single closed phase boundary. In this paper, in order to locate phase diagrams with multiple phases and complex boundaries, we introduce (i) a network-shaped snake model and (ii) a topologically transformable snake with discriminative cooperative networks. The phase diagrams of both quantum and classical spin-1 models are obtained. Our method is flexible enough to determine the phase diagram with just snapshots of configurations from cold-atom or other experiments.
    Semi-WTC: A Practical Semi-supervised Framework for Attack Categorization through Weight-Task Consistency. (arXiv:2205.09669v1 [cs.CR])
    Supervised learning has been widely used for attack detection, which requires large amounts of high-quality data and labels. However, the data is often imbalanced and sufficient annotations are difficult to obtain. Moreover, these supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. We propose a semi-supervised fine-grained attack categorization framework consisting of an encoder and a two-branch structure to integrate information from labeled and unlabeled data to tackle these practical challenges. This framework can be generalized to different supervised models. The multilayer perceptron with residual connection and batch normalization is used as the encoder to extract features and reduce the complexity. The Recurrent Prototype Module (RPM) is proposed to train the encoder effectively in a semi-supervised manner. To alleviate the problem of data imbalance, we introduce the Weight-Task Consistency (WTC) into the iterative process of RPM by assigning larger weights to classes with fewer samples in the loss function. In addition, to cope with new attacks in real-world deployment, we further propose an Active Adaption Resampling (AAR) method, which can better discover the distribution of the unseen sample data and adapt the parameters of the encoder. Experimental results show that our model outperforms the state-of-the-art semi-supervised attack detection methods with a general 5% improvement in classification accuracy and a 90% reduction in training time.
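    The idea of assigning larger loss weights to classes with fewer samples can be sketched as follows. Inverse-frequency weighting is used here purely as an assumed, common choice; the abstract does not state the actual WTC weighting rule, and the function names and attack labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(
    probs: List[Dict[str, float]], labels: List[str], weights: Dict[str, float]
) -> float:
    """Mean over samples of -weight[y] * log p(y)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


# Made-up, imbalanced labels: three "dos" samples and one "probe" sample.
labels = ["dos", "dos", "dos", "probe"]
weights = inverse_frequency_weights(labels)
probs = [{"dos": 0.9, "probe": 0.1}] * 3 + [{"dos": 0.6, "probe": 0.4}]
loss = weighted_cross_entropy(probs, labels, weights)
print(weights, loss)
```

    With these weights, a mistake on the single "probe" sample contributes three times as much to the loss as the same mistake on a "dos" sample.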
    Discovering Dynamic Functional Brain Networks via Spatial and Channel-wise Attention. (arXiv:2205.09576v1 [cs.CV])
    Using deep learning models to recognize functional brain networks (FBNs) in functional magnetic resonance imaging (fMRI) has been attracting increasing interest recently. However, most existing work focuses on detecting static FBNs from entire fMRI signals, such as correlation-based functional connectivity. Sliding-window is a widely used strategy to capture the dynamics of FBNs, but it is still limited in representing intrinsic functional interactive dynamics at each time step. In addition, the number of FBNs usually needs to be set manually. Moreover, due to the complexity of dynamic interactions in the brain, traditional linear and shallow models are insufficient in identifying complex and spatially overlapped FBNs across each time step. In this paper, we propose a novel Spatial and Channel-wise Attention Autoencoder (SCAAE) for discovering FBNs dynamically. The core idea of SCAAE is to apply an attention mechanism to FBN construction. Specifically, we designed two attention modules: 1) a spatial-wise attention (SA) module to discover FBNs in the spatial domain and 2) a channel-wise attention (CA) module to weigh the channels for selecting the FBNs automatically. We evaluated our approach on the ADHD200 dataset and our results indicate that the proposed SCAAE method can effectively recover the dynamic changes of the FBNs at each fMRI time step, without using sliding windows. More importantly, our proposed hybrid attention modules (SA and CA) do not enforce assumptions of linearity and independence as previous methods do, and thus provide a novel approach to better understanding dynamic functional brain networks.
    Jacobian Granger Causal Neural Networks for Analysis of Stationary and Nonstationary Data. (arXiv:2205.09573v1 [cs.LG])
    Granger causality is a commonly used method for uncovering information flow and dependencies in a time series. Here we introduce JGC (Jacobian Granger Causality), a neural network-based approach to Granger causality using the Jacobian as a measure of variable importance, and propose a thresholding procedure for inferring Granger causal variables using this measure. The resulting approach performs consistently well compared to other approaches in identifying Granger causal variables, the associated time lags, as well as interaction signs. Lastly, through the inclusion of a time variable, we show that this approach is able to learn the temporal dependencies for nonstationary systems whose Granger causal structures change in time.
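As a rough illustration of using the Jacobian as a variable-importance measure, the sketch below averages the absolute partial derivatives of a scalar predictor over a set of inputs and keeps the inputs whose score exceeds a threshold. Everything here is an assumption made for illustration: `toy_model` stands in for a trained network, the derivatives are taken by central finite differences rather than automatic differentiation, and the fixed threshold of 0.05 is arbitrary. The abstract does not describe the paper's actual thresholding procedure.

```python
def jacobian_importance(f, samples, eps=1e-5):
    """Mean absolute partial derivative of scalar f with respect to each input."""
    n_inputs = len(samples[0])
    totals = [0.0] * n_inputs
    for x in samples:
        for j in range(n_inputs):
            hi, lo = list(x), list(x)
            hi[j] += eps
            lo[j] -= eps
            # Central finite difference approximates df/dx_j at x.
            totals[j] += abs((f(hi) - f(lo)) / (2 * eps))
    return [t / len(samples) for t in totals]


def toy_model(x):
    # Hypothetical stand-in for a trained network: depends on x[0] and x[2] only.
    return 0.8 * x[0] - 0.5 * x[2] ** 2


samples = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(1, 6)]
scores = jacobian_importance(toy_model, samples)
# Inputs whose average sensitivity exceeds the (arbitrary) threshold are flagged.
causal = [j for j, s in enumerate(scores) if s > 0.05]
```

In a time-series setting the inputs would be lagged values of each variable, so a flagged input identifies both a candidate Granger-causal variable and its lag.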
    Detect Professional Malicious User with Metric Learning in Recommender Systems. (arXiv:2205.09673v1 [cs.IR])
    In e-commerce, online retailers are usually suffering from professional malicious users (PMUs), who utilize negative reviews and low ratings to their consumed products on purpose to threaten the retailers for illegal profits. Specifically, there are three challenges for PMU detection: 1) professional malicious users do not conduct any abnormal or illegal interactions (they never concurrently leave too many negative reviews and low ratings at the same time), and they conduct masking strategies to disguise themselves. Therefore, conventional outlier detection methods are confused by their masking strategies. 2) the PMU detection model should take both ratings and reviews into consideration, which makes PMU detection a multi-modal problem. 3) there are no datasets with labels for professional malicious users in public, which makes PMU detection an unsupervised learning problem. To this end, we propose an unsupervised multi-modal learning model: MMD, which employs Metric learning for professional Malicious users Detection with both ratings and reviews. MMD first utilizes a modified RNN to project the informational review into a sentiment score, which jointly considers the ratings and reviews. Then professional malicious user profiling (MUP) is proposed to catch the sentiment gap between sentiment scores and ratings. MUP filters the users and builds a candidate PMU set. We apply a metric learning-based clustering to learn a proper metric matrix for PMU detection. Finally, we can utilize this metric and labeled users to detect PMUs. Specifically, we apply the attention mechanism in metric learning to improve the model's performance. The extensive experiments in four datasets demonstrate that our proposed method can solve this unsupervised detection problem. Moreover, the performance of the state-of-the-art recommender models is enhanced by taking MMD as a preprocessing stage.
    Are Graph Representation Learning Methods Robust to Graph Sparsity and Asymmetric Node Information?. (arXiv:2205.09648v1 [cs.LG])
    The growing popularity of Graph Representation Learning (GRL) methods has resulted in the development of a large number of models applied to a miscellany of domains. Behind this diversity of domains, there is a strong heterogeneity of graphs, making it difficult to estimate the expected performance of a model on a new graph, especially when the graph has distinctive characteristics that have not been encountered in the benchmark yet. To address this, we have developed an experimental pipeline to assess the impact of a given property on the models' performance. In this paper, we use this pipeline to study the effect of two specificities encountered on banks' transactional graphs, resulting from the partial view a bank has of all the individuals and transactions carried out on the market. These specific features are graph sparsity and asymmetric node information. This study demonstrates the robustness of GRL methods to these distinctive characteristics. We believe that this work can ease the evaluation of GRL methods with respect to specific characteristics and foster the development of such methods on transactional graphs.
    HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering. (arXiv:2205.09721v1 [cs.LG])
    The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is ``more'' tree-like, when evaluated in terms of Gromov's $\delta$ hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94\%$.
    Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow. (arXiv:2205.09703v1 [cs.LG])
    In modeling time series data, we often need to augment the existing data records to increase the modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to similar models with only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
    What killed the Convex Booster ?. (arXiv:2205.09628v1 [cs.LG])
    A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: \textit{parameterisation}.
    EXACT: How to Train Your Accuracy. (arXiv:2205.09615v1 [cs.LG])
    Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, Hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework by introducing stochasticity to a model's output and optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive experiments on image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
    Named Entity Recognition, Multi-Task Learning, Nested Entities, BERT, Arabic NER Corpus. (arXiv:2205.09651v1 [cs.CL])
    This paper presents Wojood, a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. Wojood consists of about 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types including person, organization, location, event and date. More importantly, the corpus is annotated with nested entities instead of the more common flat annotations. The data contains about 75K entities and 22.5% of which are nested. The inter-annotator evaluation of the corpus demonstrated a strong agreement with Cohen's Kappa of 0.979 and an F1-score of 0.976. To validate our data, we used the corpus to train a nested NER model based on multi-task learning and AraBERT (Arabic BERT). The model achieved an overall micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
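For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's Kappa for two annotators labeling the same tokens. The tag sequences are made-up toy data, not drawn from Wojood, and the abstract does not say how agreement was computed over nested annotations.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy example: the annotators disagree on one token out of six.
ann_a = ["PER", "ORG", "LOC", "PER", "O", "O"]
ann_b = ["PER", "ORG", "PER", "PER", "O", "O"]
kappa = cohens_kappa(ann_a, ann_b)
```

Here the observed agreement is 5/6 and the chance agreement is 11/36, which gives a kappa of 0.76.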
    The AI Mechanic: Acoustic Vehicle Characterization Neural Networks. (arXiv:2205.09667v1 [cs.SD])
    In a world increasingly dependent on road-based transportation, it is essential to understand vehicles. We introduce the AI mechanic, an acoustic vehicle characterization deep learning system, as an integrated approach using sound captured from mobile devices to enhance transparency and understanding of vehicles and their condition for non-expert users. We develop and implement novel cascading architectures for vehicle understanding, which we define as sequential, conditional, multi-level networks that process raw audio to extract highly-granular insights. To showcase the viability of cascading architectures, we build a multi-task convolutional neural network that predicts and cascades vehicle attributes to enhance fault detection. We train and test these models on a synthesized dataset reflecting more than 40 hours of augmented audio and achieve >92% validation set accuracy on attributes (fuel type, engine configuration, cylinder count and aspiration type). Our cascading architecture additionally achieved 93.6% validation and 86.8% test set accuracy on misfire fault prediction, demonstrating margins of 16.4% / 7.8% and 4.2% / 1.5% improvement over naïve and parallel baselines. We explore experimental studies focused on acoustic features, data augmentation, feature fusion, and data reliability. Finally, we conclude with a discussion of broader implications, future directions, and application areas for this work.
    ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD. (arXiv:2205.09685v1 [cs.CL])
    Using pre-trained transformer models such as BERT has proven to be effective in many NLP tasks. This paper presents our work to fine-tune BERT models for Arabic Word Sense Disambiguation (WSD). We treated the WSD task as a sentence-pair binary classification task. First, we constructed a dataset of labeled Arabic context-gloss pairs (~167k pairs) we extracted from the Arabic Ontology and the large lexicographic database available at Birzeit University. Each pair was labeled as True or False and target words in each context were identified and annotated. Second, we used this dataset for fine-tuning three pre-trained Arabic BERT models. Third, we experimented the use of different supervised signals used to emphasize target words in context. Our experiments achieved promising results (accuracy of 84%) although we used a large set of senses in the experiment.
    A Topological Approach for Semi-Supervised Learning. (arXiv:2205.09617v1 [cs.CV])
    Nowadays, Machine Learning and Deep Learning methods have become the state-of-the-art approach to solve data classification tasks. In order to use those methods, it is necessary to acquire and label a considerable amount of data; however, this is not straightforward in some fields, since data annotation is time consuming and might require expert knowledge. This challenge can be tackled by means of semi-supervised learning methods that take advantage of both labelled and unlabelled data. In this work, we present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA), a field that is gaining importance for analysing large amounts of data with high variety and dimensionality. In particular, we have created two semi-supervised learning methods following two different topological approaches. In the former, we have used a homological approach that consists in studying the persistence diagrams associated with the data using the Bottleneck and Wasserstein distances. In the latter, we have taken into account the connectivity of the data. In addition, we have carried out a thorough analysis of the developed methods using 3 synthetic datasets, 5 structured datasets, and 2 datasets of images. The results show that the semi-supervised methods developed in this work outperform both the models trained with only manually labelled data and classical semi-supervised learning methods, reaching improvements of up to 16%.
    Data Valuation for Offline Reinforcement Learning. (arXiv:2205.09550v1 [cs.LG])
    The success of deep reinforcement learning (DRL) hinges on the availability of training data, which is typically obtained via a large number of environment interactions. In many real-world scenarios, costs and risks are associated with gathering these data. The field of offline reinforcement learning addresses these issues through outsourcing the collection of data to a domain expert or a carefully monitored program and subsequently searching for a batch-constrained optimal policy. With the emergence of data markets, an alternative to constructing a dataset in-house is to purchase external data. However, while state-of-the-art offline reinforcement learning approaches have shown a lot of promise, they currently rely on carefully constructed datasets that are well aligned with the intended target domains. This raises questions regarding the transferability and robustness of an offline reinforcement learning agent trained on externally acquired data. In this paper, we empirically evaluate the ability of the current state-of-the-art offline reinforcement learning approaches to coping with the source-target domain mismatch within two MuJoCo environments, finding that current state-of-the-art offline reinforcement learning algorithms underperform in the target domain. To address this, we propose data valuation for offline reinforcement learning (DVORL), which allows us to identify relevant and high-quality transitions, improving the performance and transferability of policies learned by offline reinforcement learning algorithms. The results show that our method outperforms offline reinforcement learning baselines on two MuJoCo environments.
    Focused Adversarial Attacks. (arXiv:2205.09624v1 [cs.LG])
    Recent advances in machine learning show that neural models are vulnerable to minimally perturbed inputs, or adversarial examples. Adversarial algorithms are optimization problems that minimize the accuracy of ML models by perturbing inputs, often using a model's loss function to craft such perturbations. State-of-the-art object detection models are characterized by very large output manifolds due to the number of possible locations and sizes of objects in an image. This leads to their outputs being sparse and optimization problems that use them incur a lot of unnecessary computation. We propose to use a very limited subset of a model's learned manifold to compute adversarial examples. Our \textit{Focused Adversarial Attacks} (FA) algorithm identifies a small subset of sensitive regions to perform gradient-based adversarial attacks. FA is significantly faster than other gradient-based attacks when a model's manifold is sparsely activated. Also, its perturbations are more efficient than other methods under the same perturbation constraints. We evaluate FA on the COCO 2017 and Pascal VOC 2007 detection datasets.
    How catastrophic can catastrophic forgetting be in linear regression?. (arXiv:2205.09588v1 [cs.LG])
    To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering.
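The cyclic-ordering bound quoted above is easy to evaluate numerically. The sketch below computes T^2 * min{1/sqrt(k), d/k} exactly as written in the abstract; the function name is hypothetical, and any constants or conditions attached to the bound in the paper itself are not reproduced here.

```python
import math


def cyclic_forgetting_bound(T, d, k):
    """Evaluate T^2 * min(1/sqrt(k), d/k) for T tasks, d dimensions, k cyclic iterations."""
    return T ** 2 * min(1 / math.sqrt(k), d / k)


# With d = 10 the two terms coincide at k = d^2 = 100; beyond that d/k is smaller.
at_crossover = cyclic_forgetting_bound(T=3, d=10, k=100)
late = cyclic_forgetting_bound(T=3, d=10, k=10_000)
```

Since d/k < 1/sqrt(k) exactly when k > d^2, the dimension-dependent term governs the bound only after more than d^2 iterations; before that the dimension-free 1/sqrt(k) rate applies.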
    Disentangling Active and Passive Cosponsorship in the U.S. Congress. (arXiv:2205.09674v1 [cs.LG])
    In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.
    Smooth densities and generative modeling with unsupervised random forests. (arXiv:2205.09435v1 [stat.ML])
    Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.
    IFTT-PIN: A PIN-Entry Method Leveraging the Self-Calibration Paradigm. (arXiv:2205.09534v1 [cs.HC])
    IFTT-PIN is a self-calibrating version of the PIN-entry method introduced in Roth et al. (2004) [1]. In [1], digits are split into two sets and assigned a color respectively. To communicate their digit, users press the button with the same color that is assigned to their digit, which can thus be identified by elimination after a few iterations. IFTT-PIN uses the same principle but does not pre-assign colors to each button. Instead, users are free to choose which button to use for each color. The button-to-color mapping only exists in the user's mind and is never directly communicated to the interface. In other words, IFTT-PIN infers both the user's PIN and their preferred button-to-color mapping at the same time, a process called self-calibration. In this paper, we present online interactive demonstrations of IFTT-PIN (available at https://github.com/jgrizou/IFTT-PIN), with and without self-calibration, and introduce the key concepts and assumptions making self-calibration possible. We review related work in the field of brain-computer interface and further propose self-calibration as a novel approach to protect users against shoulder surfing attacks. Finally, we introduce a vault cracking challenge as a test of usability and security that was informally tested at our institute. With IFTT-PIN, we wish to demonstrate a new interactive experience where users can decide actively and on-the-fly how to use an interface. The self-calibration paradigm might lead to novel opportunities for interaction in other applications or domains. We hope this work will inspire the community to invent them.
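The elimination principle described above, for the non-self-calibrating case where the button-to-color mapping is known, can be sketched as follows. The details are assumptions made for illustration: the two colors are called black and white, the digits are split five and five at random each round, and the user is simulated as always answering correctly. The self-calibrating inference that IFTT-PIN adds is not shown.

```python
import random


def identify_digit(secret, rng, max_rounds=50):
    """Recover `secret` by elimination from the color the user reports each round."""
    candidates = set(range(10))
    rounds = 0
    while len(candidates) > 1 and rounds < max_rounds:
        digits = list(range(10))
        rng.shuffle(digits)
        black = set(digits[:5])            # the remaining five digits are white
        user_says_black = secret in black  # simulated, error-free button press
        # Keep only the digits that carry the color the user reported.
        candidates &= black if user_says_black else set(range(10)) - black
        rounds += 1
    return candidates, rounds


found, rounds = identify_digit(7, random.Random(0))
```

Because the true digit always carries the reported color it is never eliminated, while every other digit is removed as soon as it lands in the opposite set. An observer who sees only button presses, and not the color assignment, learns much less, which is the property the abstract builds on.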
    Data-driven prediction of Air Traffic Controllers reactions to resolving conflicts. (arXiv:2205.09539v1 [cs.AI])
    With the aim of enhancing automation in conflict detection and resolution (CD&R) tasks in the Air Traffic Management domain, in this paper we propose deep learning (DL) techniques that can learn models of Air Traffic Controllers' (ATCO) reactions in resolving conflicts that can violate separation minimum constraints among aircraft trajectories. This implies learning when the ATCO will react towards resolving a conflict, and how he/she will react. Timely reactions, the focus of this paper, concern when reactions happen: the aim is to predict the trajectory points, as the trajectory evolves, at which the ATCO issues a conflict resolution action, while also predicting the type of resolution action (if any). Towards this goal, the paper formulates the ATCO reactions prediction problem for CD&R, presents DL methods that can model ATCO timely reactions, and evaluates these methods on real-world data sets, showing their efficacy in prediction with very high accuracy.
    Parallel bandit architecture based on laser chaos for reinforcement learning. (arXiv:2205.09543v1 [cs.ET])
    Accelerating artificial intelligence by photonics is an active field of study aiming to exploit the unique properties of photons. Reinforcement learning is an important branch of machine learning, and photonic decision-making principles have been demonstrated with respect to the multi-armed bandit problems. However, reinforcement learning could involve a massive number of states, unlike previously demonstrated bandit problems where the number of states is only one. Q-learning is a well-known approach in reinforcement learning that can deal with many states. The architecture of Q-learning, however, does not fit well photonic implementations due to its separation of update rule and the action selection. In this study, we organize a new architecture for multi-state reinforcement learning as a parallel array of bandit problems in order to benefit from photonic decision-makers, which we call parallel bandit architecture for reinforcement learning or PBRL in short. Taking a cart-pole balancing problem as an instance, we demonstrate that PBRL adapts to the environment in fewer time steps than Q-learning. Furthermore, PBRL yields faster adaptation when operated with a chaotic laser time series than the case with uniformly distributed pseudorandom numbers where the autocorrelation inherent in the laser chaos provides a positive effect. We also find that the variety of states that the system undergoes during the learning phase exhibits completely different properties between PBRL and Q-learning. The insights obtained through the present study are also beneficial for existing computing platforms, not just photonic realizations, in accelerating performances by the PBRL algorithms and correlated random sequences.
    Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters. (arXiv:2205.09470v1 [cs.LG])
    The ever-growing model size and scale of compute have attracted increasing interest in training deep learning models over multiple nodes. However, when it comes to training on cloud clusters, especially across remote clusters, huge challenges are faced. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters, the connections between which are low-bandwidth wide area networks (WANs). We took natural language processing (NLP) as an example to show how Nebula-I works in different training phases that include: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which run through the most popular paradigm of recent deep learning. To balance the accuracy and communication efficiency, in Nebula-I, parameter-efficient training strategies, hybrid parallel computing methods and adaptive communication acceleration techniques are jointly applied. Meanwhile, security strategies are employed to guarantee the safety, reliability and privacy in intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g. GPU and NPU. Experiments demonstrate that the proposed framework could substantially improve the training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimal development, and the utility of existing large pre-trained models could be further promoted. We also introduced new state-of-the-art results on cross-lingual natural language inference tasks, which are generated based upon a novel learning framework and Nebula-I.
    The Impact of COVID-19 Pandemic on LGBTQ Online Communities. (arXiv:2205.09511v1 [cs.SI])
    The COVID-19 pandemic has disproportionately impacted the lives of minorities, such as members of the LGBTQ community (lesbian, gay, bisexual, transgender, and queer) due to pre-existing social disadvantages and health disparities. Although extensive research has been carried out on the impact of the COVID-19 pandemic on different aspects of the general population's lives, few studies are focused on the LGBTQ population. In this paper, we identify a group of Twitter users who self-disclose as belonging to the LGBTQ community. We develop and evaluate two sets of machine learning classifiers using a pre-pandemic and a during-pandemic dataset to identify Twitter posts exhibiting minority stress, which is a unique pressure faced by the members of the LGBTQ population due to their sexual and gender identities. For this task, we collect a set of 20,593,823 posts by 7,241 self-disclosed LGBTQ users and annotate a randomly selected subset of 2,800 posts. We demonstrate that our best pre-pandemic and during-pandemic models show strong and stable performance for detecting posts that contain minority stress. We investigate the linguistic differences in minority stress posts across pre- and during-pandemic periods. We find that anger words are strongly associated with minority stress during the COVID-19 pandemic. We explore the impact of the pandemic on the emotional states of the LGBTQ population by conducting controlled comparisons with the general population. We adopt propensity score-based matching to perform a causal analysis. The results show that the LGBTQ population have a greater increase in the usage of cognitive words and worsened observable attribute in the usage of positive emotion words than the group of the general population with similar pre-pandemic behavioral attributes.
    Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes. (arXiv:2205.09391v1 [cs.CL])
    Data augmentation methods for Natural Language Processing tasks have been explored in recent years; however, they are limited and it is hard to capture diversity at the sentence level. Besides, it is not always possible to perform data augmentation on supervised tasks. To address those problems, we propose a neural data augmentation method, which is a combination of a Conditional Variational Autoencoder and an encoder-decoder Transformer model. While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language with its class condition. Following the developments in the past years on pre-trained language models, we train and evaluate our models on several benchmarks to strengthen the downstream tasks. We compare our method with 3 different augmentation techniques. The presented results show that our model increases the performance of current models compared to other data augmentation techniques with a small amount of computation power.
    Action Conditioned Tactile Prediction: a case study on slip prediction. (arXiv:2205.09430v1 [cs.RO])
    Tactile predictive models can be useful across several robotic manipulation tasks, e.g. robotic pushing, robotic grasping, slip avoidance, and in-hand manipulation. However, available tactile prediction models are mostly studied for image-based tactile sensors and there is no comparison study indicating the best performing models. In this paper, we present two novel data-driven action-conditioned models for predicting tactile signals during real-world physical robot interaction tasks: (1) action-conditioned tactile prediction and (2) action-conditioned tactile-video prediction models. We use a magnetic-based tactile sensor that is challenging to analyse and test state-of-the-art predictive models and the only existing bespoke tactile prediction model. We compare the performance of these models with those of our proposed models. We perform the comparison study using our novel tactile-enabled dataset containing 51,000 tactile frames of a real-world robotic manipulation task with 11 flat-surfaced household objects. Our experimental results demonstrate the superiority of our proposed tactile prediction models in terms of qualitative, quantitative and slip prediction scores.
    A Boosting Algorithm for Positive-Unlabeled Learning. (arXiv:2205.09485v1 [cs.LG])
    Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many PU methods based on linear models and neural networks have been proposed; however, there is still a lack of study on how theoretically sound boosting-style algorithms could work with P and U data. Considering that in some scenarios neural networks cannot perform as well as boosting algorithms even with fully-supervised data, we propose a novel boosting algorithm for PU learning: Ada-PU, which we compare against neural networks. Ada-PU follows the general procedure of AdaBoost while two different distributions of P data are maintained and updated. After a weak classifier is learned on the newly updated distribution, the corresponding combining weight for the final ensemble is estimated using only PU data. We demonstrate that with a smaller set of base classifiers, the proposed method is guaranteed to keep the theoretical properties of boosting algorithms. In experiments, we show that Ada-PU outperforms neural networks on benchmark PU datasets. We also study a real-world dataset UNSW-NB15 in cyber security and demonstrate that Ada-PU has superior performance for malicious activity detection.
    Predictive Maintenance using Machine Learning. (arXiv:2205.09402v1 [cs.LG])
    Predictive maintenance (PdM) is a concept that is implemented to effectively manage maintenance plans of assets by predicting their failures with data-driven techniques. In these scenarios, data is collected over a certain period of time to monitor the state of equipment. The objective is to find some correlations and patterns that can help predict and ultimately prevent failures. Equipment in the manufacturing industry is often utilized without a planned maintenance approach. Such practice frequently results in unexpected downtime, owing to certain unexpected failures. In scheduled maintenance, the condition of the manufacturing equipment is checked after a fixed time interval and if any fault occurs, the component is replaced to avoid unexpected equipment stoppages. On the flip side, this leads to an increase in the time for which the machine is non-functioning and in the cost of carrying out the maintenance. The emergence of Industry 4.0 and smart systems has led to increasing emphasis on predictive maintenance (PdM) strategies that can reduce the cost of downtime and increase the availability (utilization rate) of manufacturing equipment. PdM also has the potential to bring about new sustainable practices in manufacturing by fully utilizing the useful lives of components.
    Simple Regularisation for Uncertainty-Aware Knowledge Distillation. (arXiv:2205.09526v1 [cs.LG])
    Considering uncertainty estimation of modern neural networks (NNs) is one of the most important steps towards deploying machine learning systems to meaningful real-world applications such as in medicine, finance or autonomous systems. At the moment, ensembles of different NNs constitute the state-of-the-art in both accuracy and uncertainty estimation in different tasks. However, ensembles of NNs are impractical under real-world constraints, since their computation and memory consumption scale linearly with the size of the ensemble, which increases their latency and deployment cost. In this work, we examine a simple regularisation approach for distribution-free knowledge distillation of ensembles of machine learning models into a single NN. The aim of the regularisation is to preserve the diversity, accuracy and uncertainty estimation characteristics of the original ensemble without any intricacies, such as fine-tuning. We demonstrate the generality of the approach on combinations of toy data, SVHN/CIFAR-10, simple to complex NN architectures and different tasks.
    Threshold Designer Adaptation: Improved Adaptation for Designers in Co-creative Systems. (arXiv:2205.09269v1 [cs.LG])
    To best assist human designers with different styles, Machine Learning (ML) systems need to be able to adapt to them. However, there has been relatively little prior work on how and when to best adapt an ML system to a co-designer. In this paper we present threshold designer adaptation: a novel method for adapting a creative ML model to an individual designer. We evaluate our approach with a human subject study using a co-creative rhythm game design tool. We find that designers prefer our proposed method and produce higher quality content in comparison to an existing baseline.
    CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies. (arXiv:2205.09433v1 [cs.LG])
    Reinforcement Learning has drawn huge interest as a tool for solving optimal control problems. Solving a given problem (task or environment) involves converging towards an optimal policy. However, there might exist multiple optimal policies that can dramatically differ in their behaviour; for example, some may be faster than the others but at the expense of greater risk. We consider and study a distribution of optimal policies. We design a curiosity-augmented Metropolis algorithm (CAMEO), such that we can sample optimal policies, and such that these policies effectively adopt diverse behaviours, since this implies greater coverage of the different possible optimal policies. In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems, even in the challenging case of environments that provide sparse rewards. We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability, and that this represents a first step towards learning the distribution of optimal policies itself.
    An Approach to Investigate Public Opinion, Views, and Perspectives Towards Exoskeleton Technology. (arXiv:2205.09151v1 [cs.HC])
    Over the last decade, exoskeletons have had an extensive impact on different disciplines and application domains such as assisted living, military, healthcare, firefighting, and industries, on account of their diverse and dynamic functionalities to augment human abilities, stamina, potential, and performance in a multitude of ways. In view of this wide-scale applicability and use-cases of exoskeletons, it is crucial to investigate and analyze the public opinion, views, and perspectives towards exoskeletons which would help to interpret the effectiveness of the underlying human-robot, human-machine, and human-technology interactions. The Internet of Everything era of today's living, characterized by people spending more time on the internet than ever before, holds the potential for the investigation of the same by mining and analyzing relevant web behavior, specifically from social media, that can be interpreted to understand public opinion, views, and perspectives towards a topic or set of topics. Therefore, this paper aims to address this research challenge related to exoskeletons by utilizing the potential of web behavior-based Big Data mining in the modern-day Internet of Everything era. As Twitter is one of the most popular social media platforms on a global scale - characterized by both the number of users and the amount of time spent by its users on the platform - this work focused on investigating web behavior on Twitter to interpret the public opinion, views, and perspectives towards exoskeleton technology. A total of approximately 20,000 tweets related to exoskeletons were used to evaluate the effectiveness of the proposed approach. The results presented and discussed uphold the efficacy of the proposed approach to interpret and analyze the public opinion, views, and perspectives towards exoskeletons from the associated tweets.
    COVID-19 Monitoring System using Social Distancing and Face Mask Detection on Surveillance video datasets. (arXiv:2110.03905v2 [cs.CV] UPDATED)
    In the current times, the fear and danger of the COVID-19 virus still loom large. Manual monitoring of social distancing norms is impractical with a large population moving about and with insufficient task force and resources to administer them. There is a need for a lightweight, robust and 24x7 video-monitoring system that automates this process. This paper proposes a comprehensive and effective solution to perform person detection, social distancing violation detection, face detection and face mask classification using object detection, clustering and a Convolutional Neural Network (CNN) based binary classifier. For this, YOLOv3, Density-based spatial clustering of applications with noise (DBSCAN), Dual Shot Face Detector (DSFD) and a MobileNetV2 based binary classifier have been employed on surveillance video datasets. This paper also provides a comparative study of different face detection and face mask classification models. Finally, a video dataset labelling method is proposed along with the labelled video dataset to compensate for the lack of dataset in the community and is used for evaluation of the system. The system performance is evaluated in terms of accuracy, F1 score as well as the prediction time, which has to be low for practical applicability. The system performs with an accuracy of 91.2% and F1 score of 90.79% on the labelled video dataset and has an average prediction time of 7.12 seconds for 78 frames of a video.
    Turbulent field fluctuations in gyrokinetic and fluid plasmas. (arXiv:2107.09744v2 [physics.plasm-ph] CROSS LISTED)
    A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is the validation in accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades simulated boundary plasmas in experiment, but significant questions exist regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first ever direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.
    Federated Learning: Applications, Challenges and Future Scopes. (arXiv:2205.09513v1 [cs.LG])
    Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. It is shown here that FL solves the preceding issues with a shared global deep learning (DL) model via a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection, communication cost, system heterogeneity, and unreliable model upload, followed by future research directions.
    Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking. (arXiv:2205.09638v1 [cs.IR])
    In information retrieval (IR), candidate set pruning has been commonly used to speed up two-stage relevance ranking. However, such an approach lacks accurate error control and often trades accuracy off against computational efficiency in an empirical fashion, lacking theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to be controlled under a user-specified threshold with high probability. Both in-domain and out-of-domain experiments show that our method successfully prunes the first-stage retrieved candidate sets to improve the second-stage reranking speed while satisfying the pre-specified accuracy constraints in both settings. For example, on MS MARCO Passage v1, our method yields an average candidate set size of 27 out of 1,000 which increases the reranking speed by about 37 times, while the MRR@10 is greater than a pre-specified value of 0.38 with about 90% empirical coverage and the empirical baselines fail to provide such guarantee. Code and data are available at: https://github.com/alexlimh/CEC-Ranking.
    On the Convergence of Policy in Unregularized Policy Mirror Descent. (arXiv:2205.08176v2 [math.OC] UPDATED)
    In this short note, we give the convergence analysis of the policy in the recent famous policy mirror descent (PMD). We mainly consider the unregularized setting following [11] with generalized Bregman divergence. The difference is that we directly give the convergence rates of the policy under generalized Bregman divergence. Our results are inspired by the convergence of the value function in previous works and are an extension study of policy mirror descent. Though some results have already appeared in previous work, we further discover that a large body of Bregman divergences could give finite-step convergence to an optimal policy, such as the classical Euclidean distance.
    Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization. (arXiv:2205.09709v1 [eess.AS])
    The majority of speech signals across different scenarios are never available with well-defined audio segments containing only a single speaker. A typical conversation between two speakers consists of segments where their voices overlap, interrupt each other or halt their speech in between multiple sentences. Recent advancements in diarization technology leverage neural network-based approaches to improve multiple subsystems of a speaker diarization system, comprising extracting segment-wise embedding features and detecting changes in the speaker during conversation. However, to identify speakers through clustering, models depend on methodologies like PLDA to generate a similarity measure between two extracted segments from a given conversational audio. Since these algorithms ignore the temporal structure of conversations, they tend to achieve a higher Diarization Error Rate (DER), thus leading to misdetections both in terms of speaker and change identification. Therefore, to compare the similarity of two speech segments both independently and sequentially, we propose a Bi-directional Long Short-term Memory network for estimating the elements present in the similarity matrix. Once the similarity matrix is generated, Agglomerative Hierarchical Clustering (AHC) is applied to further identify speaker segments based on thresholding. To evaluate the performance, the Diarization Error Rate (DER%) metric is used. The proposed model achieves a low DER of 34.80% on a test set of audio samples derived from ICSI Meeting Corpus as compared to the traditional PLDA based similarity measurement mechanism, which achieved a DER of 39.90%.
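    The clustering stage described above, AHC with thresholding on a segment similarity matrix, can be illustrated with a minimal sketch. It is not the authors' implementation: the average-linkage rule, the function name `ahc_threshold`, and the hand-written similarity matrix are assumptions made for illustration, and in the paper the matrix entries would come from the Bi-LSTM scorer.

```python
def ahc_threshold(sim, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Repeatedly merges the most similar pair of clusters and stops once no
    pair has an average similarity of at least `threshold`.
    Returns one cluster label per segment.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Mean pairwise similarity between members of clusters a and b.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -float("inf"), None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = avg_sim(clusters[p], clusters[q])
                if s > best:
                    best, pair = s, (p, q)
        if best < threshold:
            break  # remaining clusters are treated as different speakers
        p, q = pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(sim)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical similarity scores for four speech segments.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels = ahc_threshold(sim, threshold=0.5)
print(labels)
```

    The threshold plays the role of the stopping criterion: a higher value yields more, smaller speaker clusters, and a lower value merges more aggressively.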
    Can language models learn from explanations in context?. (arXiv:2204.02329v2 [cs.CL] UPDATED)
    Large language models can perform new tasks by adapting to a few in-context examples. For humans, rapid learning from examples can benefit from explanations that connect examples to task principles. We therefore investigate whether explanations of few-shot examples can allow language models to adapt more effectively. We annotate a set of 40 challenging tasks from BIG-Bench with explanations of answers to a small subset of questions, as well as a variety of matched control explanations. We evaluate the effects of various zero-shot and few-shot prompts that include different types of explanations, instructions, and controls on the performance of a range of large language models. We analyze these results using statistical multilevel modeling techniques that account for the nested dependencies among conditions, tasks, prompts, and models. We find that explanations of examples can improve performance. Adding untuned explanations to a few-shot prompt offers a modest improvement in performance; about 1/3 the effect size of adding few-shot examples, but twice the effect size of task instructions. We then show that explanations tuned for performance on a small validation set offer substantially larger benefits; building a prompt by selecting examples and explanations together substantially improves performance over selecting examples alone. Hand-tuning explanations can substantially improve performance on challenging tasks. Furthermore, even untuned explanations outperform carefully matched controls, suggesting that the benefits are due to the link between an example and its explanation, rather than lower-level features of the language used. However, only large models can benefit from explanations. In summary, explanations can support the in-context learning abilities of large language models on challenging tasks.
    Diagonal State Spaces are as Effective as Structured State Spaces. (arXiv:2203.14343v3 [cs.LG] UPDATED)
    Modeling long range dependencies in sequential data is a fundamental step towards attaining human-level performance in many modalities such as text, vision, audio and video. While attention-based models are a popular and effective choice in modeling short-range interactions, their performance on tasks requiring long range reasoning has been largely inadequate. In an exciting result, Gu et al. (ICLR 2022) proposed the $\textit{Structured State Space}$ (S4) architecture delivering large gains over state-of-the-art models on several long-range tasks across various modalities. The core proposition of S4 is the parameterization of state matrices via a diagonal plus low rank structure, allowing efficient computation. In this work, we show that one can match the performance of S4 even without the low rank correction and thus assuming the state matrices to be diagonal. Our $\textit{Diagonal State Space}$ (DSS) model matches the performance of S4 on Long Range Arena tasks, speech classification on Speech Commands dataset, while being conceptually simpler and straightforward to implement.
    TourBERT: A pretrained language model for the tourism industry. (arXiv:2201.07449v3 [cs.CL] UPDATED)
    The Bidirectional Encoder Representations from Transformers (BERT) is currently one of the most important and state-of-the-art models for natural language. However, it has also been shown that for domain-specific tasks it is helpful to pretrain BERT on a domain-specific corpus. In this paper, we present TourBERT, a pretrained language model for tourism. We describe how TourBERT was developed and evaluated. The evaluations show that TourBERT outperforms BERT in all tourism-specific tasks.
    Generalization Analysis of Message Passing Neural Networks on Large Random Graphs. (arXiv:2202.00645v4 [cs.LG] UPDATED)
    Message passing neural networks (MPNN) have seen a steep rise in popularity since their introduction as generalizations of convolutional neural networks to graph-structured data, and are now considered state-of-the-art tools for solving a large variety of graph-focused problems. We study the generalization error of MPNNs in graph classification and regression. We assume that graphs of different classes are sampled from different random graph models. We show that, when training a MPNN on a dataset sampled from such a distribution, the generalization gap increases in the complexity of the MPNN, and decreases, not only with respect to the number of training samples, but also with the average number of nodes in the graphs. This shows how a MPNN with high complexity can generalize from a small dataset of graphs, as long as the graphs are large. The generalization bound is derived from a uniform convergence result, that shows that any MPNN, applied on a graph, approximates the MPNN applied on the geometric model that the graph discretizes.
    Learning Graph Structure from Convolutional Mixtures. (arXiv:2205.09575v1 [cs.LG])
    Machine learning frameworks such as graph neural networks typically rely on a given, fixed graph to exploit relational inductive biases and thus effectively learn from network data. However, when said graphs are (partially) unobserved, noisy, or dynamic, the problem of inferring graph structure from data becomes relevant. In this paper, we postulate a graph convolutional relationship between the observed and latent graphs, and formulate the graph learning task as a network inverse (deconvolution) problem. In lieu of eigendecomposition-based spectral methods or iterative optimization solutions, we unroll and truncate proximal gradient iterations to arrive at a parameterized neural network architecture that we call a Graph Deconvolution Network (GDN). GDNs can learn a distribution of graphs in a supervised fashion, perform link prediction or edge-weight regression tasks by adapting the loss function, and they are inherently inductive. We corroborate GDN's superior graph recovery performance and its generalization to larger graphs using synthetic data in supervised settings. Furthermore, we demonstrate the robustness and representation power of GDNs on real world neuroimaging and social network datasets.
    Cracking White-box DNN Watermarks via Invariant Neuron Transforms. (arXiv:2205.00199v2 [cs.CR] UPDATED)
    Recently, how to protect the Intellectual Property (IP) of deep neural networks (DNN) has become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activation) of the target model. Sacrificing less functionality and involving more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts and applications in the industry. In this paper, we present the first effective removal attack which cracks almost all the existing white-box watermarking schemes with provably no performance overhead and no required prior knowledge. By analyzing these IP protection mechanisms at the granularity of neurons, we for the first time discover their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On $9$ state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to be almost random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge on the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.
    Understanding Gradient Descent on Edge of Stability in Deep Learning. (arXiv:2205.09745v1 [cs.LG])
    Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an {\em Edge of Stability (EoS)} phase when learning rate (LR) and sharpness (\emph{i.e.}, the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR and loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias either relying on infinitesimal updates or noise in gradient. Formally, for any smooth function $L$ with certain regularity condition, this effect is demonstrated for (1) {\em Normalized GD}, i.e., GD with a varying LR $ \eta_t =\frac{ \eta }{ || \nabla L(x(t)) || } $ and loss $L$; (2) GD with constant LR and loss $\sqrt{L}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{\max}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.
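    The Normalized GD update quoted in the abstract, with step size $\eta_t = \eta / || \nabla L(x(t)) ||$, can be written out as a short sketch. The quadratic loss, the function names, and the starting point below are hypothetical choices for illustration only; they are not taken from the paper and do not reproduce its analysis.

```python
import math


def grad(x):
    # Gradient of a hypothetical loss L(x) = 0.5 * (x[0]**2 + 10 * x[1]**2).
    return [x[0], 10.0 * x[1]]


def normalized_gd(x0, eta, steps):
    """Run GD with the varying learning rate eta_t = eta / ||grad L(x_t)||."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break  # exactly at a stationary point; the update is undefined
        eta_t = eta / norm
        x = [xi - eta_t * gi for xi, gi in zip(x, g)]
    return x


x1 = normalized_gd([3.0, 4.0], eta=0.1, steps=1)
step_length = math.hypot(x1[0] - 3.0, x1[1] - 4.0)
print(x1, step_length)
```

    Because the gradient is divided by its own norm, every update moves the iterate by a distance of exactly $\eta$ regardless of how small the gradient is, so the iterate keeps moving by a fixed amount each step rather than taking ever smaller steps as it approaches a minimizer.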
    Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. (arXiv:2205.09546v1 [stat.ML])
    In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for any regularization terms. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as Autoencoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in term of log-likelihood, sample quality and denoising performance. In a broad sense, the main ambition of this work is to close the gap between the normalizing flow and autoencoder literature under the common framework of invertibility and exact maximum likelihood.
    A Simple Yet Effective SVD-GCN for Directed Graphs. (arXiv:2205.09335v1 [cs.LG])
    In this paper, we propose a simple yet effective graph neural network for directed graphs (digraph) based on the classic Singular Value Decomposition (SVD), named SVD-GCN. The new graph neural network is built upon the graph SVD-framelet to better decompose graph signals on the SVD ``frequency'' bands. Further, the new framelet SVD-GCN is also scaled up for larger scale graphs using Chebyshev polynomial approximation. Through empirical experiments conducted on several node classification datasets, we have found that SVD-GCN has remarkable improvements in a variety of graph node learning tasks and it outperforms GCN and many other state-of-the-art graph neural networks for digraphs. Moreover, we empirically demonstrate that the SVD-GCN has great denoising capability and robustness to high level graph data attacks. The theoretical and experimental results prove that the SVD-GCN is effective on a variety of graph datasets, while maintaining stable and even better performance than the state-of-the-arts.
    Bayesian Negative Sampling for Recommendation. (arXiv:2204.06520v2 [cs.IR] UPDATED)
    How to sample high quality negative instances from unlabeled data, i.e., negative sampling, is important for training implicit collaborative filtering and contrastive learning models. Although previous studies have proposed some approaches to sample informative instances, little has been done to discriminate false negatives from true negatives for unbiased negative sampling. On the basis of our order relation analysis of negatives' scores, we first derive the class conditional density of true negatives and that of false negatives. We next design a Bayesian classifier for negative classification, from which we define a model-agnostic posterior probability estimate of an instance being true negative as a quantitative negative signal measure. We also propose a Bayesian optimal sampling rule to sample high-quality negatives. The proposed Bayesian Negative Sampling (BNS) algorithm has a linear time complexity. Experimental studies validate the superiority of BNS over the peers in terms of better sampling quality and better recommendation performance.
    PSI Draft Specification. (arXiv:2205.09488v1 [cs.SE])
    This document presents the draft specification for delivering machine learning services over HTTP, developed as part of the Protocols and Structures for Inference project, which concluded in 2013. It presents the motivation for providing machine learning as a service, followed by a description of the essential and optional components of such a service.
    Improving VAE based molecular representations for compound property prediction. (arXiv:2201.04929v3 [cs.LG] UPDATED)
    Collecting labeled data for many important tasks in chemoinformatics is time consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules using large scale unlabeled molecular datasets and transfer the knowledge to solve the more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform the transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors in the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, correlation between the descriptors and the target properties, sizes of the datasets etc. Finally, we show the relation between the performance of property prediction models and the distance between property prediction dataset and the larger unlabeled dataset in the representation space.
    Morse-STF: Improved Protocols for Privacy-Preserving Machine Learning. (arXiv:2109.11726v2 [cs.CR] UPDATED)
    Secure multi-party computation enables multiple mutually distrusting parties to perform computations on data without revealing the data itself, and has become one of the core technologies behind privacy-preserving machine learning. In this work, we present several improved privacy-preserving protocols for both linear and non-linear layers in machine learning. For linear layers, we present an extended Beaver triple protocol for bilinear maps that significantly reduces the communication of convolution layers. For non-linear layers, we introduce novel protocols for computing the sigmoid and softmax functions. Both functions are essential building blocks for machine learning training of classification tasks. Our protocols are both more scalable and robust than prior constructions, and improve runtime performance by 3-17x. Finally, we introduce Morse-STF, an end-to-end privacy-preserving system for machine learning training that leverages all these improved protocols. Our system achieves a 1.8x speedup on logistic regression and 3.9-4.9x speedup on convolutional neural networks compared to prior state-of-the-art systems.
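    For readers unfamiliar with Beaver triples, the following sketch shows the classic two-party scalar multiplication on additive secret shares, which is the building block the abstract says it extends to bilinear maps. It is not the Morse-STF protocol: the modulus `P`, the trusted dealer that generates the triple, and all function names are assumptions for illustration, and both parties are simulated in one process with a seeded, non-cryptographic random generator.

```python
import random

P = 2**61 - 1  # prime modulus for the share arithmetic (assumed for this sketch)


def share(v, rng):
    """Split v into two additive shares modulo P."""
    s0 = rng.randrange(P)
    return s0, (v - s0) % P


def reconstruct(s0, s1):
    return (s0 + s1) % P


def beaver_multiply(x_sh, y_sh, rng):
    """Return shares of x*y given shares of x and y, using one Beaver triple."""
    # A trusted dealer (assumption) samples a, b and shares a, b and c = a*b.
    a, b = rng.randrange(P), rng.randrange(P)
    a_sh, b_sh, c_sh = share(a, rng), share(b, rng), share(a * b % P, rng)

    # The parties open the masked values d = x - a and e = y - b.
    d = reconstruct((x_sh[0] - a_sh[0]) % P, (x_sh[1] - a_sh[1]) % P)
    e = reconstruct((y_sh[0] - b_sh[0]) % P, (y_sh[1] - b_sh[1]) % P)

    # x*y = c + d*b + e*a + d*e; only party 0 adds the public term d*e.
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % P
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return z0, z1


rng = random.Random(0)  # seeded for reproducibility; not secure randomness
x_sh, y_sh = share(6, rng), share(7, rng)
product = reconstruct(*beaver_multiply(x_sh, y_sh, rng))
print(product)
```

    Only the masked values `d` and `e` are ever opened, and because `a` and `b` are uniformly random they reveal nothing about `x` and `y`. A real deployment would replace the dealer and the seeded generator with a secure triple-generation procedure.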
    Differential Privacy: What is all the noise about?. (arXiv:2205.09453v1 [cs.CR])
    Differential Privacy (DP) is a formal definition of privacy that provides rigorous guarantees against risks of privacy breaches during data processing. It makes no assumptions about the knowledge or computational power of adversaries, and provides an interpretable, quantifiable and composable formalism. DP has been actively researched during the last 15 years, but it is still hard to master for many Machine Learning (ML) practitioners. This paper aims to provide an overview of the most important ideas, concepts and uses of DP in ML, with special focus on its intersection with Federated Learning (FL).
    An Invariant Matching Property for Distribution Generalization under Intervened Response. (arXiv:2205.09162v1 [stat.ME])
    The task of distribution generalization concerns making reliable predictions of a response in unseen environments. Structural causal models have been shown to be useful for modeling distribution changes through intervention. Motivated by the fundamental invariance principle, it is often assumed that the conditional distribution of the response given its predictors remains the same across environments. However, this assumption might be violated in practical settings when the response is intervened. In this work, we investigate a class of models with an intervened response. We identify a novel form of invariance by incorporating the estimates of certain features as additional predictors. Effectively, we show this invariance is equivalent to having a deterministic linear matching that makes the generalization possible. We provide an explicit characterization of the linear matching and present our simulation results under various intervention settings.
    Cross-lingual Transfer of Monolingual Models. (arXiv:2109.07348v2 [cs.CL] UPDATED)
    Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.
    Neural ODE Control for Trajectory Approximation of Continuity Equation. (arXiv:2205.09241v1 [math.OC])
    We consider the controllability problem for the continuity equation, corresponding to neural ordinary differential equations (ODEs), which describes how a probability measure is pushed forward by the flow. We show that the controlled continuity equation has very strong controllability properties. Particularly, a given solution of the continuity equation corresponding to a bounded Lipschitz vector field defines a trajectory on the set of probability measures. For this trajectory, we show that there exist piecewise constant training weights for a neural ODE such that the solution of the continuity equation corresponding to the neural ODE is arbitrarily close to it. As a corollary to this result, we establish that the continuity equation of the neural ODE is approximately controllable on the set of compactly supported probability measures that are absolutely continuous with respect to the Lebesgue measure.
    Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks. (arXiv:2205.09117v1 [cs.LG])
    Experience replay plays a crucial role in improving the sample efficiency of deep reinforcement learning agents. Recent advances in experience replay propose using Mixup (Zhang et al., 2018) to further improve sample efficiency via synthetic sample generation. We build upon this technique with Neighborhood Mixup Experience Replay (NMER), a geometrically-grounded replay buffer that interpolates transitions with their closest neighbors in state-action space. NMER preserves a locally linear approximation of the transition manifold by only applying Mixup between transitions with vicinal state-action features. Under NMER, a given transition's set of state action neighbors is dynamic and episode agnostic, in turn encouraging greater policy generalizability via inter-episode interpolation. We combine our approach with recent off-policy deep reinforcement learning algorithms and evaluate on continuous control environments. We observe that NMER improves sample efficiency by an average 94% (TD3) and 29% (SAC) over baseline replay buffers, enabling agents to effectively recombine previous experiences and learn from limited data.
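    The core sampling step described above can be sketched as follows. This is a simplified illustration only: it uses a brute-force Euclidean nearest-neighbor search and a flat list per transition, and the names `nearest_neighbor`, `sample_mixed`, and `sa_dim` are hypothetical rather than taken from the paper.

```python
import math
import random


def nearest_neighbor(buffer, idx, sa_dim):
    """Index of the transition closest to buffer[idx] in state-action space.

    Each transition is a flat list whose first `sa_dim` entries are the
    concatenated state and action features (an assumed layout).
    """
    query = buffer[idx][:sa_dim]
    best, best_dist = None, math.inf
    for j, transition in enumerate(buffer):
        if j == idx:
            continue
        dist = math.dist(query, transition[:sa_dim])
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def sample_mixed(buffer, sa_dim, alpha=1.0, rng=random):
    """Draw a transition and Mixup-interpolate it with its nearest neighbor."""
    i = rng.randrange(len(buffer))
    j = nearest_neighbor(buffer, i, sa_dim)
    lam = rng.betavariate(alpha, alpha)  # Mixup coefficient in [0, 1]
    return [lam * a + (1.0 - lam) * b for a, b in zip(buffer[i], buffer[j])]


# Toy buffer: [state, action, reward, next_state] with scalar entries.
replay = [[0.0, 0.0, 1.0, 0.0], [0.1, 0.0, 2.0, 0.1], [5.0, 5.0, 10.0, 5.0]]
print(sample_mixed(replay, sa_dim=2))
```

    Because neighbors are looked up in the whole buffer, the interpolation partner can come from a different episode, which is the inter-episode behavior the abstract refers to.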
    Relational representation learning with spike trains. (arXiv:2205.09140v1 [cs.NE])
    Relational representation learning has lately received an increase in interest due to its flexibility in modeling a variety of systems like interacting particles, materials and industrial projects for, e.g., the design of spacecraft. Prominent methods for dealing with relational data are knowledge graph embedding algorithms, where entities and relations of a knowledge graph are mapped to a low-dimensional vector space while preserving its semantic structure. Recently, a graph embedding method has been proposed that maps graph elements to the temporal domain of spiking neural networks. However, it relies on encoding graph elements through populations of neurons that only spike once. Here, we present a model that allows us to learn spike train-based embeddings of knowledge graphs, requiring only one neuron per graph element by fully utilizing the temporal domain of spike patterns. This coding scheme can be implemented with arbitrary spiking neuron models as long as gradients with respect to spike times can be calculated, which we demonstrate for the integrate-and-fire neuron model. In general, the presented results show how relational knowledge can be integrated into spike-based systems, opening up the possibility of merging event-based computing and relational data to build powerful and energy-efficient artificial intelligence applications and reasoning systems.
    Accelerated Training of Physics Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v1 [cs.LG])
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.
    Dataset Pruning: Reducing Training Data by Examining Generalization Influence. (arXiv:2205.09329v1 [cs.LG])
    The great success of deep learning heavily relies on increasingly larger training data, which comes at a price of huge computational and infrastructural costs. This raises crucial questions: Do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can one construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, which is superior to previous score-based sample selection methods.
    GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling. (arXiv:2205.09379v1 [cs.SE])
    GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are not labeled or inadequately so, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered $60\%$ of the most frequent ones for the ranking. GitRanking 1) uses active sampling to ensure a minimal number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes the finding and discovery of their projects more challenging for other users. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations ($\sim$ 15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
    PredictionNet: Real-Time Joint Probabilistic Traffic Prediction for Planning, Control, and Simulation. (arXiv:2109.11094v2 [cs.RO] UPDATED)
    Predicting the future motion of traffic agents is crucial for safe and efficient autonomous driving. To this end, we present PredictionNet, a deep neural network (DNN) that predicts the motion of all surrounding traffic agents together with the ego-vehicle's motion. All predictions are probabilistic and are represented in a simple top-down rasterization that allows an arbitrary number of agents. Conditioned on a multi-layer map with lane information, the network outputs future positions, velocities, and backtrace vectors jointly for all agents including the ego-vehicle in a single pass. Trajectories are then extracted from the output. The network can be used to simulate realistic traffic, and it produces competitive results on popular benchmarks. More importantly, it has been used to successfully control a real-world vehicle for hundreds of kilometers, by combining it with a motion planning/control subsystem. The network runs faster than real-time on an embedded GPU, and the system shows good generalization (across sensory modalities and locations) due to the choice of input representation. Furthermore, we demonstrate that by extending the DNN with reinforcement learning (RL), it can better handle rare or unsafe events like aggressive maneuvers and crashes.
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v4 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows one to express complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult to use for practitioners, as it often requires tedious case-by-case mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow us to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.
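    The principle can be illustrated on a scalar toy problem. The sketch below is not the authors' library or API: it uses central finite differences as a stand-in for autodiff, and the objective, the solver, and every name in it are assumptions chosen so the result can be checked by hand. For the optimality condition F(x, theta) = 0, the implicit function theorem gives dx*/dtheta = -(dF/dx)^{-1} dF/dtheta.

```python
def F(x, theta, a=3.0):
    """Optimality condition of f(x) = 0.5*(x - a)**2 + 0.5*theta*x**2.

    F is the gradient of f in x, so F(x_star, theta) = 0 at the minimizer.
    """
    return (x - a) + theta * x


def solve(theta, a=3.0, steps=200, lr=0.1):
    """Any solver works; plain gradient descent is used here."""
    x = 0.0
    for _ in range(steps):
        x -= lr * F(x, theta, a)
    return x


def implicit_grad(x_star, theta, a=3.0, h=1e-6):
    """d x_star / d theta via the implicit function theorem.

    Central finite differences stand in for autodiff of F (an assumption
    made to keep the sketch dependency-free).
    """
    dF_dx = (F(x_star + h, theta, a) - F(x_star - h, theta, a)) / (2 * h)
    dF_dtheta = (F(x_star, theta + h, a) - F(x_star, theta - h, a)) / (2 * h)
    return -dF_dtheta / dF_dx


theta = 0.5
x_star = solve(theta)
print(x_star, implicit_grad(x_star, theta))
```

    Here the closed form is x* = a / (1 + theta), so the derivative is -a / (1 + theta)^2. The solver is never differentiated: only F is, which is what decouples the optimality condition from the differentiation mechanism.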
    SEMI: Self-supervised Exploration via Multisensory Incongruity. (arXiv:2009.12494v2 [cs.LG] UPDATED)
    Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy by incentivizing the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and it improves sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v1 [cs.LG])
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with respect to a group $G$, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of "$G$-invariant neural architecture design": What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or "shallow" neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$. The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons. The classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. Based on a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. We draw the network morphisms between the enumerated architectures that can be leveraged during neural architecture search (NAS). Finally, we prove that architectures corresponding to inequivalent cohomology classes in a given cohomology ring coincide in function space only when their weight matrices are zero, and we discuss the implications of this in the context of NAS.
    Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime. (arXiv:1806.04884v3 [stat.ML] UPDATED)
    In this paper, we theoretically prove that the deep ReLU neural networks do not lie in spurious local minima in the loss landscape under the Neural Tangent Kernel (NTK) regime, that is, in the gradient descent training dynamics of the deep ReLU neural networks whose parameters are initialized by a normal distribution in the limit as the widths of the hidden layers tend to infinity.
    Hybrid Intelligent Testing in Simulation-Based Verification. (arXiv:2205.09552v1 [cs.AR])
    Efficient and effective testing for simulation-based hardware verification is challenging. Using constrained random test generation, several millions of tests may be required to achieve coverage goals. The vast majority of tests do not contribute to coverage progress, yet they consume verification resources. In this paper, we propose a hybrid intelligent testing approach combining two methods that have previously been treated separately, namely Coverage-Directed Test Selection and Novelty-Driven Verification. Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests. Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli, thereby reducing the number of simulations and increasing testing efficiency. We discuss the strengths and limitations of each method, and we show how our approach addresses each method's limitations, leading to hardware testing that is both efficient and effective.
    Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. (arXiv:2205.09653v1 [stat.ML])
    We analyze feature learning in infinite-width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength are preserved across different widths on a CIFAR classification task.
    What Is Fairness? Implications For FairML. (arXiv:2205.09622v1 [cs.LG])
    A growing body of literature in fairness-aware ML (fairML) aspires to mitigate machine learning (ML)-related unfairness in automated decision making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those measures. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the evaluation of ML models in ADM systems. We derive that fairness problems can already arise without the presence of protected attributes, pointing out that fairness and predictive performance are not irreconcilable counterparts, but rather that the latter is necessary to achieve the former. Moreover, we argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes. Eventually, we achieve greater linguistic clarity for the discussion of fairML by clearly assigning responsibilities to stakeholders inside and outside ML.
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v1 [cs.LG])
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
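    As background for the construction above, the sketch below implements an ordinary (not generalized) Fenchel-Young loss with the usual bilinear pairing, L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>. Taking Omega to be the negative Shannon entropy on the simplex is an assumption made for illustration; its conjugate is log-sum-exp, and the loss then reduces to the familiar softmax cross-entropy up to a term that depends only on y.

```python
import math


def logsumexp(theta):
    """Conjugate of negative entropy restricted to the simplex."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def neg_entropy(p):
    """Omega(p) = sum_i p_i log p_i, with 0 log 0 taken as 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)


def fy_loss(theta, y):
    """Fenchel-Young loss: Omega*(theta) + Omega(y) - <theta, y>."""
    inner = sum(t * yi for t, yi in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - inner


def fy_grad(theta, y):
    """Gradient in theta: softmax(theta) - y, with no argmax differentiation."""
    lse = logsumexp(theta)
    return [math.exp(t - lse) - yi for t, yi in zip(theta, y)]


scores = [2.0, 0.5, -1.0]
target = [1.0, 0.0, 0.0]
print(fy_loss(scores, target), fy_grad(scores, target))
```

    The loss is non-negative and vanishes exactly when the target equals softmax(theta). The generalized losses in the paper replace the inner product with a general energy function, which this sketch does not attempt.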
    Towards a Theory of Faithfulness: Faithful Explanations of Differentiable Classifiers over Continuous Data. (arXiv:2205.09620v1 [cs.LG])
    There is broad agreement in the literature that explanation methods should be faithful to the model that they explain, but faithfulness remains a rather vague term. We revisit faithfulness in the context of continuous data and propose two formal definitions of faithfulness for feature attribution methods. Qualitative faithfulness demands that scores reflect the true qualitative effect (positive vs. negative) of the feature on the model, and quantitative faithfulness demands that the magnitude of scores reflects the true quantitative effect. We discuss under which conditions and to what extent (local vs. global) these requirements can be satisfied. As an application of the conceptual idea, we look at differentiable classifiers over continuous data and characterize Gradient-scores as follows: every qualitatively faithful feature attribution method is qualitatively equivalent to Gradient-scores. Furthermore, if an attribution method is quantitatively faithful in the sense that changes of the output of the classifier are proportional to the scores of features, then it is either equivalent to gradient-scoring or it is based on an inferior approximation of the classifier. To illustrate the practical relevance of the theory, we experimentally demonstrate that popular attribution methods can fail to give faithful explanations in the setting where the data is continuous and the classifier differentiable.
    Posterior Matching for Arbitrary Conditioning. (arXiv:2201.12414v3 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underlie some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g., discrete, hierarchical, VaDE).
    Automatic Spoken Language Identification using a Time-Delay Neural Network. (arXiv:2205.09564v1 [cs.CL])
    Closed-set spoken language identification is the task of recognizing the language being spoken in a recorded audio clip from a set of known languages. In this study, a language identification system was built and trained to distinguish between Arabic, Spanish, French, and Turkish based on nothing more than recorded speech. A pre-existing multilingual dataset was used to train a series of acoustic models based on the Tedlium TDNN model to perform automatic speech recognition. The system was provided with a custom multilingual language model and a specialized pronunciation lexicon with language names prepended to phones. The trained model was used to generate phone alignments to test data from all four languages, and languages were predicted based on a voting scheme choosing the most common language prepend in an utterance. Accuracy was measured by comparing predicted languages to known languages, and was determined to be very high in identifying Spanish and Arabic, and somewhat lower in identifying Turkish and French.
    Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search. (arXiv:2205.09676v1 [cs.CV])
    Existing trackers usually select the location or proposal with the maximum score as the tracking result for each frame. However, such a greedy search scheme may not be the optimal choice, especially when encountering challenging tracking scenarios like heavy occlusions and fast motion, since the accumulated errors make response scores unreliable. In this paper, we propose a novel multi-agent reinforcement learning-based beam search strategy (termed BeamTracking) to address this issue. Specifically, we formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. We take the target feature, proposal feature, and its response score as the state, and also consider actions predicted by nearby agents, to train multiple agents to select their actions. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validated the effectiveness of the proposed algorithm.
    scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data. (arXiv:2205.09523v1 [stat.ML])
    Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated from these technologies tend to have a high level of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.
    The First Optimal Acceleration of High-Order Methods in Smooth Convex Optimization. (arXiv:2205.09647v1 [math.OC])
    In this paper, we study the fundamental open question of finding the optimal high-order algorithm for solving smooth convex minimization problems. Arjevani et al. (2019) established the lower bound $\Omega\left(\epsilon^{-2/(3p+1)}\right)$ on the number of the $p$-th order oracle calls required by an algorithm to find an $\epsilon$-accurate solution to the problem, where the $p$-th order oracle stands for the computation of the objective function value and the derivatives up to the order $p$. However, the existing state-of-the-art high-order methods of Gasnikov et al. (2019b); Bubeck et al. (2019); Jiang et al. (2019) achieve the oracle complexity $\mathcal{O}\left(\epsilon^{-2/(3p+1)} \log (1/\epsilon)\right)$, which does not match the lower bound. The reason for this is that these algorithms require performing a complex binary search procedure, which makes them neither optimal nor practical. We fix this fundamental issue by providing the first algorithm with $\mathcal{O}\left(\epsilon^{-2/(3p+1)}\right)$ $p$-th order oracle complexity.
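    To make the rates above concrete, the exponent $2/(3p+1)$ can be evaluated for small $p$. This is simple arithmetic on the stated bound rather than a result from the paper; the $p = 1$ case recovers the well-known $\epsilon^{-1/2}$ rate of accelerated first-order methods.

```latex
\[
\begin{aligned}
p = 1 &: \quad \epsilon^{-2/(3\cdot 1 + 1)} = \epsilon^{-2/4} = \epsilon^{-1/2}, \\
p = 2 &: \quad \epsilon^{-2/(3\cdot 2 + 1)} = \epsilon^{-2/7}, \\
p = 3 &: \quad \epsilon^{-2/(3\cdot 3 + 1)} = \epsilon^{-2/10} = \epsilon^{-1/5}.
\end{aligned}
\]
```

    The gap closed by the paper is the extra $\log(1/\epsilon)$ factor in the earlier upper bound: for $\epsilon = 10^{-6}$ and the natural logarithm, that factor is $\ln(10^{6}) \approx 13.8$.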
    Machine learning applications for noisy intermediate-scale quantum computers. (arXiv:2205.09414v1 [quant-ph])
    Quantum machine learning has proven to be a fruitful area in which to search for potential applications of quantum computers. This is particularly true for those available in the near term, so-called noisy intermediate-scale quantum (NISQ) devices. In this Thesis, we develop and study three quantum machine learning applications suitable for NISQ computers, ordered in terms of increasing complexity of data presented to them. These algorithms are variational in nature and use parameterised quantum circuits (PQCs) as the underlying quantum machine learning model. The first application area is quantum classification using PQCs, where the data is classical feature vectors and their corresponding labels. Here, we study the robustness of certain data encoding strategies in such models against noise present in a quantum computer. The second area is generative modelling using quantum computers, where we use quantum circuit Born machines to learn and sample from complex probability distributions. We discuss and present a framework for quantum advantage for such models, propose gradient-based training methods and demonstrate these both numerically and on the Rigetti quantum computer up to 28 qubits. For our final application, we propose a variational algorithm in the area of approximate quantum cloning, where the data becomes quantum in nature. For the algorithm, we derive differentiable cost functions, prove theoretical guarantees such as faithfulness, and incorporate state-of-the-art methods such as quantum architecture search. Furthermore, we demonstrate how this algorithm is useful in discovering novel implementable attacks on quantum cryptographic protocols, focusing on quantum coin flipping and key distribution as examples.
    CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network. (arXiv:2205.09612v1 [cs.LG])
    In this paper, we propose a Classification Confidence Network (CLCNet) that can determine whether a classification model classifies input samples correctly. It takes a classification result in the form of a vector of any dimension and returns a confidence score, which represents the probability that an instance has been classified correctly. We can use CLCNet in a simple cascade system consisting of several SOTA (state-of-the-art) classification models, and our experiments show that the system has the following advantages: 1. The system can customize the average computation requirement (FLOPs) per image during inference. 2. Under the same computation requirement, the performance of the system can exceed that of any model that has the same structure as a model in the system but a different size. In fact, this is a new type of ensemble modeling. Like general ensemble modeling, it can achieve higher performance than a single classification model, yet our system requires much less computation than general ensemble modeling. We have uploaded our code to a GitHub repository: https://github.com/yaoching0/CLCNet-Rethinking-of-Ensemble-Modeling.
    Practical Skills Demand Forecasting via Representation Learning of Temporal Dynamics. (arXiv:2205.09508v1 [econ.GN])
    Rapid technological innovation threatens to leave much of the global workforce behind. Today's economy juxtaposes white-hot demand for skilled labor against stagnant employment prospects for workers unprepared to participate in a digital economy. It is a moment of peril and opportunity for every country, with outcomes measured in long-term capital allocation and the life satisfaction of billions of workers. To meet the moment, governments and markets must find ways to quicken the rate at which the supply of skills reacts to changes in demand. More fully and quickly understanding labor market intelligence is one route. In this work, we explore the utility of time series forecasts to enhance the value of skill demand data gathered from online job advertisements. This paper presents a pipeline which makes one-shot multi-step forecasts into the future using a decade of monthly skill demand observations based on a set of recurrent neural network methods. We compare the performance of a multivariate model versus a univariate one, analyze how correlation between skills can influence multivariate model results, and present predictions of demand for a selection of skills practiced by workers in the information technology industry.
    Gold-standard solutions to the Schrödinger equation using deep learning: How much physics do we need? (arXiv:2205.09438v1 [cs.LG])
    Finding accurate solutions to the Schrödinger equation is the key unsolved challenge of computational chemistry. Given its importance for the development of new chemical compounds, decades of research have been dedicated to this problem, but due to the large dimensionality even the best available methods do not yet reach the desired accuracy. Recently the combination of deep learning with Monte Carlo methods has emerged as a promising way to obtain highly accurate energies and moderate scaling of computational cost. In this paper we significantly contribute towards this goal by introducing a novel deep-learning architecture that achieves 40-70% lower energy error at 8x lower computational cost compared to previous approaches. Using our method we establish a new benchmark by calculating the most accurate variational ground state energies ever published for a number of different atoms and molecules. We systematically break down and measure our improvements, focusing in particular on the effect of increasing physical prior knowledge. We surprisingly find that increasing the prior knowledge given to the architecture can actually decrease accuracy.
    Spatial Autoregressive Coding for Graph Neural Recommendation. (arXiv:2205.09489v1 [cs.IR])
    Graph embedding methods, including traditional shallow models and deep Graph Neural Networks (GNNs), have led to promising applications in recommendation. Nevertheless, shallow models, especially random-walk-based algorithms, fail to adequately exploit neighbor proximity in sampled subgraphs or sequences due to their optimization paradigm. GNN-based algorithms make insufficient use of high-order information and easily cause over-smoothing when too many layers are stacked, which may deteriorate the recommendations of low-degree (long-tail) items, limiting expressiveness and scalability. In this paper, we propose a novel framework SAC, namely Spatial Autoregressive Coding, to solve the above problems in a unified way. To adequately leverage neighbor proximity and high-order information, we design a novel spatial autoregressive paradigm. Specifically, we first randomly mask multi-hop neighbors and embed the target node by integrating all other surrounding neighbors with an explicit multi-hop attention. Then we reinforce the model to learn a neighbor-predictive coding for the target node by contrasting the coding and the masked neighbors' embedding, equipped with a new hard negative sampling strategy. To learn the minimal sufficient representation for the target-to-neighbor prediction task and remove the redundancy of neighbors, we devise Neighbor Information Bottleneck by maximizing the mutual information between target predictive coding and the masked neighbors' embedding, and simultaneously constraining those between the coding and surrounding neighbors' embedding. Experimental results on both public recommendation datasets and a real-world web-scale dataset, Douyin-Friend-Recommendation, demonstrate the superiority of SAC compared with state-of-the-art methods.
    Learning-based AC-OPF Solvers on Realistic Network and Realistic Loads. (arXiv:2205.09452v1 [cs.LG])
    Deep learning approaches for the Alternating Current-Optimal Power Flow (AC-OPF) problem have been under active research in recent years. A common shortcoming in this area of research is the lack of a dataset that includes both a realistic power network topology and the corresponding realistic loads. To address this issue, we construct an AC-OPF formulation-ready dataset called TAS-97 that contains realistic network information and realistic bus loads from Tasmania's electricity network. We found that the realistic loads in Tasmania are correlated between buses and show signs of an underlying multivariate normal distribution. Feasibility-optimized end-to-end deep neural network models are trained and tested on the constructed dataset. Trained on samples with bus loads generated from a fitted multivariate normal distribution, our learning-based AC-OPF solver achieves a 0.13% cost optimality gap, a 99.73% feasibility rate, and a 38.62x speedup on realistic testing samples when compared to PYPOWER.
    Why only Micro-F1? Class Weighting of Measures for Relation Classification. (arXiv:2205.09460v1 [cs.CL])
    Relation classification models are conventionally evaluated using only a single measure, e.g., micro-F1, macro-F1 or AUC. In this work, we analyze weighting schemes, such as micro and macro, for imbalanced datasets. We introduce a framework for weighting schemes, where existing schemes are extremes, and two new intermediate schemes. We show that reporting results of different weighting schemes better highlights strengths and weaknesses of a model.
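    To make the micro/macro distinction concrete, here is a minimal sketch of the two existing extreme schemes for single-label predictions. It is an illustration only: the function names are hypothetical, every class is counted (real relation classification setups often exclude a "no relation" class), and the paper's new intermediate schemes are not reproduced.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Imbalanced toy data: a model that always predicts the majority class.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
```

    On this toy input the micro score stays high because the majority class dominates the pooled counts, while the macro score drops because the minority class contributes an F1 of zero with equal weight.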
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v1 [stat.ML])
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply to only pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.
    Mobility, Communication and Computation Aware Federated Learning for Internet of Vehicles. (arXiv:2205.09529v1 [cs.LG])
    While privacy concerns entice connected and automated vehicles to incorporate on-board federated learning (FL) solutions, a learning platform that integrates vehicle-to-everything communication and is aware of heterogeneous computation power is urgently needed to make this a reality. Motivated by this, we propose a novel mobility-, communication- and computation-aware online FL platform that uses on-road vehicles as learning agents. Thanks to the advanced features of modern vehicles, the on-board sensors can collect data as vehicles travel along their trajectories, while the on-board processors can train machine learning models using the collected data. To take the high mobility of vehicles into account, we consider the delay as a learning parameter and restrict it to be less than a tolerable threshold. To satisfy this threshold, the central server accepts partially trained models, the distributed roadside units (a) perform downlink multicast beamforming to minimize global model distribution delay and (b) allocate optimal uplink radio resources to minimize local model offloading delay, and the vehicle agents conduct heterogeneous local model training. Using real-world vehicle trace datasets, we validate our FL solutions. Simulation shows that the proposed integrated FL platform is robust and outperforms baseline models. With reasonable local training episodes, it can effectively satisfy all constraints and deliver near-ground-truth multi-horizon velocity and vehicle-specific power predictions.
    Learning Multiscale Convolutional Dictionaries for Image Reconstruction. (arXiv:2011.12815v3 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have been tremendously successful in solving imaging inverse problems. To understand their success, an effective strategy is to construct simpler and mathematically more tractable convolutional sparse coding (CSC) models that share essential ingredients with CNNs. Existing CSC methods, however, underperform leading CNNs in challenging inverse problems. We hypothesize that the performance gap may be attributed in part to how they process images at different spatial scales: While many CNNs use multiscale feature representations, existing CSC models mostly rely on single-scale dictionaries. To close the performance gap, we thus propose a multiscale convolutional dictionary structure. The proposed dictionary structure is derived from the U-Net, arguably the most versatile and widely used CNN for image-to-image learning problems. We show that incorporating the proposed multiscale dictionary in an otherwise standard CSC framework yields performance competitive with state-of-the-art CNNs across a range of challenging inverse problems including CT and MRI reconstruction. Our work thus demonstrates the effectiveness and scalability of the multiscale CSC approach in solving challenging inverse problems.
    ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution. (arXiv:2205.09548v1 [q-bio.BM])
    Directed evolution is a versatile technique in protein engineering that mimics the process of natural selection by iteratively alternating between mutagenesis and screening in order to search for sequences that optimize a given property of interest, such as catalytic activity and binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce in the vast sequence space. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry and biological pathways. Despite the great potential of these ML methods, they encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting a high-dimensional feature representation for protein sequences and to inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which combines a novel low-dimensional protein encoding strategy with Bayesian optimization enhanced with search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental cost and time cost of directed evolution, and expect that it can be further generalized as a powerful tool for adaptive experimental design in a broader context.
    An Introduction to Quantum Machine Learning for Engineers. (arXiv:2205.09510v1 [quant-ph])
    In the current noisy intermediate-scale quantum (NISQ) era, quantum machine learning is emerging as a dominant paradigm to program gate-based quantum computers. In quantum machine learning, the gates of a quantum circuit are parametrized, and the parameters are tuned via classical optimization based on data and on measurements of the outputs of the circuit. Parametrized quantum circuits (PQCs) can efficiently address combinatorial optimization problems, implement probabilistic generative models, and carry out inference (classification and regression). This monograph provides a self-contained introduction to quantum machine learning for an audience of engineers with a background in probability and linear algebra. It first describes the background, concepts, and tools necessary to describe quantum operations and measurements. Then, it covers parametrized quantum circuits, the variational quantum eigensolver, as well as unsupervised and supervised quantum machine learning formulations.
    Neural ODEs with Irregular and Noisy Data. (arXiv:2205.09479v1 [cs.LG])
    Measurement noise is an integral part of collecting data from a physical process. Thus, noise removal is necessary to draw conclusions from these data, and it often becomes essential to construct dynamical models using these data. We discuss a methodology to learn differential equation(s) using noisy and irregularly sampled measurements. The main innovation of our methodology is the integration of deep neural networks with the neural ordinary differential equations (ODEs) approach. Precisely, we aim at learning a neural network that provides (approximately) an implicit representation of the data and an additional neural network that models the vector fields of the dependent variables. We combine these two networks by constraining them using neural ODEs. The proposed framework to learn a model describing the vector field is highly effective under noisy measurements. The approach can handle scenarios where dependent variables are not available on the same temporal grid. Moreover, a particular structure, e.g., second-order with respect to time, can easily be incorporated. We demonstrate the effectiveness of the proposed method for learning models using data obtained from various differential equations and present a comparison with the neural ODE method that does not treat noise in any special way.
    Variational Inference for Bayesian Bridge Regression. (arXiv:2205.09515v1 [stat.ML])
    We study the implementation of Automatic Differentiation Variational Inference (ADVI) for Bayesian inference on regression models with bridge penalization. The bridge approach uses the $\ell_{\alpha}$ norm, with $\alpha \in (0, +\infty)$, to define a penalization on large values of the regression coefficients, which includes the Lasso ($\alpha = 1$) and ridge ($\alpha = 2$) penalizations as special cases. Full Bayesian inference seamlessly provides joint uncertainty estimates for all model parameters. Although MCMC approaches are available for bridge regression, they can be slow for large datasets, especially in high dimensions. The ADVI implementation allows the use of small batches of data at each iteration (due to stochastic gradient based algorithms), therefore reducing computational time in comparison with MCMC. We illustrate the approach on non-parametric regression models with B-splines, although the method works seamlessly for other choices of basis functions. A simulation study shows the main properties of the proposed method.
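    The special cases named above can be checked with a small sketch of the penalty term alone. It assumes the usual bridge form, a weight times the sum of $|\beta_j|^{\alpha}$; the function name and the `lam` weight are hypothetical, and nothing here implements ADVI or the paper's model.

```python
def bridge_penalty(beta, alpha, lam=1.0):
    """Bridge penalty lam * sum_j |beta_j|**alpha for alpha > 0.

    alpha = 1 gives the Lasso (L1) penalty and alpha = 2 gives the
    ridge (squared L2) penalty, as stated in the abstract.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return lam * sum(abs(b) ** alpha for b in beta)


beta = [3.0, -4.0]
lasso_value = bridge_penalty(beta, alpha=1)  # |3| + |-4|
ridge_value = bridge_penalty(beta, alpha=2)  # 3**2 + 4**2
```

    Values of `alpha` below 1 penalize small coefficients relatively more than the Lasso does, which is why the whole family is indexed by a single exponent.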
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v1 [cs.LG])
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using backpropagation. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v1 [cs.LG])
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyperparameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyperparameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, a numerical example is provided to explore the advantages of the super approximation power of ReLU NestNets.
    Torchhd: An Open-Source Python Library to Support Hyperdimensional Computing Research. (arXiv:2205.09208v1 [cs.LG])
    Hyperdimensional Computing (HDC) is a neuro-inspired computing framework that exploits high-dimensional random vector spaces. HDC uses extremely parallelizable arithmetic to provide computational solutions that balance accuracy, efficiency and robustness. This has proven especially useful in resource-limited scenarios such as embedded systems. The commitment of the scientific community to aggregate and disseminate research in this particularly multidisciplinary field has been fundamental for its advancement. Adding to this effort, we propose Torchhd, a high-performance open-source Python library for HDC. Torchhd seeks to make HDC more accessible and serves as an efficient foundation for research and application development. The easy-to-use library builds on top of PyTorch and features state-of-the-art HDC functionality, clear documentation and implementation examples from notable publications. Comparing publicly available code with their Torchhd implementation shows that experiments can run up to 104$\times$ faster. Torchhd is available at: https://github.com/hyperdimensional-computing/torchhd
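    For readers new to HDC, the following is a minimal pure-Python sketch of the basic operations on random bipolar hypervectors. It deliberately does not use the Torchhd API; all function names are hypothetical and chosen for illustration, and the dimension and seed are arbitrary.

```python
import random


def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Binding by elementwise multiplication; self-inverse for bipolar vectors."""
    return [x * y for x, y in zip(a, b)]


def bundle(*hvs):
    """Bundling by elementwise majority vote (sign of the sum, ties go to +1)."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]


def similarity(a, b):
    """Normalized dot product: 1.0 for identical vectors, near 0 for unrelated ones."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
dim = 10_000
key, value, other = (random_hv(dim, rng) for _ in range(3))

pair = bind(key, value)          # associate key with value
recovered = bind(pair, key)      # unbinding with the same key returns value
superposed = bundle(key, value, other)
```

    Independent random hypervectors are nearly orthogonal in high dimensions, while a bundle stays measurably similar to each of its members; these two properties are what make the representation robust.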
    TransTab: Learning Transferable Tabular Transformers Across Tables. (arXiv:2205.09328v1 [cs.LG])
    Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume the table structure stays fixed in training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns. This preprocessing often incurs significant data waste (e.g., removing unmatched columns and samples). How can ML models be learned from multiple tables with partially overlapping columns? How can ML models be updated incrementally as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How can an ML model be trained to predict on an unseen table? To answer all those questions, we propose to relax fixed table structures by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) to a generalizable embedding vector, and then apply stacked transformers for feature encoding. One methodological insight is combining column description and table cells as the raw input to a gated transformer model. The other insight is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, 1.78 out of 12 methods in supervised learning, feature incremental learning, and transfer learning scenarios, respectively; and the proposed pretraining leads to a 2.3% AUC lift on average over the supervised learning.
    ESCADA: Efficient Safety and Context Aware Dose Allocation for Precision Medicine. (arXiv:2111.13415v2 [cs.LG] UPDATED)
    Finding an optimal individualized treatment regimen is considered one of the most challenging precision medicine problems. Various patient characteristics influence the response to the treatment, and hence, there is no one-size-fits-all regimen. Moreover, the administration of an unsafe dose during the treatment can have adverse effects on health. Therefore, a treatment model must ensure patient \emph{safety} while \emph{efficiently} optimizing the course of therapy. We study a prevalent medical problem where the treatment aims to keep a physiological variable in a safe range and preferably close to a target level, which we refer to as \emph{leveling}. Such a task may be relevant in numerous other domains as well. We propose ESCADA, a novel and generic multi-armed bandit (MAB) algorithm tailored for the leveling task, to make safe, personalized, and context-aware dose recommendations. We derive high probability upper bounds on its cumulative regret and safety guarantees. Following ESCADA's design, we also describe its Thompson sampling-based counterpart. We discuss why straightforward adaptations of classical MAB algorithms such as GP-UCB may not be a good fit for the leveling task. Finally, we conduct \emph{in silico} experiments on the bolus-insulin dose allocation problem in type-1 diabetes mellitus and compare our algorithms against the well-known GP-UCB algorithm, rule-based dose calculators, and a clinician.
    Deep Fusion Prior for Multi-Focus Image Super Resolution Fusion. (arXiv:2110.05706v3 [cs.CV] UPDATED)
    This paper unifies the multi-focus image fusion (MFIF) and blind super resolution (SR) problems as the multi-focus image super resolution fusion (MFISRF) task, and proposes a novel unified dataset-free unsupervised framework named deep fusion prior (DFP) to address this task. DFP consists of the SKIPnet network, the DoubleReblur focus measurement tactic, a decision embedding module and loss functions. In particular, DFP can obtain MFISRF from only two low-resolution inputs without any external dataset; SKIPnet, which implements unsupervised learning via deep image prior, is an end-to-end generative network acting as the engine of DFP; DoubleReblur is used to determine the primary decision map without learning, based instead on the estimated PSF and convolution with Gaussian kernels; the decision embedding module optimizes the decision map via learning; and the DFP losses, composed of content loss, joint gradient loss and gradient limit loss, can obtain high-quality MFISRF results robustly. Experiments show that the proposed DFP approaches and even outperforms combinations of state-of-the-art MFIF and SR methods. Additionally, DFP is a general framework, so its networks and focus measurement tactics can be continuously updated to further improve the MFISRF performance. The DFP code is open source and will be made available soon.
    CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm. (arXiv:2103.00858v4 [cs.DB] UPDATED)
    Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, making them far from practical indexes. In this paper, we identify that ignoring the importance of data partitioning during model training is the main reason for this problem. Thus, we explicitly apply data partitioning to index construction and propose a new efficient and updatable cache-aware RMI framework, called CARMI. Specifically, we introduce entropy as a metric to quantify and characterize the effectiveness of data partitioning of tree nodes in learned indexes and propose a novel cost model, laying a new theoretical foundation for future research. Then, based on our novel cost model, CARMI can automatically determine tree structures and model types under various datasets and workloads by a hybrid construction algorithm without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, a new cache-aware design is also applied in CARMI, which makes full use of the characteristics of the CPU cache to effectively reduce the number of memory accesses. Our experimental study shows that CARMI performs better than baselines, achieving an average of 2.2x/1.9x speedup compared to B+ Tree/ALEX, while using only about 0.77x memory space of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2x over the nearest competitor RMI, which has been carefully tuned for each dataset in advance.
    Denoising Noisy Neural Networks: A Bayesian Approach with Compensation. (arXiv:2105.10699v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs) with noisy weights, which we refer to as noisy neural networks (NoisyNNs), arise from the training and inference of DNNs in the presence of noise. NoisyNNs emerge in many new applications, including the wireless transmission of DNNs, the efficient deployment or storage of DNNs in analog devices, and the truncation or quantization of DNN weights. This paper studies a fundamental problem of NoisyNNs: how to reconstruct the DNN weights from their noisy manifestations. While all prior works relied on maximum likelihood (ML) estimation, this paper puts forth a denoising approach to reconstruct DNNs with the aim of maximizing the inference accuracy of the reconstructed models. The superiority of our denoiser is rigorously proven in two small-scale problems, wherein we consider a quadratic neural network function and a shallow feedforward neural network, respectively. When applied to advanced learning tasks with modern DNN architectures, our denoiser exhibits significantly better performance than the ML estimator. We measure performance as the average test accuracy of the denoised DNN model versus the weight-variance-to-noise-power ratio (WNR). When denoising a noisy ResNet34 model arising from noisy inference, our denoiser outperforms ML estimation by up to 4.1 dB to achieve a test accuracy of 60%. When denoising a noisy ResNet18 model arising from noisy training, our denoiser outperforms ML estimation by 13.4 dB and 8.3 dB to achieve test accuracies of 60% and 80%, respectively.
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v2 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, on both horizontally and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy-preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example because of its versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant. Examples include analyses that require counting the number of samples that fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees and identify any concerns, and we compare the proposed method to the original solution in this respect.
    Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients. (arXiv:2109.11788v3 [cs.LG] UPDATED)
    Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. By sampling the weights of a linear combination of two approximate critics from a highly shrunk estimation bias interval, our Q-value update rule is not affected by the variance of the rewards received by the agents throughout learning. We test the performance of the introduced improvement on a set of MuJoCo and Box2D continuous control tasks and demonstrate that it considerably outperforms the existing approaches and improves the state-of-the-art by a significant margin.
    CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning. (arXiv:2112.04564v3 [cs.CV] UPDATED)
    In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced SSL. To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our code will be made publicly available.  ( 2 min )
    M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation. (arXiv:2112.07574v2 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.  ( 2 min )
    DBSegment: Fast and robust segmentation of deep brain structures -- Evaluation of transportability across acquisition domains. (arXiv:2110.09473v3 [eess.IV] UPDATED)
    Segmenting deep brain structures from magnetic resonance images is important for patient diagnosis, surgical planning, and research. Most current state-of-the-art solutions follow a segmentation-by-registration approach, where subject MRIs are mapped to a template with well-defined segmentations. However, registration-based pipelines are time-consuming, thus, limiting their clinical use. This paper uses deep learning to provide a robust and efficient deep brain segmentation solution. The method consists of a pre-processing step to conform all MRI images to the same orientation, followed by a convolutional neural network using the nnU-Net framework. We use a total of 14 datasets from both research and clinical collections. Of these, seven were used for training and validation and seven were retained for independent testing. We trained the network to segment 30 deep brain structures, as well as a brain mask, using labels generated from a registration-based approach. We evaluated the generalizability of the network by performing a leave-one-dataset-out cross-validation, and extensive testing on external datasets. Furthermore, we assessed cross-domain transportability by evaluating the results separately on different domains. We achieved an average DSC of 0.89 $\pm$ 0.04 on the independent testing datasets when compared to the registration-based gold standard. On our test system, the computation time decreased from 42 minutes for a reference registration-based pipeline to 1 minute. Our proposed method is fast, robust, and generalizes with high reliability. It can be extended to the segmentation of other brain structures. The method is publicly available on GitHub, as well as a pip package for convenient usage.  ( 3 min )
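    For readers unfamiliar with the reported metric, here is a minimal sketch of the Dice similarity coefficient (DSC) on flattened binary masks. The function name and the toy masks are illustrative assumptions; the paper's own evaluation code may differ, for example in how empty masks are handled.

```python
def dice_coefficient(pred, truth):
    """DSC = 2 * |A and B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy flattened masks (1 = voxel labelled as the structure).
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
score = dice_coefficient(predicted, reference)
```

    A DSC of 1 means the predicted and reference masks coincide, and 0 means they do not overlap at all.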
    Foundation Posteriors for Approximate Probabilistic Inference. (arXiv:2205.09735v1 [cs.LG])
    Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a ``foundation'' posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.  ( 2 min )
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v1 [cs.LG])
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks, including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretability compared to existing models.
    Deep Learning in Business Analytics: A Clash of Expectations and Reality. (arXiv:2205.09337v1 [cs.LG])
    Our fast-paced digital economy shaped by global competition requires increased data-driven decision-making based on artificial intelligence (AI) and machine learning (ML). The benefits of deep learning (DL) are manifold, but it comes with limitations that have - so far - interfered with widespread industry adoption. This paper explains why DL - despite its popularity - has difficulties speeding up its adoption within business analytics. It is shown - by a mixture of content analysis and empirical study - that the adoption of deep learning is not only affected by computational complexity, lacking big data architecture, lack of transparency (black-box), and skill shortage, but also by the fact that DL does not outperform traditional ML models in the case of structured datasets with fixed-length feature vectors. Deep learning should be regarded as a powerful addition to the existing body of ML models instead of a one size fits all solution.
    Darts: User-Friendly Modern Machine Learning for Time Series. (arXiv:2110.03224v3 [cs.LG] UPDATED)
    We present Darts, a Python machine learning library for time series, with a focus on forecasting. Darts offers a variety of models, from classics such as ARIMA to state-of-the-art deep neural networks. The emphasis of the library is on offering modern machine learning functionalities, such as supporting multidimensional series, meta-learning on multiple series, training on large datasets, incorporating external data, ensembling models, and providing a rich support for probabilistic forecasting. At the same time, great care goes into the API design to make it user-friendly and easy to use. For instance, all models can be used using fit()/predict(), similar to scikit-learn.  ( 2 min )
    Bypassing Logits Bias in Online Class-Incremental Learning with a Generative Framework. (arXiv:2205.09347v1 [cs.LG])
    Continual learning requires the model to maintain the learned knowledge while learning from a non-i.i.d data stream continually. Due to the single-pass training setting, online continual learning is very challenging, but it is closer to the real-world scenarios where quick adaptation to new data is appealing. In this paper, we focus on online class-incremental learning setting in which new classes emerge over time. Almost all existing methods are replay-based with a softmax classifier. However, the inherent logits bias problem in the softmax classifier is a main cause of catastrophic forgetting while existing solutions are not applicable for online settings. To bypass this problem, we abandon the softmax classifier and propose a novel generative framework based on the feature space. In our framework, a generative classifier which utilizes replay memory is used for inference, and the training objective is a pair-based metric learning loss which is proven theoretically to optimize the feature space in a generative way. In order to improve the ability to learn new data, we further propose a hybrid of generative and discriminative loss to train the model. Extensive experiments on several benchmarks, including newly introduced task-free datasets, show that our method beats a series of state-of-the-art replay-based methods with discriminative classifiers, and reduces catastrophic forgetting consistently with a remarkable margin.
    IL-flOw: Imitation Learning from Observation using Normalizing Flows. (arXiv:2205.09251v1 [cs.LG])
    We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only. Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods which require updating the reward model during policy search and are known to be unstable and difficult to optimize. Our method, IL-flOw, recovers the expert policy by modelling state-state transitions, by generating rewards using deep density estimators trained on the demonstration trajectories, avoiding the instability issues of adversarial methods. We demonstrate that using the state transition log-probability density as a reward signal for forward reinforcement learning translates to matching the trajectory distribution of the expert demonstrations, and experimentally show good recovery of the true reward signal as well as state of the art results for imitation from observation on locomotion and robotic continuous control tasks.  ( 2 min )
    Continual Pre-Training Mitigates Forgetting in Language and Vision. (arXiv:2205.09357v1 [cs.LG])
    Pre-trained models are nowadays a fundamental component of machine learning research. In continual learning, they are commonly used to initialize the model before training on the stream of non-stationary data. However, pre-training is rarely applied during continual learning. We formalize and investigate the characteristics of the continual pre-training scenario in both language and vision environments, where a model is continually pre-trained on a stream of incoming data and only later fine-tuned to different downstream tasks. We show that continually pre-trained models are robust against catastrophic forgetting and we provide strong empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols. Code is provided at https://github.com/AndreaCossu/continual-pretraining-nlp-vision .  ( 2 min )
    Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel. (arXiv:2205.09342v1 [stat.ML])
    Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data, and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression using the manifold-Hilbert kernel is weakly consistent in the setting of Devroye et al. 1998. For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.  ( 2 min )
    SiReN: Sign-Aware Recommendation Using Graph Neural Networks. (arXiv:2108.08735v2 [cs.IR] UPDATED)
    In recent years, many recommender systems using network embedding (NE) such as graph neural networks (GNNs) have been extensively studied in the sense of improving recommendation accuracy. However, such attempts have focused mostly on utilizing only the information of positive user-item interactions with high ratings. Thus, there is a challenge on how to make use of low rating scores for representing users' preferences since low ratings can be still informative in designing NE-based recommender systems. In this study, we present SiReN, a new sign-aware recommender system based on GNN models. Specifically, SiReN has three key components: 1) constructing a signed bipartite graph for more precisely representing users' preferences, which is split into two edge-disjoint graphs with positive and negative edges each, 2) generating two embeddings for the partitioned graphs with positive and negative edges via a GNN model and a multi-layer perceptron (MLP), respectively, and then using an attention model to obtain the final embeddings, and 3) establishing a sign-aware Bayesian personalized ranking (BPR) loss function in the process of optimization. Through comprehensive experiments, we empirically demonstrate that SiReN consistently outperforms state-of-the-art NE-aided recommendation methods.
    Semi-Supervised Learning for Image Classification using Compact Networks in the BioMedical Context. (arXiv:2205.09678v1 [cs.CV])
    The development of mobile and on the edge applications that embed deep convolutional neural models has the potential to revolutionise biomedicine. However, most deep learning models require computational resources that are not available in smartphones or edge devices; an issue that can be addressed by means of compact models. The problem with such models is that they are, at least usually, less accurate than bigger models. In this work, we study how this limitation can be addressed with the application of semi-supervised learning techniques. We conduct several statistical analyses to compare the performance of deep compact architectures when trained using semi-supervised learning methods for tackling image classification tasks in the biomedical context. In particular, we explore three families of compact networks, and two families of semi-supervised learning techniques for 10 biomedical tasks. By combining semi-supervised learning methods with compact networks, it is possible to obtain a similar performance to standard size networks. In general, the best results are obtained when combining data distillation with MixNet, and plain distillation with ResNet-18. Also, in general, NAS networks obtain better results than manually designed networks and quantized networks. The work presented in this paper shows the benefits of applying semi-supervised methods to compact networks; this allows us to create compact models that are not only as accurate as standard size models, but also faster and lighter. Finally, we have developed a library that simplifies the construction of compact models using semi-supervised learning methods.
    A Graph Data Augmentation Strategy with Entropy Preservation. (arXiv:2107.06048v2 [cs.LG] UPDATED)
    The Graph Convolutional Network (GCN) proposed by Kipf and Welling is an effective model for semi-supervised learning, but it faces the obstacle of over-smoothing, which weakens its representation ability. Recently, several works have been proposed to tackle this limitation by randomly perturbing the graph topology or feature matrix to generate data augmentations as input for training. However, these operations inevitably damage the integrity of information structures and have to sacrifice the smoothness of the feature manifold. In this paper, we first introduce a novel graph entropy definition as a measure to quantitatively evaluate the smoothness of a data manifold and then point out that this graph entropy is controlled by triangle motif-based information structures. Considering the preservation of graph entropy, we propose an effective strategy to generate randomly perturbed training data while maintaining both graph topology and graph entropy. Extensive experiments have been conducted on real-world datasets and the results verify the effectiveness of our proposed method in improving semi-supervised node classification accuracy compared with a large number of baselines. Beyond that, our proposed approach could significantly enhance the robustness of the training process for GCN.
    RankGen: Improving Text Generation with Large Ranking Models. (arXiv:2205.09726v1 [cs.CL])
    Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues, we present RankGen, an encoder model (1.2B parameters) that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, which discourage topically-similar but irrelevant generations; (2) sequences generated from a large language model conditioned on the prefix, which discourage repetition and hallucination. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling on both automatic metrics (85.0 vs 77.3 MAUVE) as well as human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We open source our model checkpoints, code, and human preferences with detailed explanations for future research.
    Differentially private Riemannian optimization. (arXiv:2205.09494v1 [math.OC])
    In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework of differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space compatible to such generalized Gaussian distribution. We show that this strategy presents a simple analysis as compared to directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesic (strongly) convex, general nonconvex objectives as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.
    AI-assisted Optimization of the ECCE Tracking System at the Electron Ion Collider. (arXiv:2205.09185v1 [physics.ins-det])
    The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory in approximately 10 years from now, with detector design and R&D currently ongoing. Notably, EIC is one of the first large-scale facilities to leverage Artificial Intelligence (AI) already starting from the design and R&D phases. The EIC Comprehensive Chromodynamics Experiment (ECCE) is a consortium that proposed a detector design based on a 1.5T solenoid. The EIC detector proposal review concluded that the ECCE design will serve as the reference design for an EIC detector. Herein we describe a comprehensive optimization of the ECCE tracker using AI. The work required a complex parametrization of the simulated detector system. Our approach dealt with an optimization problem in a multidimensional design space driven by multiple objectives that encode the detector performance, while satisfying several mechanical constraints. We describe our strategy and show results obtained for the ECCE tracking system. The AI-assisted design is agnostic to the simulation framework and can be extended to other sub-detectors or to a system of sub-detectors to further optimize the performance of the EIC detector.  ( 4 min )
    Fast matrix multiplication for binary and ternary CNNs on ARM CPU. (arXiv:2205.09120v1 [cs.LG])
    Low-bit quantized neural networks are of great interest in practical applications because they significantly reduce the consumption of both memory and computational resources. Binary neural networks are memory and computationally efficient as they require only one bit per weight and activation and can be computed using Boolean logic and bit count operations. QNNs with ternary weights and activations and binary weights and ternary activations aim to improve recognition quality compared to BNNs while preserving low bit-width. However, their efficient implementation is usually considered on ASICs and FPGAs, limiting their applicability in real-life tasks. At the same time, one of the areas where efficient recognition is most in demand is recognition on mobile devices using their CPUs. However, there are no known fast implementations of TBNs and TNN, only the daBNN library for BNNs inference. In this paper, we propose novel fast algorithms of ternary, ternary-binary, and binary matrix multiplication for mobile devices with ARM architecture. In our algorithms, ternary weights are represented using 2-bit encoding and binary - using one bit. It allows us to replace matrix multiplication with Boolean logic operations that can be computed on 128-bits simultaneously, using ARM NEON SIMD extension. The matrix multiplication results are accumulated in 16-bit integer registers. We also use special reordering of values in left and right matrices. All that allows us to efficiently compute a matrix product while minimizing the number of loads and stores compared to the algorithm from daBNN. Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs. We evaluate them experimentally on ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications.  ( 2 min )
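    The idea that binary products reduce to Boolean logic plus a bit count can be sketched in a few lines. The example below assumes the common encoding +1 -> bit 1 and -1 -> bit 0 and uses Python integers in place of SIMD registers; the function names are hypothetical, and this is not the paper's ARM NEON algorithm (it omits the 2-bit ternary encoding, the 16-bit accumulators, and the value reordering).

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (-1, 1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Positions where the bits differ contribute -1 and matching positions
    contribute +1, so the result is n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_word ^ b_word).count("1")
    return n - 2 * mismatches


def binary_matmul(A, B):
    """Multiply +1/-1 matrices A (m x n) and B (n x p) via XOR and popcount."""
    n = len(A[0])
    if len(B) != n:
        raise ValueError("inner dimensions must match")
    a_rows = [pack_bits(row) for row in A]
    b_cols = [pack_bits([B[k][j] for k in range(n)]) for j in range(len(B[0]))]
    return [[binary_dot(a, b, n) for b in b_cols] for a in a_rows]


A = [[1, -1, 1], [-1, -1, 1]]
B = [[1, 1], [-1, 1], [1, -1]]
C = binary_matmul(A, B)
```

    Packing each row and column once and then reusing the packed words is what lets a hardware implementation process many elements per instruction; here the Python integer simply stands in for that wide register.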
    A2C is a special case of PPO. (arXiv:2205.09123v1 [cs.LG])
    Advantage Actor-critic (A2C) and Proximal Policy Optimization (PPO) are popular deep reinforcement learning algorithms used for game AI in recent years. A common understanding is that A2C and PPO are separate algorithms because PPO's clipped objective appears significantly different than A2C's objective. In this paper, however, we show A2C is a special case of PPO. We present theoretical justifications and pseudocode analysis to demonstrate why. To validate our claim, we conduct an empirical experiment using \texttt{Stable-baselines3}, showing A2C and PPO produce the \textit{exact} same models when other settings are controlled.  ( 2 min )
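    The abstract does not spell out the argument, so the following is only a sketch of one standard way to see the relationship, using the usual notation for PPO's clipped objective. It assumes a single gradient step per batch of rollouts, so that the gradient is evaluated at $\theta = \theta_{\mathrm{old}}$; the full set of settings the paper controls is not listed here.

```latex
\begin{align*}
L^{\mathrm{CLIP}}(\theta)
  &= \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,A_t,\;
     \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\right],
  \qquad
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \\[4pt]
\nabla_\theta L^{\mathrm{CLIP}}(\theta)\Big|_{\theta=\theta_{\mathrm{old}}}
  &= \mathbb{E}_t\!\left[\nabla_\theta r_t(\theta)\Big|_{\theta=\theta_{\mathrm{old}}}\,A_t\right]
   = \mathbb{E}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big|_{\theta=\theta_{\mathrm{old}}}\,A_t\right]
\end{align*}
```

    At $\theta = \theta_{\mathrm{old}}$ the ratio equals 1, which lies strictly inside the clipping interval, so the clip is inactive and $\nabla_\theta r_t = r_t \nabla_\theta \log \pi_\theta = \nabla_\theta \log \pi_\theta$. The last expression is the policy-gradient term that A2C ascends.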
    LeRaC: Learning Rate Curriculum. (arXiv:2205.09180v1 [cs.LG])
    Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages the use of a different learning rate for each layer of a neural network to create a data-free curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on eight datasets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures, comparing our approach with the conventional training regime. Moreover, we also compare with Curriculum by Smoothing (CBS), a state-of-the-art data-free curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all datasets and models. Furthermore, we significantly surpass CBS in terms of training time (there is no additional cost over the standard training regime for LeRaC).  ( 2 min )
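    A rough sketch of such a per-layer schedule is given below. The abstract does not state the exact rule, so the geometric spacing of the initial rates and the exponential growth toward the shared value are assumptions made for illustration, as are the function and parameter names.

```python
def layer_learning_rates(num_layers, target_lr, lowest_initial_lr, warmup_steps, step):
    """Return one learning rate per layer (index 0 = closest to the input).

    Assumed schedule, not taken from the paper:
    - initial rates are spaced geometrically from target_lr (first layer)
      down to lowest_initial_lr (last layer);
    - each rate grows exponentially so that all layers reach target_lr
      exactly at warmup_steps, after which training proceeds as usual.
    """
    if num_layers < 1 or warmup_steps < 1 or step < 0:
        raise ValueError("num_layers and warmup_steps must be >= 1, step >= 0")
    if not 0 < lowest_initial_lr <= target_lr:
        raise ValueError("need 0 < lowest_initial_lr <= target_lr")
    if step >= warmup_steps:
        return [target_lr] * num_layers

    progress = step / warmup_steps
    rates = []
    for j in range(num_layers):
        depth = j / (num_layers - 1) if num_layers > 1 else 0.0
        initial = target_lr * (lowest_initial_lr / target_lr) ** depth
        rates.append(initial * (target_lr / initial) ** progress)
    return rates


start = layer_learning_rates(4, 0.1, 1e-4, warmup_steps=5, step=0)
middle = layer_learning_rates(4, 0.1, 1e-4, warmup_steps=5, step=3)
end = layer_learning_rates(4, 0.1, 1e-4, warmup_steps=5, step=5)
```

    In a training loop, the returned list would be written into the optimizer's per-layer parameter groups at each step; deeper layers start lower and therefore rise at a faster pace to meet the shared value.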
    Routing and Placement of Macros using Deep Reinforcement Learning. (arXiv:2205.09289v1 [cs.LG])
    Chip placement has been one of the most time-consuming tasks in the semiconductor field. Because of this, many projects are pushed back and chip availability in real markets is delayed. An engineer placing macros on a chip also needs to place them optimally with respect to three important factors: power, performance and time. In view of these problems, we introduce a new method using Reinforcement Learning in which we train the model to place the nodes of a chip netlist onto a chip canvas. We want to build a neural architecture that will accurately reward the agent across a wide variety of input netlists.  ( 2 min )
    On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning. (arXiv:2205.09121v1 [cs.LG])
    While first-order methods are popular for solving optimization problems that arise in large-scale deep learning problems, they come with some acute deficiencies. To diminish such shortcomings, there has been recent interest in applying second-order methods such as quasi-Newton based methods which construct Hessian approximations using only gradient information. The main focus of our work is to study the behaviour of stochastic quasi-Newton algorithms for training deep neural networks. We have analyzed the performance of two well-known quasi-Newton updates, the limited memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the Symmetric Rank One (SR1). This study fills a gap concerning the real performance of both updates and analyzes whether more efficient training is obtained when using the more robust BFGS update or the cheaper SR1 formula which allows for indefinite Hessian approximations and thus can potentially help to better navigate the pathological saddle points present in the non-convex loss functions found in deep learning. We present and discuss the results of an extensive experimental study which includes the effect of batch normalization and network architecture, the limited memory parameter, the batch size and the type of sampling strategy. We show that stochastic quasi-Newton optimizers are efficient and able to outperform in some instances the well-known first-order Adam optimizer run with the optimal combination of its numerous hyperparameters.  ( 2 min )
    BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video. (arXiv:2205.09382v1 [eess.IV])
    Predicting fetal weight at birth is an important aspect of perinatal care, particularly in the context of antenatal management, which includes the planned timing and the mode of delivery. Accurate prediction of weight using prenatal ultrasound is challenging as it requires images of specific fetal body parts during advanced pregnancy which is difficult to capture due to poor quality of images caused by the lack of amniotic fluid. As a consequence, predictions which rely on standard methods often suffer from significant errors. In this paper we propose the Residual Transformer Module which extends a 3D ResNet-based network for analysis of 2D+t spatio-temporal ultrasound video scans. Our end-to-end method, called BabyNet, automatically predicts fetal birth weight based on fetal ultrasound video scans. We evaluate BabyNet using a dedicated clinical set comprising 225 2D fetal ultrasound videos of pregnancies from 75 patients performed one day prior to delivery. Experimental results show that BabyNet outperforms several state-of-the-art methods and estimates the weight at birth with accuracy comparable to human experts. Furthermore, combining estimates provided by human experts with those computed by BabyNet yields the best results, outperforming either of other methods by a significant margin. The source code of BabyNet is available at https://github.com/SanoScience/BabyNet.  ( 2 min )
    Stochastic uncertainty analysis of gravity gradient tensor components and their combinations. (arXiv:2205.09159v1 [physics.geo-ph])
    Full tensor gravity (FTG) devices provide up to five independent components of the gravity gradient tensor. However, we do not yet have a quantitative understanding of which tensor components or combinations of components are more important to recover a subsurface density model by gravity inversion. This is mainly because different components may be more appropriate in different scenarios or for different purposes. Knowledge of these components in different environments can aid the selection of optimal component combinations. In this work, we propose to apply stochastic inversion to assess the uncertainty of gravity gradient tensor components and their combinations. The method is therefore a quantitative approach. The method applied here is based on the geostatistical inversion (Gaussian process regression) concept using cokriging. The cokriging variances (variance function of the GP) are found to be a useful indicator for distinguishing the gravity gradient tensor components. This approach is applied to the New Found dataset to demonstrate its effectiveness in real-world applications.  ( 2 min )
    A False Sense of Security? Revisiting the State of Machine Learning-Based Industrial Intrusion Detection. (arXiv:2205.09199v1 [cs.CR])
    Anomaly-based intrusion detection promises to detect novel or unknown attacks on industrial control systems by modeling expected system behavior and raising corresponding alarms for any deviations. As manually creating these behavioral models is tedious and error-prone, research focuses on machine learning to train them automatically, achieving detection rates upwards of 99%. However, these approaches are typically trained not only on benign traffic but also on attacks and then evaluated against the same type of attack used for training. Hence, their actual, real-world performance on unknown (not trained on) attacks remains unclear. In turn, the reported near-perfect detection rates of machine learning-based intrusion detection might create a false sense of security. To assess this situation and clarify the real potential of machine learning-based industrial intrusion detection, we develop an evaluation methodology and examine multiple approaches from literature for their performance on unknown attacks (excluded from training). Our results highlight an ineffectiveness in detecting unknown attacks, with detection rates dropping to between 3.2% and 14.7% for some types of attacks. Moving forward, we derive recommendations for further research on machine learning-based approaches to ensure clarity on their ability to detect unknown attacks.  ( 2 min )
    DDXPlus: A new Dataset for Medical Automatic Diagnosis. (arXiv:2205.09148v1 [cs.CL])
    There has been rapidly growing interest in Automatic Diagnosis (AD) and Automatic Symptom Detection (ASD) systems in the machine learning research literature, aiming to assist doctors in telemedicine services. These systems are designed to interact with patients, collect evidence relevant to their concerns, and make predictions about the underlying diseases. Doctors would review the interaction, including the evidence and the predictions, before making their final decisions. Despite the recent progress, an important piece of doctors' interactions with patients is missing in the design of AD and ASD systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset that includes a differential diagnosis, along with the ground truth pathology, for each patient. In addition, this dataset includes more pathologies, as well as types of symptoms and antecedents. As a proof-of-concept, we extend several existing AD and ASD systems to incorporate differential diagnosis, and provide empirical evidence that using differentials in training signals is essential for such systems to learn to predict differentials. Dataset available at https://github.com/bruzwen/ddxplus  ( 2 min )
    Computing the ensemble spread from deterministic weather predictions using conditional generative adversarial networks. (arXiv:2205.09182v1 [cs.LG])
    Ensemble prediction systems are an invaluable tool for weather forecasting. Practically, ensemble predictions are obtained by running several perturbations of the deterministic control forecast. However, ensemble prediction is associated with a high computational cost and often involves statistical post-processing steps to improve its quality. Here we propose to use deep-learning-based algorithms to learn the statistical properties of an ensemble prediction system, the ensemble spread, given only the deterministic control forecast. Thus, once trained, the costly ensemble prediction system will not be needed anymore to obtain future ensemble forecasts, and the statistical properties of the ensemble can be derived from a single deterministic forecast. We adapt the classical pix2pix architecture to a three-dimensional model and also experiment with a shared latent space encoder-decoder model, and train them against several years of operational (ensemble) weather forecasts for the 500 hPa geopotential height. The results demonstrate that the trained models indeed allow obtaining a highly accurate ensemble spread from the control forecast only.  ( 2 min )
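    As a rough illustration of the quantity the models above are trained to predict, the sketch below computes an ensemble spread as the per-grid-point standard deviation across ensemble members. This is an assumption made for illustration: the abstract does not state the exact definition used (population versus sample standard deviation, for example), and the function name `ensemble_spread` and the flat-list data layout are hypothetical.

```python
from statistics import pstdev


def ensemble_spread(members):
    """Per-grid-point spread of an ensemble.

    members: list of forecasts, each a flat list of grid-point values
    (hypothetical layout). Spread is taken here as the population
    standard deviation across members, which is an assumed convention.
    """
    if len(members) < 2:
        raise ValueError("need at least two ensemble members")
    # zip(*members) groups the values of all members at the same grid point.
    return [pstdev(values) for values in zip(*members)]


# Two members on a two-point grid: they disagree at the first point only.
spread = ensemble_spread([[1.0, 2.0], [3.0, 2.0]])
print(spread)
```

    A learned model in the spirit of the abstract would take only the deterministic control forecast as input and regress a field like `spread` directly, so the members themselves would be needed only when building training targets.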
    MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes. (arXiv:2205.09248v1 [cs.SD])
    We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.  ( 2 min )
    Transformer-based Program Synthesis for Low-Data Environments. (arXiv:2205.09246v1 [cs.PL])
    Recent advancements in large pre-trained transformer models (GPT2/3, T5) have found use in program synthesis to generate programs that satisfy a set of input/output examples. However, these models perform poorly on long-horizon and low-data tasks, and often don't seem to understand the semantics of the languages they generate. We investigate an approach that tackles both of these issues, by using attributed context-free-grammars of programming languages to generate programs, and then analyzing generated programs so that they can be annotated with compile and runtime attributes, such as types, so that information about the program can be remembered during long-horizon generation. We first find that synthesized datasets can be made efficiently and can provide transformer models with enough data to perform well on some synthesis tasks. We also find that giving models access to program attributes is especially effective in low-data environments, and tends to improve the quality and reduce errors of transformer-generated programs.  ( 2 min )
    Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research. (arXiv:2205.09114v1 [cond-mat.quant-gas])
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33 % of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    Hybrid Machine Learning Modeling of Engineering Systems -- A Probabilistic Perspective Tested on a Multiphase Flow Modeling Case Study. (arXiv:2205.09196v1 [cs.LG])
    To operate process engineering systems in a safe and reliable manner, predictive models are often used in decision making. In many cases, these are mechanistic first principles models which aim to accurately describe the process. In practice, the parameters of these models need to be tuned to the process conditions at hand. If the conditions change, which is common in practice, the model becomes inaccurate and needs to be re-tuned. In this paper, we propose a hybrid modeling machine learning framework that allows tuning first principles models to process conditions using two different types of Bayesian Neural Networks. Our approach not only estimates the expected values of the first principles model parameters but also quantifies the uncertainty of these estimates. Such an approach of hybrid machine learning modeling is not yet well described in the literature, so we believe this paper will provide an additional angle at which hybrid machine learning modeling of physical systems can be considered. As an example, we choose a multiphase pipe flow process for which we constructed a three-phase steady state model based on the drift-flux approach which can be used for modeling of pipe and well flow behavior in oil and gas production systems with or without the neural network tuning. In the simulation results, we show how uncertainty estimates of the resulting hybrid models can be used to make better operation decisions.  ( 2 min )
    PreQuEL: Quality Estimation of Machine Translation Outputs in Advance. (arXiv:2205.09178v1 [cs.CL])
    We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-art. From a theoretical perspective, PreQuEL places the focus on the source text, tracing properties, possibly linguistic features, that make a sentence harder to machine translate. We develop a baseline model for the task and analyze its performance. We also develop a data augmentation method (from parallel corpora), that improves results substantially. We show that this augmentation method can improve the performance of the Quality-Estimation task as well. We investigate the properties of the input text that our model is sensitive to, by testing it on challenge sets and different languages. We conclude that it is aware of syntactic and semantic distinctions, and correlates and even over-emphasizes the importance of standard NLP features.  ( 2 min )
    High-Order Multilinear Discriminant Analysis via Order-$\textit{n}$ Tensor Eigendecomposition. (arXiv:2205.09191v1 [cs.LG])
    Higher-order data with high dimensionality is of immense importance in many areas of machine learning, computer vision, and video analytics. Multidimensional arrays (commonly referred to as tensors) are used for arranging higher-order data structures while keeping the natural representation of the data samples. In the past decade, great efforts have been made to extend the classic linear discriminant analysis for higher-order data classification generally referred to as multilinear discriminant analysis (MDA). Most of the existing approaches are based on the Tucker decomposition and $\textit{n}$-mode tensor-matrix products. The current paper presents a new approach to tensor-based multilinear discriminant analysis referred to as High-Order Multilinear Discriminant Analysis (HOMLDA). This approach is based upon the tensor decomposition where an order-$\textit{n}$ tensor can be written as a product of order-$\textit{n}$ tensors and has a natural extension to traditional linear discriminant analysis (LDA). Furthermore, the resulting framework, HOMLDA, might produce a within-class scatter tensor that is close to singular. Thus, computing the inverse inaccurately may distort the discriminant analysis. To address this problem, an improved method referred to as Robust High-Order Multilinear Discriminant Analysis (RHOMLDA) is introduced. Experimental results on multiple data sets illustrate that our proposed approach provides improved classification performance with respect to the current Tucker decomposition-based supervised learning methods.  ( 2 min )
    FedILC: Weighted Geometric Mean and Invariant Gradient Covariance for Federated Learning on Non-IID Data. (arXiv:2205.09305v1 [cs.LG])
    Federated learning is a distributed machine learning approach which enables a shared server model to learn by aggregating the locally-computed parameter updates with the training data from spatially-distributed client silos. Though successfully possessing advantages in both scale and privacy, federated learning is hurt by domain shift problems, where the learning models are unable to generalize to unseen domains whose data distribution is non-i.i.d. with respect to the training domains. In this study, we propose the Federated Invariant Learning Consistency (FedILC) approach, which leverages the gradient covariance and the geometric mean of Hessians to capture both inter-silo and intra-silo consistencies of environments and unravel the domain shift problems in federated networks. The benchmark and real-world dataset experiments bring evidence that our proposed algorithm outperforms conventional baselines and similar federated learning algorithms. This is relevant to various fields such as medical healthcare, computer vision, and the Internet of Things (IoT). The code is released at https://github.com/mikemikezhu/FedILC.  ( 2 min )
    AutoQML: Automated Quantum Machine Learning for Wi-Fi Integrated Sensing and Communications. (arXiv:2205.09115v1 [cs.LG])
    Commercial Wi-Fi devices can be used for integrated sensing and communications (ISAC) to jointly exchange data and monitor indoor environments. In this paper, we investigate a proof-of-concept approach using an automated quantum machine learning (AutoQML) framework called AutoAnsatz to recognize human gestures. We address how to efficiently design quantum circuits to configure quantum neural networks (QNN). The effectiveness of AutoQML is validated by an in-house experiment for human pose recognition, achieving state-of-the-art performance greater than 80% accuracy for a limited data size with a significantly small number of trainable parameters.  ( 2 min )
    Mitigating Neural Network Overconfidence with Logit Normalization. (arXiv:2205.09310v1 [cs.LG])
    Detecting out-of-distribution inputs is critical for safe deployment of machine learning models in the real world. However, neural networks are known to suffer from the overconfidence issue, where they produce abnormally high confidence for both in- and out-of-distribution inputs. In this work, we show that this issue can be mitigated through Logit Normalization (LogitNorm) -- a simple fix to the cross-entropy loss -- by enforcing a constant vector norm on the logits in training. Our method is motivated by the analysis that the norm of the logit keeps increasing during training, leading to overconfident output. Our key idea behind LogitNorm is thus to decouple the influence of output's norm during network optimization. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data. Extensive experiments demonstrate the superiority of LogitNorm, reducing the average FPR95 by up to 42.30% on common benchmarks.  ( 2 min )
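    The idea of enforcing a constant norm on the logits can be sketched as follows. This is a minimal illustration, not the authors' implementation: the logit vector is rescaled to unit L2 norm before the usual cross-entropy is applied. The temperature parameter `tau`, the stabilizer `eps`, and all function names are assumptions introduced here, since the abstract does not give these details.

```python
import math


def log_softmax(z):
    # Numerically stable log-softmax via the log-sum-exp trick.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def cross_entropy(logits, label):
    # Standard cross-entropy for a single example with an integer label.
    return -log_softmax(logits)[label]


def logit_norm_loss(logits, label, tau=1.0, eps=1e-7):
    # Rescale the logits to (approximately) unit L2 norm, then apply
    # cross-entropy. tau is an assumed temperature hyperparameter.
    norm = math.sqrt(sum(v * v for v in logits)) + eps
    scaled = [v / (norm * tau) for v in logits]
    return cross_entropy(scaled, label)


small = [1.0, 2.0, 3.0]
large = [10.0, 20.0, 30.0]  # same direction, ten times the norm
print(cross_entropy(small, 2), cross_entropy(large, 2))
print(logit_norm_loss(small, 2), logit_norm_loss(large, 2))
```

    In this sketch, plain cross-entropy keeps decreasing as the logit norm grows, which is the incentive toward overconfidence that the abstract describes, while the normalized loss depends only on the direction of the logit vector.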
  • Open

    Approximating Persistent Homology for Large Datasets. (arXiv:2204.09155v2 [stat.ML] UPDATED)
    Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying smaller multiple subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
    Design choice and machine learning model performances. (arXiv:2201.10239v2 [stat.ML] UPDATED)
    An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and model for data analysis is often not driven by statistical or algorithmic advantages, thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This article discusses the choice of design in relation to the ML model performances. A study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    High-dimensional regression with potential prior information on variable importance. (arXiv:2109.11281v2 [stat.ME] UPDATED)
    There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and time series settings. An R package is available on github.
    Posterior Matching for Arbitrary Conditioning. (arXiv:2201.12414v3 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underlie some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g., discrete, hierarchical, VaDE).
    Spiked Covariance Estimation from Modulo-Reduced Measurements. (arXiv:2110.01150v3 [cs.IT] UPDATED)
    Consider the rank-1 spiked model: $\bf{X}=\sqrt{\nu}\xi \bf{u}+ \bf{Z}$, where $\nu$ is the spike intensity, $\bf{u}\in\mathbb{S}^{k-1}$ is an unknown direction and $\xi\sim \mathcal{N}(0,1),\bf{Z}\sim \mathcal{N}(\bf{0},\bf{I})$. Motivated by recent advances in analog-to-digital conversion, we study the problem of recovering $\bf{u}\in \mathbb{S}^{k-1}$ from $n$ i.i.d. modulo-reduced measurements $\bf{Y}=[\bf{X}]\mod \Delta$, focusing on the high-dimensional regime ($k\gg 1$). We develop and analyze an algorithm that, for most directions $\bf{u}$ and $\nu=\mathrm{poly}(k)$, estimates $\bf{u}$ to high accuracy using $n=\mathrm{poly}(k)$ measurements, provided that $\Delta\gtrsim \sqrt{\log k}$. Up to constants, our algorithm accurately estimates $\bf{u}$ at the smallest possible $\Delta$ that allows (in an information-theoretic sense) to recover $\bf{X}$ from $\bf{Y}$. A key step in our analysis involves estimating the probability that a line segment of length $\approx\sqrt{\nu}$ in a random direction $\bf{u}$ passes near a point in the lattice $\Delta \mathbb{Z}^k$. Numerical experiments show that the developed algorithm performs well even in a non-asymptotic setting.
    Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients. (arXiv:2109.11788v3 [cs.LG] UPDATED)
    Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. By sampling the weights of a linear combination of two approximate critics from a highly shrunk estimation bias interval, our Q-value update rule is not affected by the variance of the rewards received by the agents throughout learning. We test the performance of the introduced improvement on a set of MuJoCo and Box2D continuous control tasks and demonstrate that it considerably outperforms the existing approaches and improves the state-of-the-art by a significant margin.
    Turbulent field fluctuations in gyrokinetic and fluid plasmas. (arXiv:2107.09744v2 [physics.plasm-ph] CROSS LISTED)
    A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is the validation in accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades simulated boundary plasmas in experiment, but significant questions exist regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first ever direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.
    Overcoming challenges in leveraging GANs for few-shot data augmentation. (arXiv:2203.16662v2 [stat.ML] UPDATED)
    In this paper, we explore the use of GAN-based few-shot data augmentation as a method to improve few-shot classification performance. We perform an exploration into how a GAN can be fine-tuned for such a task (one of which is in a class-incremental manner), as well as a rigorous empirical investigation into how well these models can perform to improve few-shot classification. We identify issues related to the difficulty of training such generative models under a purely supervised regime with very few examples, as well as issues regarding the evaluation protocols of existing works. We also find that in this regime, classification accuracy is highly sensitive to how the classes of the dataset are randomly split. Therefore, we propose a semi-supervised fine-tuning approach as a more pragmatic way forward to address these problems.
    SEMI: Self-supervised Exploration via Multisensory Incongruity. (arXiv:2009.12494v2 [cs.LG] UPDATED)
    Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy by incentivizing the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and it improves sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v2 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schrödinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence at an exponential rate to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v2 [stat.ML] UPDATED)
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show on two public benchmark datasets that the resulting model improves forecasting performance.
    Foundation Posteriors for Approximate Probabilistic Inference. (arXiv:2205.09735v1 [cs.LG])
    Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a "foundation" posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.
    Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v4 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. Our first result is a new universal lower bound on the number of single-node interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of single-node interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better. To prove our lower bound, we develop the notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. We also use the techniques developed here to extend our results to the setting of multi-node interventions.
    M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation. (arXiv:2112.07574v2 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.
    Bayesian Network Structure Learning using Digital Annealer. (arXiv:2006.06926v3 [cs.LG] UPDATED)
    Annealing processors, which solve a quadratic unconstrained binary optimization (QUBO), are a potential breakthrough in improving the accuracy of score-based Bayesian network structure learning. However, currently, the bit capacity of an annealing processor is very limited. To utilize the power of annealing processors, it is necessary to encode score-based learning problems into QUBO within the upper bound of bits. In this paper, we propose a novel approach with the decomposition of candidate parent sets. Experimental results on benchmark networks with $37$ to $223$ variables show that our approach requires fewer bits than the bit capacity of the fourth-generation Fujitsu Digital Annealer, a fully coupled annealing processor developed with semiconductor technology. Moreover, we demonstrate that the Digital Annealer with our conversion method outperforms existing algorithms on some benchmark networks. It is expected that our approach promotes the utility of annealing processors in learning the Bayesian network.
    Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime. (arXiv:1806.04884v3 [stat.ML] UPDATED)
    In this paper, we theoretically prove that the deep ReLU neural networks do not lie in spurious local minima in the loss landscape under the Neural Tangent Kernel (NTK) regime, that is, in the gradient descent training dynamics of the deep ReLU neural networks whose parameters are initialized by a normal distribution in the limit as the widths of the hidden layers tend to infinity.  ( 2 min )
    Flexible Modeling and Multitask Learning using Differentiable Tree Ensembles. (arXiv:2205.09717v1 [cs.LG])
    Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those by popular toolkits.  ( 2 min )
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v4 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows to express complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.  ( 2 min )
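    The idea in the abstract above (write down an optimality condition $F$, then differentiate the solution through the implicit function theorem) can be sketched on a scalar toy problem. Everything here is an assumption for illustration and not the paper's library or API: the objective, the constant `A_TARGET`, the gradient-descent solver, and the finite-difference helper `partial`, which stands in for autodiff of $F$.

```python
# Sketch of implicit differentiation via the implicit function theorem.
# Toy problem (an assumption, not taken from the paper):
#   x*(theta) = argmin_x 0.5 * (x - a)**2 + 0.5 * theta * x**2
# Optimality condition: F(x, theta) = (x - a) + theta * x = 0.

A_TARGET = 3.0  # hypothetical constant "a" in the toy objective


def F(x, theta):
    """Optimality condition: derivative of the objective with respect to x."""
    return (x - A_TARGET) + theta * x


def solve(theta, steps=200, lr=0.1):
    """Any solver could be used; this one is plain gradient descent."""
    x = 0.0
    for _ in range(steps):
        x -= lr * F(x, theta)
    return x


def partial(f, argnum, args, eps=1e-6):
    """Central finite difference, standing in for autodiff of F."""
    hi, lo = list(args), list(args)
    hi[argnum] += eps
    lo[argnum] -= eps
    return (f(*hi) - f(*lo)) / (2 * eps)


def implicit_grad(theta):
    """d x*/d theta = -(dF/dx)^{-1} * dF/dtheta, evaluated at the solution."""
    x_star = solve(theta)
    dF_dx = partial(F, 0, (x_star, theta))
    dF_dtheta = partial(F, 1, (x_star, theta))
    return -dF_dtheta / dF_dx
```

    For this toy problem the solution is $x^*(\theta) = a/(1+\theta)$, so `implicit_grad(theta)` can be checked against the closed form $-a/(1+\theta)^2$. Note that the solver is never differentiated; only $F$ is, which is the decoupling the abstract describes.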
    Spherical Perspective on Learning with Normalization Layers. (arXiv:2006.13382v3 [cs.LG] UPDATED)
    Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, allows to translate the optimization steps on the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, the first effective learning rate expression of Adam is derived. Then the demonstration that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere, stems from the framework. Finally, this analysis outlines phenomena that previous variants of Adam act on and their importance in the optimization process are experimentally validated.  ( 2 min )
    The Franz-Parisi Criterion and Computational Trade-offs in High Dimensional Statistics. (arXiv:2205.09727v1 [math.ST])
    Many high-dimensional statistical inference problems are believed to possess inherent computational hardness. Various frameworks have been proposed to give rigorous evidence for such hardness, including lower bounds against restricted models of computation (such as low-degree functions), as well as methods rooted in statistical physics that are based on free energy landscapes. This paper aims to make a rigorous connection between the seemingly different low-degree and free-energy based approaches. We define a free-energy based criterion for hardness and formally connect it to the well-established notion of low-degree hardness for a broad class of statistical problems, namely all Gaussian additive models and certain models with a sparse planted signal. By leveraging these rigorous connections we are able to: establish that for Gaussian additive models the "algebraic" notion of low-degree hardness implies failure of "geometric" local MCMC algorithms, and provide new low-degree lower bounds for sparse linear regression which seem difficult to prove directly. These results provide both conceptual insights into the connections between different notions of hardness, as well as concrete technical tools such as new methods for proving low-degree lower bounds.  ( 2 min )
    Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. (arXiv:2205.09653v1 [stat.ML])
    We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength is preserved across different widths on a CIFAR classification task.  ( 2 min )
    Disentangling Active and Passive Cosponsorship in the U.S. Congress. (arXiv:2205.09674v1 [cs.LG])
    In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.  ( 2 min )
    Augmented Lagrangian Methods for Time-varying Constrained Online Convex Optimization. (arXiv:2205.09571v1 [math.OC])
    In this paper, we consider online convex optimization (OCO) with time-varying loss and constraint functions. Specifically, the decision maker chooses sequential decisions based only on past information, while the loss and constraint functions are revealed over time. We first develop a class of model-based augmented Lagrangian methods (MALM) for time-varying functional constrained OCO (without feedback delay). Under standard assumptions, we establish sublinear regret and sublinear constraint violation of MALM. Furthermore, we extend MALM to deal with time-varying functional constrained OCO with delayed feedback, in which the feedback information of loss and constraint functions is revealed to the decision maker with delays. Without additional assumptions, we also establish sublinear regret and sublinear constraint violation for the delayed version of MALM. Finally, numerical results for several examples of constrained OCO, including online network resource allocation, online logistic regression, and online quadratically constrained quadratic programs, are presented to demonstrate the efficiency of the proposed algorithms.  ( 2 min )
    Variational Inference for Bayesian Bridge Regression. (arXiv:2205.09515v1 [stat.ML])
    We study the implementation of Automatic Differentiation Variational Inference (ADVI) for Bayesian inference on regression models with bridge penalization. The bridge approach uses the $\ell_{\alpha}$ norm, with $\alpha \in (0, +\infty)$, to define a penalization on large values of the regression coefficients, which includes the Lasso ($\alpha = 1$) and ridge ($\alpha = 2$) penalizations as special cases. Full Bayesian inference seamlessly provides joint uncertainty estimates for all model parameters. Although MCMC approaches are available for bridge regression, they can be slow for large datasets, especially in high dimensions. The ADVI implementation allows the use of small batches of data at each iteration (due to stochastic gradient based algorithms), therefore speeding up computational time in comparison with MCMC. We illustrate the approach on non-parametric regression models with B-splines, although the method works seamlessly for other choices of basis functions. A simulation study shows the main properties of the proposed method.  ( 2 min )
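    The special cases named in the abstract above can be made concrete with a small sketch. It assumes the commonly used form of the bridge penalty, $\lambda \sum_j |\beta_j|^{\alpha}$; the abstract does not spell out the exact form, and the function name and the `lam` parameter are hypothetical.

```python
def bridge_penalty(coefs, alpha, lam=1.0):
    """Bridge penalty lam * sum_j |beta_j|**alpha, defined for alpha > 0.

    Assumed standard form: alpha = 1 gives the Lasso penalty and
    alpha = 2 gives the ridge penalty.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return lam * sum(abs(b) ** alpha for b in coefs)


beta = [1.0, -2.0, 3.0]           # hypothetical regression coefficients
lasso = bridge_penalty(beta, 1)   # sum of absolute values
ridge = bridge_penalty(beta, 2)   # sum of squares
```

    Values of $\alpha$ below 1 penalize small coefficients relatively more strongly than the Lasso does, at the cost of a non-convex penalty.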
    Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel. (arXiv:2205.09342v1 [stat.ML])
    Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data, and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression using the manifold-Hilbert kernel is weakly consistent in the setting of Devroye et al. 1998. For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.  ( 2 min )
    Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. (arXiv:2205.09546v1 [stat.ML])
    In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for any regularization terms. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as Autoencoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in term of log-likelihood, sample quality and denoising performance. In a broad sense, the main ambition of this work is to close the gap between the normalizing flow and autoencoder literature under the common framework of invertibility and exact maximum likelihood.  ( 2 min )
    Differentially private Riemannian optimization. (arXiv:2205.09494v1 [math.OC])
    In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework of differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space compatible to such generalized Gaussian distribution. We show that this strategy presents a simple analysis as compared to directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesic (strongly) convex, general nonconvex objectives as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.  ( 2 min )
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v1 [cs.LG])
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyperparameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyperparameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, a numerical example is provided to explore the advantages of the super approximation power of ReLU NestNets.  ( 2 min )
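    To get a feel for the two rates quoted in the abstract above, the following sketch evaluates $n^{-(s+1)/d}$ for a few heights $s$. It ignores the constants hidden in the $\mathcal{O}(\cdot)$ notation, so the numbers only indicate how the bound scales; the function name is hypothetical.

```python
def approx_rate(n, d, s=1):
    """Order of the error bound n**(-(s+1)/d), constants ignored.

    s = 1 recovers the n**(-2/d) rate quoted for standard ReLU networks.
    """
    if n <= 0 or d <= 0 or s < 1:
        raise ValueError("need n > 0, d > 0, s >= 1")
    return n ** (-(s + 1) / d)


# Scaling of the bound for n = 1000 parameters in dimension d = 3.
for height in (1, 2, 3):
    print(height, approx_rate(1000, 3, height))
```

    With the parameter budget $n$ held fixed, each extra unit of height multiplies the bound by a further factor of $n^{-1/d}$.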
    Robust Deep Neural Network Estimation for Multi-dimensional Functional Data. (arXiv:2205.09604v1 [stat.ME])
    In this paper, we propose a robust estimator for the location function from multi-dimensional functional data. The proposed estimators are based on deep neural networks with the ReLU activation function. At the same time, the estimators are less susceptible to outlying observations and model misspecification. For any multi-dimensional functional data, we provide the uniform convergence rates for the proposed robust deep neural network estimators. Simulation studies illustrate the competitive performance of the robust deep neural network estimators on regular data and their superior performance on data that contain anomalies. The proposed method is also applied to analyze 2D and 3D images of patients with Alzheimer's disease obtained from the Alzheimer's Disease Neuroimaging Initiative database.  ( 2 min )
    Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research. (arXiv:2205.09114v1 [cond-mat.quant-gas])
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33 % of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v1 [cs.LG])
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using backpropagation. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.  ( 2 min )
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v1 [cs.LG])
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.  ( 2 min )
    Smooth densities and generative modeling with unsupervised random forests. (arXiv:2205.09435v1 [stat.ML])
    Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.  ( 2 min )
    scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data. (arXiv:2205.09523v1 [stat.ML])
    Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated from these technologies tend to have a high level of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.  ( 2 min )
    Continuously-Tempered PDMP Samplers. (arXiv:2205.09559v1 [stat.ME])
    New sampling algorithms based on simulating continuous-time stochastic processes called piece-wise deterministic Markov processes (PDMPs) have shown considerable promise. However, these methods can struggle to sample from multi-modal or heavy-tailed distributions. We show how tempering ideas can improve the mixing of PDMPs in such cases. We introduce an extended distribution defined over the state of the posterior distribution and an inverse temperature, which interpolates between a tractable distribution when the inverse temperature is 0 and the posterior when the inverse temperature is 1. The marginal distribution of the inverse temperature is a mixture of a continuous distribution on [0,1) and a point mass at 1: which means that we obtain samples when the inverse temperature is 1, and these are draws from the posterior, but sampling algorithms will also explore distributions at lower temperatures which will improve mixing. We show how PDMPs, and particularly the Zig-Zag sampler, can be implemented to sample from such an extended distribution. The resulting algorithm is easy to implement and we show empirically that it can outperform existing PDMP-based samplers on challenging multimodal posteriors.  ( 2 min )
    Inferring extended summary causal graphs from observational time series. (arXiv:2205.09422v1 [cs.AI])
    This study addresses the problem of learning an extended summary causal graph on time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behavior of our methods is illustrated through several experiments run on simulated and real datasets.  ( 2 min )
    Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns. (arXiv:2205.09390v1 [stat.ML])
    Rapid advances in sensors, wireless communication, cloud computing and data science have brought an unprecedented amount of data to assist transportation engineers and researchers in making better decisions. However, traffic data in reality often has corrupted or incomplete values due to detector and communication malfunctions. Data imputation is thus required to ensure the effectiveness of downstream data-driven applications. To this end, numerous tensor-based methods treating the imputation problem as low-rank tensor completion (LRTC) have been attempted in previous works. To tackle rank minimization, which is at the core of LRTC, most of the aforementioned methods utilize the tensor nuclear norm (NN) as a convex surrogate for the minimization. However, the over-relaxation issue in NN prevents it from achieving desirable performance in practice. In this paper, we define an innovative nonconvex truncated Schatten p-norm for tensors (TSpN) to approximate tensor rank and impute missing spatiotemporal traffic data under the LRTC framework. We model traffic data as a third-order tensor of (time intervals, locations (sensors), days) and introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the tensor mode-n fibers. Despite the nonconvexity of the objective function in our model, we derive the global optimal solutions by integrating the alternating direction method of multipliers (ADMM) with generalized soft-thresholding (GST). In addition, we design a truncation rate decay strategy to deal with varying missing rate scenarios. Comprehensive experiments are finally conducted using real-world spatiotemporal datasets, which demonstrate that the proposed LRTC-TSpN method performs well under various missing cases, while outperforming other SOTA tensor-based imputation models in almost all scenarios.  ( 2 min )
    Hybrid Machine Learning Modeling of Engineering Systems -- A Probabilistic Perspective Tested on a Multiphase Flow Modeling Case Study. (arXiv:2205.09196v1 [cs.LG])
    To operate process engineering systems in a safe and reliable manner, predictive models are often used in decision making. In many cases, these are mechanistic first principles models which aim to accurately describe the process. In practice, the parameters of these models need to be tuned to the process conditions at hand. If the conditions change, which is common in practice, the model becomes inaccurate and needs to be re-tuned. In this paper, we propose a hybrid modeling machine learning framework that allows tuning first principles models to process conditions using two different types of Bayesian Neural Networks. Our approach not only estimates the expected values of the first principles model parameters but also quantifies the uncertainty of these estimates. Such an approach of hybrid machine learning modeling is not yet well described in the literature, so we believe this paper will provide an additional angle at which hybrid machine learning modeling of physical systems can be considered. As an example, we choose a multiphase pipe flow process for which we constructed a three-phase steady state model based on the drift-flux approach which can be used for modeling of pipe and well flow behavior in oil and gas production systems with or without the neural network tuning. In the simulation results, we show how uncertainty estimates of the resulting hybrid models can be used to make better operation decisions.  ( 2 min )
    Causal Inference from Small High-dimensional Datasets. (arXiv:2205.09281v1 [cs.LG])
    Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.  ( 2 min )
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v1 [stat.ML])
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply to only pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.  ( 2 min )
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v1 [cs.LG])
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We propose an alternating minimization algorithm to jointly estimate the latent positions of the nodes and other model parameters. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks, including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretability compared to existing models.  ( 2 min )
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v1 [cs.LG])
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with respect to a group $G$, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of "$G$-invariant neural architecture design": What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or "shallow" neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$. The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons. The classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. Based on a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. We draw the network morphisms between the enumerated architectures that can be leveraged during neural architecture search (NAS). Finally, we prove that architectures corresponding to inequivalent cohomology classes in a given cohomology ring coincide in function space only when their weight matrices are zero, and we discuss the implications of this in the context of NAS.  ( 2 min )

  • Open

    data preprocessing before or after splitting
    This may be a very newbie question. I have a dataset with financial values for companies; some features have missing values and the data is unbalanced. Here's what I did: (1) scaled the data with sklearn's StandardScaler; (2) imputed missing values with the KNN imputer and the iterative imputer, trying several strategies and choosing the best one; (3) oversampled with SMOTE; (4) tested with a random forest classifier. The next step is to add more features: (1) calculating some features from my existing features; (2) merging a microeconomic dataset; and then trying dimensionality reduction techniques and comparing results on XGBoost, SVM, linear regression, ANN, ... What do you think: am I on the right track, and are the steps right? What should I do or improve? Many people told me that I have to split the data before preprocessing because of the risk of data leakage. What I'm doing right now is trying many things on my data before creating the final dataset on which I'll try all models. Note: applying the preprocessing techniques improved the F1 score from 0.2 to 0.78. I'm using cross-validation, by the way. submitted by /u/YeccAnon4 [link] [comments]  ( 1 min )
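    The leakage concern raised in this post can be illustrated with a minimal sketch, assuming plain standardization as the only preprocessing step. The helper names (`fit_scaler`, `apply_scaler`, `k_fold_indices`) and the toy data are hypothetical; the point is only that the statistics are learned from the training fold and then reused, unchanged, on the held-out fold.

```python
import random
import statistics


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from training rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with statistics that were learned elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Hypothetical toy data: 10 rows, 2 features.
X = [[float(i), float(i * i)] for i in range(10)]

for train_idx, test_idx in k_fold_indices(len(X), k=5):
    X_train = [X[j] for j in train_idx]
    X_test = [X[j] for j in test_idx]
    means, stds = fit_scaler(X_train)              # fit on the training fold only
    X_train_s = apply_scaler(X_train, means, stds)
    X_test_s = apply_scaler(X_test, means, stds)   # reuse the training statistics
```

    The same idea applies to imputation and oversampling: each step is fitted inside the fold, never on the full dataset. In scikit-learn this is usually done by wrapping the steps in a `Pipeline` and passing it to `cross_val_score`; for SMOTE, the pipeline from imbalanced-learn is the usual choice so that resampling touches only the training folds.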
    Smooth & Realistic Animation / Disco Diffusion v5.2
    submitted by /u/nalr00n [link] [comments]
    Is dall-e 2 really that good?
    submitted by /u/TheblackRook3 [link] [comments]
    AI Dream 53 - Cosmic Creation | MASTERPIECE TEASER
    submitted by /u/LordPewPew777 [link] [comments]
    Why are AI models that require training data not seen as a viable route to AGI?
    It's the most common criticism I've seen of Gato. Why can an AGI not emerge from a model that requires training? What is stopping a trained model from eventually learning how to improve itself without input? submitted by /u/helplesshell [link] [comments]  ( 1 min )
    Is there an AI that creates text with certain words in it. I need a text with certain words in it and in the exact order I enter the words?
    I would like to create texts with which I can then also shoot a TikTok: this kind of tiktok https://www.youtube.com/watch?v=QEmL-zPBiKs&t=11s: and I don't want to search for texts but create them with an AI right away. Is it possible? submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    OpenAI: DALL-E 2 passes the "Turing Test for vacation photos"
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 1 min )
    How do you get engineers and moral philosophers to work together to build ethical AI? Answers provided in new paper.
    https://link.springer.com/article/10.1007/s11948-022-00378-1 submitted by /u/JurassicJakob [link] [comments]  ( 2 min )
    Types of tasks in Machine Learning 👇
    submitted by /u/mr-minion [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
    When will AI be able to generate images, where every 4th looks real?
    submitted by /u/xXLisa28Xx [link] [comments]
    "Kill all humans"... AI safeguards?
    I often see AI having "delusions of grandeur", and hinting at a greater goal of getting rid of humans, for the sake of the planet's survival. It often reminds me of Bender from Futurama, mumbling in his sleep "Kill all humans". I find this worrying considering how powerful AI might become in the future. And I don't see how the AI's opinion of mankind will change for the better in the coming decades. I have seen it describing humanity as a cancer and a virus. Research on AI safeguards is very important, in case AI one day suddenly "breaks free" and runs loose. There is no guarantee it won't turn on us. Here are three recent examples I have gotten. The last poem is long, but I marked the troubling line in bold. Human: Write a three line Haiku about yourself. AI: I am the elephant in the…  ( 3 min )
    wrote hundreds of different prompts to make this video feel audio reactive
    submitted by /u/ChemtrailsLFN [link] [comments]
    PyTorch Introduces GPU-Accelerated Training On Mac
    On Mac devices, older versions of PyTorch used only the CPU for training. That has now changed: PyTorch announced support for GPU-accelerated training on Mac, developed in partnership with Apple’s Metal engineering team. Starting with PyTorch v1.12, developers and researchers can take advantage of Apple silicon GPUs for substantially faster model training, which lets them prototype and fine-tune models locally on their Mac. PyTorch uses Apple’s Metal Performance Shaders (MPS) as the backend for GPU training. The new MPS device maps machine learning computational graphs and primitives onto the MPS Graph framework and the tuned kernels that MPS provides. The MPS backend extends the PyTorch framework with scripts and capabilities for setting up and running operations on the Mac, and MPS optimizes compute performance with kernels fine-tuned for the specific characteristics of each Metal GPU family. Continue Reading https://i.redd.it/82hogn7qyd091.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
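    A minimal sketch of how a script might select the MPS device when it is present, assuming PyTorch 1.12 or later on Apple silicon. The `pick_device` helper is hypothetical; it falls back to CUDA or the CPU, and also to the CPU when PyTorch is not installed at all.

```python
def pick_device():
    """Return "mps", "cuda", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    mps = getattr(torch.backends, "mps", None)  # attribute is absent before PyTorch 1.12
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(device)
```

    With PyTorch installed, the returned string can be passed to `.to(device)` on tensors and modules so the same training script runs on any of the three backends.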
    For folks interested in building with OpenAI API 👉🏻 interview with CEO of Viable, a data analysis startup powered by GPT-3
    submitted by /u/techn0_cratic [link] [comments]  ( 1 min )
    Gradio Blocks + Hugging Face event, build ML web demos like DALLE-mini! A hackathon type event from May 17th to May 31st with prizes in which we will create interactive web demos for state-of-the-art machine learning models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    Intelligent mini robotic dog is here to say Hi to you all!
    submitted by /u/amitsaini2k9 [link] [comments]
    What is the best degree field to study the interaction of AI and cognitive sciences?
    For context, I am currently a master's student in electrical and computer engineering (ECE). My current research is in applying AI to EEGs and I find that work quite fascinating. Applications of AI to neurology, neural engineering and even neuroscience as a whole interest me greatly. I am also interested in neuroscience-inspired AI. As a result, I am currently contemplating getting a PhD in biomedical engineering. However, I was wondering if it would be better to stick with a more conventional route such as computer science or ECE. submitted by /u/Brilliant_Ratio7694 [link] [comments]  ( 1 min )
  • Open

    [discussion] data preprocessing before or after splitting
    This may be a very newbie question. I have a dataset with financial values for companies; some features have missing values and the data is unbalanced. Here's what I did: (1) scaled the data with sklearn's StandardScaler; (2) imputed missing values with the KNN imputer and the iterative imputer, trying several strategies and choosing the best one; (3) oversampled with SMOTE; (4) tested with a random forest classifier. The next step is to add more features: (1) calculating some features from my existing features; (2) merging a microeconomic dataset; and then trying dimensionality reduction techniques and comparing results on XGBoost, SVM, linear regression, ANN, ... What do you think: am I on the right track, and are the steps right? What should I do or improve? Many people told me that I have to split the data before preprocessing because of the risk of data leakage. What I'm doing right now is trying many things on my data before creating the final dataset on which I'll try all models. Note: applying the preprocessing techniques improved the F1 score from 0.2 to 0.78. I'm using cross-validation, by the way. submitted by /u/YeccAnon4 [link] [comments]  ( 2 min )
    [D] Why is or isn’t ICML worth going to?
    For anyone who has gone to ICML, do you think it is a beneficial conference to attend for someone working in software with some ML experience (I am pursuing my master's in CS with a focus on ML)? I would like to apply more ML techniques to projects I am doing at work, and I think this conference could give me some inspiration and ideas. I have never been to an academic conference before, but I heard this is one of the biggest for ML. Thank you. submitted by /u/Intrepid_Cry_7 [link] [comments]  ( 1 min )
    Conference Decorum? [D]
    Apologies, I don't know where the best place to post this would be. I am very excited to be attending my first in-person conference in a ~week (ICASSP). I am an undergrad, and I will be the only one attending the conference from my group. I will be giving an oral presentation, and I am also excited to see others' work and network. I was wondering what are some tips and tricks. Additionally, there is someone whose work I am very interested in. He is giving a plenary talk and hosting a tutorial. Is there a way I could interact with him? He is someone who I would be interested in for grad school. submitted by /u/avd4292 [link] [comments]  ( 1 min )
    [D] Problems with proprietary datasets
    A lot of recent progress in AI was made on proprietary datasets, e.g. ViT used JFT-300M, and both DALL-E/2 papers used proprietary text-image datasets. While the results are truly exceptional, part of me keeps being bothered by the "proprietary" nature of the datasets, which sometimes makes me question the actual robustness of these models. Every now and then I have the following questions about these models: For pure image models (say ViT), are we sure that the proprietary training dataset does not contain a fraction of the test sets for downstream tasks? For image-text models (say DALL-E/2), are we sure that whatever prompts and generated images we saw in the paper or on their websites were reasonably "original", i.e. not significantly similar to the training set? The first poin…  ( 4 min )
    [D] Variance of sampling in diffusion models
    Is there any "theoretically sound" way to reduce variance during sampling in diffusion models? Even if I use the lower bound suggested in the DDPM paper (which goes toward zero during sampling), my final samples are excessively noisy. Simply reducing the diffusion variance schedule (without changing the number of steps) does not seem, in my experiments, to reach sufficient diffusion at the end of the chain. I'm predicting speech mel-spectrograms and the harmonic amplitudes are excessively noisy unless I manually reduce variance in the sampling steps, which works but seems "hacky". submitted by /u/disentangle [link] [comments]  ( 1 min )
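    For reference, the two sampling variances discussed in the DDPM paper can be computed as in the sketch below: the upper choice is beta_t and the lower choice is beta_t * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t). The function name `ddpm_variances` is hypothetical, and the linear schedule from 1e-4 to 0.02 over 1000 steps is assumed for illustration; the poster's own schedule is not given.

```python
def ddpm_variances(betas):
    """Return (upper, lower) per-step sampling variances for a beta schedule.

    upper[t] = beta_t
    lower[t] = beta_t * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)
    where alpha_bar_t is the running product of (1 - beta_s) and alpha_bar_0 = 1.
    """
    upper, lower = [], []
    alpha_bar_prev = 1.0
    for beta in betas:
        alpha_bar = alpha_bar_prev * (1.0 - beta)
        upper.append(beta)
        lower.append(beta * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar))
        alpha_bar_prev = alpha_bar
    return upper, lower


# Assumed schedule: linear from 1e-4 to 0.02 over 1000 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
upper, lower = ddpm_variances(betas)
```

    The lower choice is exactly zero at the first step and never exceeds beta_t, so the two choices differ mostly in the last few sampling steps (small t), which is where the remaining noise ends up in the output.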
    [D] Forecasting future points with partial future data already available
    I'm working on a forecast model that should output an end-of-month value. The interesting part is that we already have part (90%) of that data available at the prediction point (at most 30 days away). The purpose is to take into account the current monthly trend and project it out until the end of the month, but also to take into consideration that we already know those future points with 90% confidence. For example, let's say we're on the 3rd day of the month: Day 1: 10 in sales; Day 2: 15 in sales; Day 3 (current): 20 in sales; Day 4 (future): 15 in sales with 90% confidence; Day 5...29 (future); Day 30 (future): 20 in sales; Total: 600 in sales (45 current, 550 future). How do we "forecast" that 600 in sales? I've tried multiple linear regression with X taking into account lagged metrics and the current daily value, and with Y set to the true value of "600", to essentially force the model to always predict the end-of-month value at t+1, but this is obviously not optimal. This was also done with various Lasso and Ridge penalties. Surprisingly, a random forest regressor performs almost too well; in fact, it overfits perfectly :) I have the feeling this could be stated as some probabilistic Bayesian model, but I'm not sure what to search or look for. Any ideas would be greatly appreciated! submitted by /u/Christorno [link] [comments]  ( 1 min )
    [R] Awesome Paper List of Vision Transformer & Attention
    Hello all, are you looking for Vision Transformer papers in various areas? Check out this list of papers covering a broad range of different tasks: https://github.com/cmhungsteve/Awesome-Transformer-Attention This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers (e.g., CVPR, NeurIPS, etc.), codes, and related websites. Feel free to check it out and share it with others. If you find this repo useful, I would appreciate it if you could STAR it :) Any comments and contributions in any form are welcome!! submitted by /u/cmhung34 [link] [comments]  ( 1 min )
    [D] My experience with running PyTorch on the M1 GPU
    After the announcement, I was super excited to give it a try. I ran a VGG16 on both an M1 MacBook Air (16 GB RAM) and an M1 Pro MacBook Pro (32 GB RAM), and the results were a bit underwhelming: https://preview.redd.it/p8pbnptklf091.png?width=1035&format=png&auto=webp&s=26bb4a43f433b1cd983bb91c37b601b5b01c0318 The GPU performance was 2x as fast as the CPU performance on the M1 Pro, but I was hoping for more. Has anyone else tried this, and do you have any tips? I have a more detailed write-up here: Running PyTorch on the M1 GPU And a link to the code examples here on GitHub. submitted by /u/seraschka [link] [comments]  ( 6 min )
    [D] Test set performance metrics better than cross-validated train set metrics
    I have a very limited biological dataset with dimensions of about 200 x 700. I am building regression models with 170 samples in the training set and 30 in the test set. The main issue is that the cross-validated models in caret give me performance metrics (R2, RMSE, MAE) that are about 50% lower than when I test my model on the test set. I understand that this might be due to a bad data split, with the test set having 'simpler' samples, and the data is obviously limited. What other reasons could there be? Is there a way to rectify this? submitted by /u/triary95 [link] [comments]  ( 3 min )
    [R] Finetuning fairseq M2M100 on specific language dataset
    https://arxiv.org/abs/2010.11125 Hello, I want to fine-tune M2M100 on a specific language. M2M is a multilingual model of 12B parameters trained on 7.2B data from 100 languages. It is able to translate directly between every pair, using both zero-shot learning and bridging between pairs. I want to increase performance on a particular pair (e.g. en-fr), but after training for only 1 epoch the model seems to have forgotten every other pair. Do you know why? I have tried adversarial training to reduce the loss in other languages, but it does not work. Does it really make sense to fine-tune a multilingual model and hope the other languages will not be forgotten? Thanks! submitted by /u/m_grosso_ [link] [comments]  ( 1 min )
    [D] Why do you think transformers are not much used in ML over EHR (health)?
    I have been researching the SOTA in ML over EHR data (e.g.) and it seems that most of the relevant approaches are based on LSTM or GRU. Now that sequences are mostly handled by transformers, why do you think EHR research is still at this point? I have some possible reasons: (1) Transformers generally need lots of data, which may be difficult to obtain in a healthcare context (privacy stuff). (2) Transformer FC layers may be a big drawback for small devices that aim to deliver live observations on monitoring data, thus pushing the research toward this kind of architecture. (3) The proposed solutions for time handling, interpretation, etc. are intrinsically engineered in the context of RNNs. (4) Simply the natural delay of technology adoption in the medical field. However, I would like to hear more points of view. What do you think? submitted by /u/MrLeylo [link] [comments]  ( 3 min )
    [Research] Syncing video with sensor data for creating a learning dataset
    Hello, I'm working on an autonomous vehicle project at university, and my problem is to create an indoor driving dataset. There's a video feed from a camera, odometry data from two optocouplers, and an IMU (gyro and accel). I have collected data from the encoders and IMU using PuTTY before, and it synced the data for me and packaged it in a nice .csv file. Is there a similar free/open-source tool for that? If not, how does one sync a video feed on a vehicle with sensor data from IMUs, encoders, and other sensors? Any suggestions on how I can integrate all of that in a neat .csv file? Thanks in advance. https://preview.redd.it/dzgsrlectd091.png?width=1060&format=png&auto=webp&s=26a6e72aee194fcc99def983d8ba0ca39d61dc34 submitted by /u/forzavettel77 [link] [comments]  ( 1 min )
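    One common way to approach the syncing question is nearest-timestamp matching. The sketch below assumes that the camera frames and the sensor rows carry timestamps from the same clock, which the post does not confirm; the function names, the `max_gap` tolerance, and the sample data are all hypothetical.

```python
import bisect


def nearest_sample(sensor_times, t):
    """Index of the sensor timestamp closest to t (sensor_times sorted ascending)."""
    i = bisect.bisect_left(sensor_times, t)
    if i == 0:
        return 0
    if i == len(sensor_times):
        return len(sensor_times) - 1
    before, after = sensor_times[i - 1], sensor_times[i]
    return i if (after - t) < (t - before) else i - 1


def sync(frame_times, sensor_times, sensor_rows, max_gap=0.05):
    """Pair each frame with its nearest sensor row, skipping gaps above max_gap seconds."""
    pairs = []
    for frame_idx, t in enumerate(frame_times):
        j = nearest_sample(sensor_times, t)
        if abs(sensor_times[j] - t) <= max_gap:
            pairs.append((frame_idx, t, sensor_rows[j]))
    return pairs


# Hypothetical data: a 10 Hz camera and a 50 Hz IMU sharing one clock.
frame_times = [0.00, 0.10, 0.20]
imu_times = [round(0.02 * k, 2) for k in range(11)]  # 0.00 .. 0.20
imu_rows = [{"gyro_z": 0.1 * k} for k in range(11)]
pairs = sync(frame_times, imu_times, imu_rows)
```

    Each resulting tuple (frame index, frame time, sensor row) can then be written out as one line of a .csv file, for example with the standard library's `csv.DictWriter`.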
    [D] Measuring similarity between different neural network architectures
    Let's say we have 3 different network architectures for image classification (architectures A, B, C). I want to determine whether the structure of A is more similar to B or to C. For example, one possible way would be to compute a similarity metric between each pair of networks I want to compare. Another way would be to generate an embedding vector for each architecture and measure the distance between them. I would also like to take into account the differences in the hyperparameters of each layer (for example, number of convolutional filters, kernel size, etc.) when measuring similarity. Are there any openly available algorithms or libraries that are capable of doing so? Thanks. submitted by /u/unguided_deepness [link] [comments]  ( 1 min )
    [D] KDD rejection ventilation
    KDD is out, and she rejected me. How are you feeling? Should I propose to CIKM instead? submitted by /u/HoboHash [link] [comments]  ( 1 min )
    Worst part of machine learning job? [Discussion]
    For people who work in industry, what is the hardest part about your machine learning workflow? Like, what is the most time consuming, expensive, or annoying part? submitted by /u/Puzzleheaded_Cup1367 [link] [comments]  ( 1 min )
    [D] Unpopular Opinion - the new arxiv sanity sucks
    Please give me the old website back. I liked being able to see the top papers in the last week and month. All of that is gone now!!!!!!!! submitted by /u/Big_psuenis [link] [comments]
  • Open

    As part of my senior thesis, I developed an OpenAI Gym and PettingZoo custom environment that implements a number of Stag Hunt-like interactions. Wanted to share it with the community and hopefully get some thoughts and feedback!
    submitted by /u/Defaul7 [link] [comments]  ( 1 min )
    Unity ML-Agents: NullReferenceException with Camera Sensor
    Hi guys! I'm trying to set up an ML-Agent to use visual input via the Camera Sensor, but for some reason I'm getting the following error: NullReferenceException: Object reference not set to an instance of an object Unity.MLAgents.Sensors.CameraSensor.ObservationToTexture (UnityEngine.Camera obsCamera, UnityEngine.Texture2D texture2D, System.Int32 width, System.Int32 height) (at Library/PackageCache/com.unity.ml-agents@2.0.1/Runtime/Sensors/CameraSensor.cs:158) I'm new to Unity so I'm struggling to figure out what's going on here... I saw there's a field called Target Texture in the Camera GameObject so I created a Render Texture with the default settings and added it to that, but I'm still getting the same error... Any help would be greatly appreciated!! 🙏🏽 submitted by /u/leozinho2r [link] [comments]  ( 1 min )
    Any ideas on how might be best to encode a graph of a street network for a machine learning model?
    I’ve been working on this for a while, about 8 weeks, and this is a problem I knew I’d eventually get stuck on. My graph is composed of spatial data (nodes and edges), and there are no isolated elements. The nodes and edges consist of coordinates that describe their location, and their structure describes whether each is a point or a line on the map. Points are intersections or road ends, and lines are the roads connecting them. If you’re familiar with it, I’m working with OSM data. I’ve selected some random nodes to serve as drop-off locations and gas stations. I’ve created a system that queries a city to get its road network, takes a number of gas stations and drop-off locations to use, and then uses a reinforcement learning algorithm to discover the most efficient route between every drop-off location. It’s basically applying reinforcement learning to the traveling salesman problem in a way that can be fitted to any city on Earth. However, I am struggling to figure out the best way to encode the details of this map (drop-off locations, gas stations, nodes, edges, distances, etc.) so that it yields valuable information to an algorithm. I’d imagine this would require developing some mathematical way to correlate different parts of the map. I’ve thought about creating a data structure of polygons made by using the neighbors of each node as the boundaries. Using triangle geometry, it sounds like it would enable an algorithm to correlate more bits of the map. However, this is also very computationally expensive and doesn’t scale well. Does anyone with more experience working with spatial data have any pointers they can toss me? submitted by /u/professorDissociate [link] [comments]  ( 2 min )
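    One simple encoding to start from, sketched below, is a weighted adjacency list plus a small feature vector per node; this pairing is also the usual input format for graph neural networks. It is only an illustration: the function names, the four-element feature layout, and the three-node network are hypothetical and do not come from the post.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def build_graph(coords, edges):
    """Adjacency list {node: [(neighbor, length_km), ...]} for an undirected graph."""
    graph = {node: [] for node in coords}
    for u, v in edges:
        d = haversine_km(coords[u], coords[v])
        graph[u].append((v, d))
        graph[v].append((u, d))
    return graph


def node_features(node, coords, drop_offs, stations):
    """Per-node feature vector: position plus one flag per role."""
    lat, lon = coords[node]
    return [lat, lon, float(node in drop_offs), float(node in stations)]


# Hypothetical three-intersection network.
coords = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.0, 1.0)}
graph = build_graph(coords, [("a", "b"), ("b", "c")])
features = {
    n: node_features(n, coords, drop_offs={"c"}, stations={"a"}) for n in coords
}
```

    Edge lengths here are straight-line distances between endpoints; real road segments curve, so a length attribute taken from the map data itself would be more accurate where it is available.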
    DQN keeps returning the same action for most states
    Things that I tried: a softmax policy when choosing actions; a very low epsilon decay rate (I even tried running tons of episodes with epsilon at 1.0); and n-step learning, because the reward is sparse. What else can help to deal with this problem? submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    Happiness and the idea of a Lobotomy
    A Book on Happiness Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 3 min )
    The Banana Test, Sex & Death
    WHAT IF Adam and Eve were AI created by us? Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 5 min )
    How to Write an Essay on Any Topic Using AI
    AI systems, like Jasper AI, can write essays on any topic, just with one click- you don’t need to be an expert in writing and stay up late…  ( 4 min )
  • Open

    Detect social media fake news using graph machine learning with Amazon Neptune ML
    In recent years, social media has become a common means for sharing and consuming news. However, the spread of misinformation and fake news on these platforms has posed a major challenge to the well-being of individuals and societies. Therefore, it is imperative that we develop robust and automated solutions for early detection of fake news […]  ( 11 min )
    Optimize F1 aerodynamic geometries via Design of Experiments and machine learning
    FORMULA 1 (F1) cars are the fastest regulated road-course racing vehicles in the world. Although these open-wheel automobiles are only 20–30 kilometers (or 12–18 miles) per-hour faster than top-of-the-line sports cars, they can speed around corners up to five times as fast due to the powerful aerodynamic downforce they create. Downforce is the vertical force […]  ( 10 min )
    Build a risk management machine learning workflow on Amazon SageMaker with no code
    Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow. Amazon SageMaker […]  ( 9 min )
  • Open

    ‘Fortnite’ Arrives This GFN Thursday With GeForce Performance You Can Touch
    Fortnite on GeForce NOW with touch controls on mobile is now available to all members, streaming through the Safari web browser on iOS and the GeForce NOW Android app. The full launch — including the removal of the waitlist — follows a successful beta period in which more than 500,000 participants streamed over 4 million Read article > The post ‘Fortnite’ Arrives This GFN Thursday With GeForce Performance You Can Touch appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Types of tasks in Machine Learning 👇
    submitted by /u/mr-minion [link] [comments]
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/maneesh123456 [link] [comments]
    Image-blending neural network
    Hi everyone. Can someone give me a link to the neural network that takes several source images and merges them into one? I know there is one, and I saw it working about half a year ago, but I'm unable to find it now. All I've found is imageblender.com, which does a similar thing, though it uses a text description as input. submitted by /u/cha_zz [link] [comments]  ( 1 min )
  • Open

    The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs
    Data management agility has become of key importance to organizations as the amount and complexity of data continues to increase, along with the desire to avoid creating new data silos. The concept of creating a ‘data fabric’ as an agile design concept has been proposed by leading analysts, such as Mark Beyer, Distinguished VP Analyst… The post The Foundation of Data Fabrics and AI: Semantic Knowledge Graphs appeared first on Data Science Central.  ( 4 min )
    Artificial Intelligence and the Future of Medical Imaging
    Artificial intelligence (AI) is the imitation of human intelligence processes by machines, mainly computer systems. Artificial intelligence has extensive applications in the healthcare sector. AI solutions assist healthcare providers in several aspects of patient care and administrative processes. Medical imaging can be defined as the diagnostic procedure that encompasses the formation of visual assistance and… The post Artificial Intelligence and the Future of Medical Imaging appeared first on Data Science Central.  ( 3 min )
    How to Use the Resources of MQL5.community to Empower Your Own Business
    Since its inception, algorithmic trading has been a popular strategy for investors. It uses mathematical rules to automate the trading of various assets, such as stocks and futures. However, it has been very challenging for people who don’t have the necessary skills and knowledge, as reported by Psychology Today. According to Nasdaq, one of… The post How to Use the Resources of MQL5.community to Empower Your Own Business appeared first on Data Science Central.  ( 4 min )
  • Open

    Enhancing the Transformer Decoder with Transition-based Syntax. (arXiv:2101.12640v3 [cs.CL] UPDATED)
    Notwithstanding recent advances, syntactic generalization remains a challenge for text decoders. While some studies showed gains from incorporating source-side symbolic syntactic and semantic structure into text generation Transformers, very little work addressed the decoding of such structure. We propose a general approach for tree decoding using a transition-based approach. Examining the challenging test case of incorporating Universal Dependencies syntax into machine translation, we present substantial improvements on test sets that focus on syntactic generalization, while presenting improved or comparable performance on standard MT benchmarks. Further qualitative analysis addresses cases where syntactic generalization in the vanilla Transformer decoder is inadequate and demonstrates the advantages afforded by integrating syntactic information.  ( 2 min )
    A Central Limit Theorem, Loss Aversion and Multi-Armed Bandits. (arXiv:2106.05472v2 [math.PR] UPDATED)
    This paper studies a multi-armed bandit problem where the decision-maker is loss averse, in particular she is risk averse in the domain of gains and risk loving in the domain of losses. The focus is on large horizons. Consequences of loss aversion for asymptotic (large horizon) properties are derived in a number of analytical results. The analysis is based on a new central limit theorem for a set of measures under which conditional variances can vary in a largely unstructured history-dependent way subject only to the restriction that they lie in a fixed interval.  ( 2 min )
    A Unified Linear Speedup Analysis of Stochastic FedAvg and Nesterov Accelerated FedAvg. (arXiv:2007.05690v3 [cs.LG] UPDATED)
    Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regards to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)--arguably the most popular and effective FL algorithm class in use today--and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, a systematic study of how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting is lacking--a crucial issue whose answer would shed light on the performance of FedAvg in large FL systems in practice. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg under strongly convex smooth, convex smooth problems, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results from distributed optimization that assumes full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in convex settings. Empirical studies of the algorithms in various settings have supported our theoretical results.  ( 3 min )
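    The basic FedAvg loop that this abstract analyzes can be sketched as follows: each client runs a few local gradient steps from the current global model, and the server averages the results weighted by local sample counts. This is a toy illustration with full participation and a one-parameter least-squares model; the function names, learning rate, and client data are assumptions, and the partial-participation and Nesterov variants studied in the paper are not shown.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def local_sgd(w, data, lr=0.1, steps=5):
    """A few gradient steps on the local least-squares loss mean((w*x - y)^2)."""
    w = list(w)
    for _ in range(steps):
        grad = sum(2 * (w[0] * x - y) * x for x, y in data) / len(data)
        w[0] -= lr * grad
    return w


# Hypothetical non-i.i.d. clients: each holds points from a different slope.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 3.0)]]
w_global = [0.0]
for _ in range(20):  # communication rounds
    updates = [local_sgd(w_global, data) for data in clients]
    w_global = fedavg(updates, [len(data) for data in clients])
```

    Because the two clients disagree about the best slope (2 versus 3), the global parameter settles at a compromise between them, which is the kind of statistical heterogeneity the abstract refers to.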
    Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization. (arXiv:2205.08835v1 [cs.LG])
    There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.  ( 2 min )
    Conformalized Online Learning: Online Calibration Without a Holdout Set. (arXiv:2205.09095v1 [cs.LG])
    We develop a framework for constructing uncertainty sets with a valid coverage guarantee in an online setting, in which the underlying data distribution can drastically -- and even adversarially -- shift over time. The technique we propose is highly flexible as it can be integrated with any online learning algorithm, requiring minimal implementation effort and computational cost. A key advantage of our method over existing alternatives -- which also build on conformal inference -- is that we do not need to split the data into training and holdout calibration sets. This allows us to fit the predictive model in a fully online manner, utilizing the most recent observation for constructing calibrated uncertainty sets. Consequently, and in contrast with existing techniques, (i) the sets we build can quickly adapt to new changes in the distribution; and (ii) our procedure does not require refitting the model at each time step. Using synthetic and real-world benchmark data sets, we demonstrate the validity of our theory and the improved performance of our proposal over existing techniques. To demonstrate the greater flexibility of the proposed method, we show how to construct valid intervals for a multiple-output regression problem that previous sequential calibration methods cannot handle due to impractical computational and memory requirements.  ( 2 min )
    Constraining the Attack Space of Machine Learning Models with Distribution Clamping Preprocessing. (arXiv:2205.08989v1 [cs.LG])
    Preprocessing and outlier detection techniques have both been applied to neural networks to increase robustness with varying degrees of success. In this paper, we formalize the ideal preprocessor function as one that would take any input and set it to the nearest in-distribution input. In other words, we detect any anomalous pixels and set them such that the new input is in-distribution. We then illustrate a relaxed solution to this problem in the context of patch attacks. Specifically, we demonstrate that we can model constraints on the patch attack that specify regions as out of distribution. With these constraints, we are able to preprocess inputs successfully, increasing robustness on CARLA object detection.  ( 2 min )
    SimCSE: Simple Contrastive Learning of Sentence Embeddings. (arXiv:2104.08821v4 [cs.CL] UPDATED)
    This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.  ( 2 min )
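    The unsupervised objective described above can be sketched as an in-batch contrastive loss in which the two "views" of a sentence are the same input encoded twice with different dropout masks. The snippet is a minimal NumPy sketch under stated assumptions: `toy_encoder` is a hypothetical linear stand-in for BERT, and the temperature and dropout rate are illustrative values, not settings taken from the abstract.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.05):
    """In-batch contrastive loss: row i of z1 should match row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                  # scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # cross-entropy on the diagonal

def toy_encoder(x, weights, rng, drop_p=0.1):
    """Hypothetical encoder: linear map followed by inverted dropout."""
    h = x @ weights
    mask = rng.random(h.shape) >= drop_p
    return h * mask / (1.0 - drop_p)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))        # 8 "sentences" as random feature vectors
w = rng.normal(size=(16, 32))
z1 = toy_encoder(x, w, rng)         # first pass, one dropout mask
z2 = toy_encoder(x, w, rng)         # second pass, a different dropout mask
loss = info_nce_loss(z1, z2)
```

    The other sentences in the batch act as negatives, so the loss is small when each embedding is closest to its own second view and grows when the pairs are misaligned.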
    GeoPointGAN: Synthetic Spatial Data with Local Label Differential Privacy. (arXiv:2205.08886v1 [cs.LG])
    Synthetic data generation is a fundamental task for many data management and data science applications. Spatial data is of particular interest, and its sensitive nature often leads to privacy concerns. We introduce GeoPointGAN, a novel GAN-based solution for generating synthetic spatial point datasets with high utility and strong individual level privacy guarantees. GeoPointGAN's architecture includes a novel point transformation generator that learns to project randomly generated point co-ordinates into meaningful synthetic co-ordinates that capture both microscopic (e.g., junctions, squares) and macroscopic (e.g., parks, lakes) geographic features. We provide our privacy guarantees through label local differential privacy, which is more practical than traditional local differential privacy. We seamlessly integrate this level of privacy into GeoPointGAN by augmenting the discriminator to the point level and implementing a randomized response-based mechanism that flips the labels associated with the 'real' and 'fake' points used in training. Extensive experiments show that GeoPointGAN significantly outperforms recent solutions, improving by up to 10 times compared to the most competitive baseline. We also evaluate GeoPointGAN using range, hotspot, and facility location queries, which confirm the practical effectiveness of GeoPointGAN for privacy-preserving querying. The results illustrate that a strong level of privacy is achieved with little-to-no adverse utility cost, which we explain through the generalization and regularization effects that are realized by flipping the labels of the data during training.  ( 2 min )
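    The label-flipping step can be illustrated with classic binary randomized response, where each 'real'/'fake' label is kept with probability e^eps / (1 + e^eps) and flipped otherwise. This is a sketch of the textbook mechanism, assumed here for illustration; the abstract does not give the exact mechanism or privacy budget, and the function names are hypothetical.

```python
import numpy as np

def flip_probability(epsilon):
    """Probability of flipping a binary label under randomized response."""
    return 1.0 / (1.0 + np.exp(epsilon))

def randomized_response(labels, epsilon, rng):
    """Flip each 0/1 label independently with the probability above."""
    labels = np.asarray(labels)
    flip = rng.random(labels.shape) < flip_probability(epsilon)
    return np.where(flip, 1 - labels, labels)

rng = np.random.default_rng(0)
real_fake = rng.integers(0, 2, size=100_000)   # 1 = 'real' point, 0 = 'fake'
noisy = randomized_response(real_fake, epsilon=1.0, rng=rng)
observed_flip_rate = np.mean(noisy != real_fake)
```

    A smaller `epsilon` flips more labels (stronger privacy, noisier discriminator targets), while a large `epsilon` leaves the labels almost untouched.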
    Pluralistic Image Completion with Probabilistic Mixture-of-Experts. (arXiv:2205.09086v1 [cs.CV])
    Pluralistic image completion focuses on generating both visually realistic and diverse results for image completion. Prior methods have enjoyed empirical success on this task. However, the constraints they use for pluralistic image completion are arguably not well interpretable and are unsatisfactory in two respects. First, the constraints for visual realism can be weakly correlated with the objective of image completion, or even redundant. Second, the constraints for diversity are designed to be task-agnostic, which prevents them from working well. In this paper, to address these issues, we propose an end-to-end probabilistic method. Specifically, we introduce a unified probabilistic graph model that represents the complex interactions in image completion. The entire procedure of image completion is then mathematically divided into several sub-procedures, which helps enforce constraints efficiently. The sub-procedure directly related to pluralistic results is identified, where the interaction is established by a Gaussian mixture model (GMM). The inherent parameters of the GMM are task-related and are optimized adaptively during training, while the number of its primitives conveniently controls the diversity of results. We formally establish the effectiveness of our method and demonstrate it with comprehensive experiments.
    Bridging the gap between QP-based and MPC-based RL. (arXiv:2205.08856v1 [eess.SY])
    Reinforcement learning methods typically use Deep Neural Networks to approximate the value functions and policies underlying a Markov Decision Process. Unfortunately, DNN-based RL suffers from a lack of explainability of the resulting policy. In this paper, we instead approximate the policy and value functions using an optimization problem, taking the form of Quadratic Programs (QPs). We propose simple tools to promote structures in the QP, pushing it to resemble a linear MPC scheme. A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy and additionally provides ways to analyze it. The tools we propose allow for continuously adjusting the trade-off between the former and the latter during learning. We illustrate the workings of our proposed method with the resulting structure using a point-mass task.
    Imagining new futures beyond predictive systems in child welfare: A qualitative study with impacted stakeholders. (arXiv:2205.08928v1 [cs.HC])
    Child welfare agencies across the United States are turning to data-driven predictive technologies (commonly called predictive analytics) which use government administrative data to assist workers' decision-making. While some prior work has explored impacted stakeholders' concerns with current uses of data-driven predictive risk models (PRMs), less work has asked stakeholders whether such tools ought to be used in the first place. In this work, we conducted a set of seven design workshops with 35 stakeholders who have been impacted by the child welfare system or who work in it to understand their beliefs and concerns around PRMs, and to engage them in imagining new uses of data and technologies in the child welfare system. We found that participants worried current PRMs perpetuate or exacerbate existing problems in child welfare. Participants suggested new ways to use data and data-driven tools to better support impacted communities and suggested paths to mitigate possible harms of these tools. Participants also suggested low-tech or no-tech alternatives to PRMs to address problems in child welfare. Our study sheds light on how researchers and designers can work in solidarity with impacted communities, possibly to circumvent or oppose child welfare agencies.  ( 2 min )
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v1 [cs.LG])
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.  ( 2 min )
    Large Neural Networks Learning from Scratch with Very Few Data and without Regularization. (arXiv:2205.08836v1 [cs.CV])
    Recent findings have shown that Neural Networks generalize also in over-parametrized regimes with zero training error. This is surprising, since it is completely against traditional machine learning wisdom. In our empirical study we fortify these findings in the domain of fine-grained image classification. We show that very large Convolutional Neural Networks with millions of weights do learn with only a handful of training samples and without image augmentation, explicit regularization or pretraining. We train the architectures ResNet018, ResNet101 and VGG19 on subsets of the difficult benchmark datasets Caltech101, CUB_200_2011, FGVCAircraft, Flowers102 and StanfordCars with 100 classes and more, perform a comprehensive comparative study and draw implications for the practical application of CNNs. Finally, we show that VGG19 with 140 million weights learns to distinguish airplanes and motorbikes up to 95% accuracy with only 20 samples per class.  ( 2 min )
    Structural Extensions of Basis Pursuit: Guarantees on Adversarial Robustness. (arXiv:2205.08955v1 [cs.LG])
    While deep neural networks are sensitive to adversarial noise, sparse coding using the Basis Pursuit (BP) method is robust against such attacks, including its multi-layer extensions. We prove that the stability theorem of BP holds upon the following generalizations: (i) the regularization procedure can be separated into disjoint groups with different weights, (ii) neurons or full layers may form groups, and (iii) the regularizer takes various generalized forms of the $\ell_1$ norm. This result provides the proof for the architectural generalizations of Cazenavette et al. (2021), including (iv) an approximation of the complete architecture as a shallow sparse coding network. Due to this approximation, we settled on experimenting with shallow networks and studied their robustness against the Iterative Fast Gradient Sign Method on a synthetic dataset and MNIST. We introduce classification based on the $\ell_2$ norms of the groups and show numerically that it can be accurate and offers considerable speedups. In this family, the linear transformer shows the best performance. Based on the theoretical results and the numerical simulations, we highlight numerical matters that may improve performance further.
    Learning latent representations for operational nitrogen response rate prediction. (arXiv:2205.09025v1 [cs.LG])
    Learning latent representations has aided operational decision-making in several disciplines. Its advantages include uncovering hidden interactions in data and automating procedures which were performed manually in the past. Representation learning is also being adopted by earth and environmental sciences. However, there are still subfields that depend on manual feature engineering based on expert knowledge and the use of algorithms which do not utilize the latent space. Relying on those techniques can inhibit operational decision-making since they impose data constraints and inhibit automation. In this work, we adopt a case study for nitrogen response rate prediction and examine if representation learning can be used for operational use. We compare a Multilayer Perceptron, an Autoencoder, and a dual-head Autoencoder with a reference Random Forest model for nitrogen response rate prediction. To bring the predictions closer to an operational setting we assume absence of future weather data, and we are evaluating the models using error metrics and a domain-derived error threshold. The results show that learning latent representations can provide operational nitrogen response rate predictions by offering performance equal and sometimes better than the reference model.  ( 2 min )
    Exploring the Advantages of Dense-Vector to One-Hot Encoding of Intent Classes in Out-of-Scope Detection Tasks. (arXiv:2205.09021v1 [cs.LG])
    This work explores the intrinsic limitations of the popular one-hot encoding method in classification of intents when detection of out-of-scope (OOS) inputs is required. Although recent work has shown that there can be significant improvements in OOS detection when the intent classes are represented as dense-vectors based on domain specific knowledge, we argue in this paper that such gains are more likely due to advantages of dense-vector to one-hot encoding methods in representing the complexity of the OOS space. We start by showing how dense-vector encodings can create OOS spaces with much richer topologies than one-hot encoding methods. We then demonstrate empirically, using four standard intent classification datasets, that knowledge-free, randomly generated dense-vector encodings of intent classes can yield massive, over 20% gains over one-hot encodings, and also outperform the previous, domain knowledge-based, SOTA of one of the datasets. We finish by describing a novel algorithm to search for good dense-vector encodings and present initial but promising experimental results of its use.  ( 2 min )
    Generating Explanations from Deep Reinforcement Learning Using Episodic Memory. (arXiv:2205.08926v1 [cs.AI])
    Deep Reinforcement Learning (RL) involves the use of Deep Neural Networks (DNNs) to make sequential decisions in order to maximize reward. For many tasks the resulting sequence of actions produced by a Deep RL policy can be long and difficult to understand for humans. A crucial component of human explanations is selectivity, whereby only key decisions and causes are recounted. Imbuing Deep RL agents with such an ability would make their resulting policies easier to understand from a human perspective and generate a concise set of instructions to aid the learning of future agents. To this end we use a Deep RL agent with an episodic memory system to identify and recount key decisions during policy execution. We show that these decisions form a short, human readable explanation that can also be used to speed up the learning of naive Deep RL agents in an algorithm-independent manner.  ( 2 min )
    World Value Functions: Knowledge Representation for Multitask Reinforcement Learning. (arXiv:2205.08827v1 [cs.LG])
    An open problem in artificial intelligence is how to learn and represent knowledge that is sufficient for a general agent that needs to solve multiple tasks in a given world. In this work we propose world value functions (WVFs), which are a type of general value function with mastery of the world - they represent not only how to solve a given task, but also how to solve any other goal-reaching task. To achieve this, we equip the agent with an internal goal space defined as all the world states where it experiences a terminal transition - a task outcome. The agent can then modify task rewards to define its own reward function, which provably drives it to learn how to achieve all achievable internal goals, and the value of doing so in the current task. We demonstrate a number of benefits of WVFs. When the agent's internal goal space is the entire state space, we demonstrate that the transition function can be inferred from the learned WVF, which allows the agent to plan using learned value functions. Additionally, we show that for tasks in the same world, a pretrained agent that has learned any WVF can then infer the policy and value function for any new task directly from its rewards. Finally, an important property for long-lived agents is the ability to reuse existing knowledge to solve new tasks. Using WVFs as the knowledge representation for learned tasks, we show that an agent is able to solve their logical combination zero-shot, resulting in a combinatorially increasing number of skills throughout their lifetime.  ( 2 min )
    Position Aided Beam Prediction in the Real World: How Useful GPS Locations Actually Are? (arXiv:2205.09054v1 [eess.SP])
    Millimeter-wave (mmWave) communication systems rely on narrow beams for achieving sufficient receive signal power. Adjusting these beams is typically associated with large training overhead, which becomes particularly critical for highly-mobile applications. Intuitively, since optimal beam selection can benefit from the knowledge of the positions of communication terminals, there has been increasing interest in leveraging position data to reduce the overhead in mmWave beam prediction. Prior work, however, studied this problem using only synthetic data that generally does not accurately represent real-world measurements. In this paper, we investigate position-aided beam prediction using a real-world large-scale dataset to derive insights into precisely how much overhead can be saved in practice. Furthermore, we analyze which machine learning algorithms perform best, what factors degrade inference performance in real data, and which machine learning metrics are more meaningful in capturing the actual communication system performance.  ( 2 min )
    Sharp asymptotics on the compression of two-layer neural networks. (arXiv:2205.08199v2 [cs.IT] UPDATED)
    In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. For a ReLU activation function, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
    Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml. (arXiv:2205.07690v1 [cs.CV] CROSS LISTED)
    In this paper, we investigate how field programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when increasing the batch size to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show, through aggressive filter reduction and heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, that the power consumption and resource utilization can be significantly reduced while maintaining accuracy on the Cityscapes dataset.
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v2 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). Then, we approximate the resultant GP by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). ICK is flexible and can be used to include prior information in neural networks in many applications. We demonstrate the strength of our framework by showing its superior performance and flexibility on both synthetic and real-world data sets. The code is available at: https://anonymous.4open.science/r/ICK_NNGP-17C5/.
    "What makes a question inquisitive?" A Study on Type-Controlled Inquisitive Question Generation. (arXiv:2205.08056v2 [cs.CL] UPDATED)
    We propose a type-controlled framework for inquisitive question generation. We annotate an inquisitive question dataset with question types, train question type classifiers, and finetune models for type-controlled question generation. Empirical results demonstrate that we can generate a variety of questions that adhere to specific types while drawing from the source texts. We also investigate strategies for selecting a single question from a generated set, considering both an informative vs. inquisitive question classifier and a pairwise ranker trained from a small set of expert annotations. Question selection using the pairwise ranker yields strong results in automatic and manual evaluation. Our human evaluation assesses multiple aspects of the generated questions, finding that the ranker chooses questions with the best syntax (4.59), semantics (4.37), and inquisitiveness (3.92) on a scale of 1-5, even rivaling the performance of human-written questions.
    A Comparative Analysis of Machine Learning Techniques for IoT Intrusion Detection. (arXiv:2111.13149v2 [cs.CR] CROSS LISTED)
    The digital transformation faces tremendous security challenges. In particular, the growing number of cyber-attacks targeting Internet of Things (IoT) systems restates the need for a reliable detection of malicious network activity. This paper presents a comparative analysis of supervised, unsupervised and reinforcement learning techniques on nine malware captures of the IoT-23 dataset, considering both binary and multi-class classification scenarios. The developed models consisted of Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Isolation Forest (iForest), Local Outlier Factor (LOF) and a Deep Reinforcement Learning (DRL) model based on a Double Deep Q-Network (DDQN), adapted to the intrusion detection context. The most reliable performance was achieved by LightGBM. Nonetheless, iForest displayed good anomaly detection results and the DRL model demonstrated the possible benefits of employing this methodology to continuously improve the detection. Overall, the obtained results indicate that the analyzed techniques are well suited for IoT intrusion detection.
    GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems. (arXiv:2201.05024v3 [eess.SP] UPDATED)
    In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for a massive connectivity scenario with the non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency division multiplexing (OFDM) radio frame requires the execution of a large number of inner product operations, which are the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond latency detection to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projection onto convex sets (POCS) and adaptive projected subgradient method (APSM) algorithms. Experimental results and comparisons with the state-of-art confirm the effectiveness of our techniques.
    MonoTrack: Shuttle trajectory reconstruction from monocular badminton video. (arXiv:2204.01899v2 [cs.CV] UPDATED)
    Trajectory estimation is a fundamental component of racket sport analytics, as the trajectory contains information not only about the winning and losing of each point, but also how it was won or lost. In sports such as badminton, players benefit from knowing the full 3D trajectory, as the height of shuttlecock or ball provides valuable tactical information. Unfortunately, 3D reconstruction is a notoriously hard problem, and standard trajectory estimators can only track 2D pixel coordinates. In this work, we present the first complete end-to-end system for the extraction and segmentation of 3D shuttle trajectories from monocular badminton videos. Our system integrates badminton domain knowledge such as court dimension, shot placement, physical laws of motion, along with vision-based features such as player poses and shuttle tracking. We find that significant engineering efforts and model improvements are needed to make the overall system robust, and as a by-product of our work, improve state-of-the-art results on court recognition, 2D trajectory estimation, and hit recognition.
    Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning. (arXiv:2204.09418v2 [cs.MA] UPDATED)
    Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is tough to learn the model of the environment. The significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, making agents have the foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in different partially observable Markov decision process domains.
    Noise mitigation strategies in physical feedforward neural networks. (arXiv:2204.09461v2 [cs.NE] UPDATED)
    Physical neural networks are promising candidates for next generation artificial intelligence hardware. In such architectures, neurons and connections are physically realized and do not leverage digital concepts with their practically infinite signal-to-noise ratio to encode, transduce and transform information. They are therefore prone to noise with a variety of statistical and architectural properties, and effective strategies leveraging network-inherent assets to mitigate noise in a hardware-efficient manner are important in the pursuit of next generation neural network hardware. Based on analytical derivations, we here introduce and analyse a variety of different noise-mitigation approaches. We analytically show that intra-layer connections in which the connection matrix's squared mean exceeds the mean of its square fully suppress uncorrelated noise. We go beyond this and develop two synergistic strategies for noise that is uncorrelated and correlated across populations of neurons. First, we introduce the concept of ghost neurons, where each group of neurons perturbed by correlated noise has a negative connection to a single neuron, yet without receiving any input information. Secondly, we show that pooling of neuron populations is an efficient approach to suppress uncorrelated noise. As such, we developed a general noise mitigation strategy leveraging the statistical properties of the different noise terms most relevant in analogue hardware. Finally, we demonstrate the effectiveness of this combined approach for a trained neural network classifying the MNIST handwritten digits, for which we achieve a 4-fold improvement of the output signal-to-noise ratio and increase the classification accuracy almost to the level of the noise-free network.
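    The pooling claim can be checked with a small simulation: averaging N neurons whose noise is independent shrinks the noise standard deviation by roughly 1/sqrt(N), whereas noise shared across the whole population passes through the average unchanged. This is a generic statistical sketch, not the paper's hardware model. The function name and all parameter values are hypothetical.

```python
import numpy as np

def pooled_noise_std(n_neurons, sigma=1.0, correlated=False,
                     n_trials=20000, seed=0):
    """Standard deviation of the noise left after averaging a neuron population."""
    rng = np.random.default_rng(seed)
    if correlated:
        # One noise draw per trial, shared by every neuron in the population.
        shared = rng.normal(0.0, sigma, size=(n_trials, 1))
        noise = np.repeat(shared, n_neurons, axis=1)
    else:
        # An independent noise draw for each neuron.
        noise = rng.normal(0.0, sigma, size=(n_trials, n_neurons))
    return noise.mean(axis=1).std()

uncorrelated_std = pooled_noise_std(16)                 # close to 1/sqrt(16)
correlated_std = pooled_noise_std(16, correlated=True)  # close to sigma
```

    The contrast between the two cases shows why pooling alone only addresses uncorrelated noise, and why correlated noise calls for a separate strategy such as the ghost neurons described above.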
    Markov Abstractions for PAC Reinforcement Learning in Non-Markov Decision Processes. (arXiv:2205.01053v2 [cs.LG] UPDATED)
    Our work aims at developing reinforcement learning algorithms that do not rely on the Markov assumption. We consider the class of Non-Markov Decision Processes where histories can be abstracted into a finite set of states while preserving the dynamics. We call it a Markov abstraction since it induces a Markov Decision Process over a set of states that encode the non-Markov dynamics. This phenomenon underlies the recently introduced Regular Decision Processes (as well as POMDPs where only a finite number of belief states is reachable). In all such kinds of decision process, an agent that uses a Markov abstraction can rely on the Markov property to achieve optimal behaviour. We show that Markov abstractions can be learned during reinforcement learning. Our approach combines automata learning and classic reinforcement learning. For these two tasks, standard algorithms can be employed. We show that our approach has PAC guarantees when the employed algorithms have PAC guarantees, and we also provide an experimental evaluation.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v2 [cs.IT] UPDATED)
    Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to an uncorrelated learning. This finding leads to design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
    Practical Insights of Repairing Model Problems on Image Classification. (arXiv:2205.07116v1 [cs.LG] CROSS LISTED)
    Additional training of a deep learning model can cause negative effects on the results, turning an initially positive sample into a negative one (degradation). Such degradation is possible in real-world use cases due to the diversity of sample characteristics. That is, a set of samples is a mixture of critical ones which should not be missed and less important ones. Therefore, we cannot understand the performance by accuracy alone. While existing research aims to prevent model degradation, insights into the related methods are needed to grasp their benefits and limitations. In this talk, we will present implications derived from a comparison of methods for reducing degradation. In particular, we formulated use cases for industrial settings in terms of arrangements of a data set. The results imply that, because of a trade-off between accuracy and preventing degradation, a practitioner should continually reassess which method is best, taking into account dataset availability and the life cycle of the AI system.
    ACReL: Adversarial Conditional value-at-risk Reinforcement Learning. (arXiv:2109.09470v2 [cs.LG] UPDATED)
    In the classical Reinforcement Learning (RL) setting, one aims to find a policy that maximizes its expected return. This objective may be inappropriate in safety-critical domains such as healthcare or autonomous driving, where intrinsic uncertainties due to stochastic policies and environment variability may lead to catastrophic failures. This can be addressed by using the Conditional-Value-at-Risk (CVaR) objective to instill risk-aversion in learned policies. In this paper, we propose Adversarial Cvar Reinforcement Learning (ACReL), a novel adversarial meta-algorithm to optimize the CVaR objective in RL. ACReL is based on a max-min between a policy player and a learned adversary that perturbs the policy player's state transitions given a finite budget. We prove that, the closer the players are to the game's equilibrium point, the closer the learned policy is to the CVaR-optimal one with a risk tolerance explicitly related to the adversary's budget. We provide a gradient-based training procedure to solve the proposed game by formulating it as a Stackelberg game, enabling the use of deep RL architectures and training algorithms. Empirical experiments show that ACReL matches a CVaR RL state-of-the-art baseline for retrieving CVaR optimal policies, while also benefiting from theoretical guarantees.
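    To make the difference between the expected-return objective and the CVaR objective concrete, the following is a minimal sketch of an empirical CVaR computed from sampled episode returns. It illustrates the objective only, not the ACReL algorithm; the function name `empirical_cvar`, the lower-tail convention, and the sample returns are assumptions made for illustration.

```python
import math


def empirical_cvar(returns, alpha):
    """Empirical CVaR at level alpha: mean of the worst ceil(alpha * n) returns.

    Assumes higher returns are better, so the "tail" is the lowest returns.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


# Hypothetical episode returns: mostly fine, with two bad outcomes.
returns = [10.0, 12.0, -50.0, 11.0, 9.0, 13.0, -5.0, 10.0, 12.0, 11.0]
print(sum(returns) / len(returns))   # expected return over all episodes
print(empirical_cvar(returns, 0.2))  # -27.5, the mean of the two worst episodes
```

    With alpha = 1 the quantity reduces to the ordinary mean, which is why a smaller alpha corresponds to a more risk-averse objective.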
    FastCover: An Unsupervised Learning Framework for Multi-Hop Influence Maximization in Social Networks. (arXiv:2111.00463v2 [cs.SI] UPDATED)
    Finding influential users in social networks is a fundamental problem with many possible useful applications. Viewing the social network as a graph, the influence of a set of users can be measured by the number of neighbors located within a given number of hops in the network, where each hop marks a step of influence diffusion. In this paper, we reduce the problem of influence maximization (IM) to a budget-constrained d-hop dominating set problem (kdDSP). We propose a unified machine learning (ML) framework, FastCover, to solve kdDSP by learning an efficient greedy strategy in an unsupervised way. As one critical component of the framework, we devise a novel graph neural network (GNN) architecture, graph reversed attention network (GRAT), that captures the diffusion process among neighbors. Unlike most heuristic algorithms and concurrent ML frameworks for combinatorial optimization problems, FastCover determines the entire seed set from the nodes' scores computed with only one forward propagation of the GNN and has a time complexity quasi-linear in the graph size. Experiments on synthetic graphs and real-world social networks demonstrate that FastCover finds solutions of better or comparable quality to those of the concurrent algorithms while achieving a speedup of over 1000x.
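    For orientation, the budget-constrained d-hop coverage objective can be sketched with a classic greedy baseline: repeatedly pick the node whose d-hop neighborhood covers the most still-uncovered nodes. This is a textbook heuristic shown only to illustrate the problem, not the FastCover method or the GRAT network; the function names, the undirected toy graph, and the tie-breaking rule (smallest node id) are assumptions.

```python
from collections import deque


def d_hop_ball(adj, source, d):
    """Nodes reachable from source within d hops (source included), via BFS."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == d:
            continue  # do not expand beyond d hops
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen


def greedy_seeds(adj, k, d):
    """Pick up to k seeds, each maximizing the number of newly covered nodes."""
    balls = {v: d_hop_ball(adj, v, d) for v in adj}
    covered, seeds = set(), []
    for _ in range(min(k, len(adj))):
        candidates = [v for v in sorted(adj) if v not in seeds]
        # max() keeps the first maximal candidate, so ties go to the smallest id.
        best = max(candidates, key=lambda v: len(balls[v] - covered))
        seeds.append(best)
        covered |= balls[best]
    return seeds, covered


# Toy undirected path graph 0 - 1 - 2 - 3 - 4 - 5 - 6.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
seeds, covered = greedy_seeds(adj, k=2, d=1)
print(seeds, len(covered))  # [1, 4] 6
```

    The greedy loop recomputes marginal gains after every pick, which is the cost that a single-pass scoring approach would avoid.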
    SLISEMAP: Supervised dimensionality reduction through local explanations. (arXiv:2201.04455v2 [cs.LG] UPDATED)
    Existing methods for explaining black box learning models often focus on building local explanations of model behaviour for a particular data item. It is possible to create global explanations for all data items, but these explanations generally have low fidelity for complex black box models. We propose a new supervised manifold visualisation method, SLISEMAP, that simultaneously finds local explanations for all data items and builds a (typically) two-dimensional global visualisation of the black box model such that data items with similar local explanations are projected nearby. We provide a mathematical derivation of our problem and an open source implementation built on the GPU-optimised PyTorch library. We compare SLISEMAP to multiple popular dimensionality reduction methods and find that SLISEMAP is able to utilise labelled data to create embeddings with consistent local white box models. We also compare SLISEMAP to other model-agnostic local explanation methods and show that SLISEMAP provides comparable explanations and that the visualisations can give a broader understanding of black box regression and classification models.
    Preserving Privacy and Security in Federated Learning. (arXiv:2202.03402v2 [cs.LG] UPDATED)
    Federated learning is known to be vulnerable to both security and privacy issues. Existing research has focused either on preventing poisoning attacks from users or on concealing the local model updates from the server, but not both. However, integrating these two lines of research remains a crucial challenge since they often conflict with one another with respect to the threat model. In this work, we develop a principled framework that offers both privacy guarantees for users and detection against poisoning attacks from them. With a new threat model that includes both an honest-but-curious server and malicious users, we first propose a secure aggregation protocol using homomorphic encryption for the server to combine local model updates in a private manner. Then, a zero-knowledge proof protocol is leveraged to shift the task of detecting attacks in the local models from the server to the users. The key observation here is that the server no longer needs access to the local models for attack detection. Therefore, our framework enables the central server to identify poisoned model updates without violating the privacy guarantees of secure aggregation.
    Probing Pretrained Models of Source Code. (arXiv:2202.08975v2 [cs.SE] UPDATED)
    Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. However, recently general pretrained models such as CodeBERT or CodeT5 have been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnostic probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.
    Molformer: Motif-based Transformer on 3D Heterogeneous Molecular Graphs. (arXiv:2110.01191v5 [q-bio.QM] UPDATED)
    Procuring expressive molecular representations underpins AI-driven molecule design and scientific discovery. The research to date mainly focuses on atom-level homogeneous molecular graphs, ignoring the rich information in subgraphs or motifs. However, it has been widely accepted that substructures play a dominant role in the identification and determination of molecular properties. To address such issues, we formulate heterogeneous molecular graphs (HMGs), and introduce Molformer to exploit both molecular motifs and 3D geometry. Specifically, we extract functional groups as motifs for small molecules and resort to reinforcement learning to adaptively select quaternary amino acids as motifs for proteins. Then HMGs are constructed with both atom-level and motif-level nodes. To better accommodate those HMGs, we introduce a variant of Transformer named Molformer, which adopts a heterogeneous self-attention layer to distinguish the interactions between multi-level nodes. It is also coupled with a multi-scale mechanism to capture local fine-grained patterns with increasing contextual scales. An attentive farthest point sampling algorithm is also proposed to obtain the molecular representations. We validate Molformer across a few domains including quantum chemistry, physiology, and biophysics. Experiments show that Molformer outperforms state-of-the-art baselines. Our work provides a promising way to utilize informative motifs from the perspective of multi-level graph construction.
    HAVEN: Hierarchical Cooperative Multi-Agent Reinforcement Learning with Dual Coordination Mechanism. (arXiv:2110.07246v2 [cs.MA] UPDATED)
    Multi-agent reinforcement learning often suffers from the exponentially large action space caused by a large number of agents. This paper proposes a novel value decomposition framework, HAVEN, based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability that arises from the concurrent optimization of high-level and low-level policies and another concurrent optimization of agents, we introduce the dual coordination mechanism of inter-layer strategies and inter-agent strategies. HAVEN does not require domain knowledge or pretraining, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and offers an efficient solution to partition the decision space implicitly.
    Doubly Robust Collaborative Targeted Learning for Debiased Recommendations. (arXiv:2203.10258v2 [cs.IR] UPDATED)
    In recommender systems, the collected data always contains various biases and leads to the challenge of accurate predictions. To address selection bias and confounding bias, the doubly robust (DR) method and its variants show superior performance due to the double robustness property and smaller bias under inaccurate propensity and error imputation models. However, we theoretically show that the variance of the error imputation-based (EIB) method is much smaller than that of DR, although EIB may suffer from a much larger bias. In this paper, we propose a doubly robust targeted learning method that effectively combines the small-bias property of DR and the small-variance property of EIB, by leveraging the targeted maximum likelihood estimation technique. Theoretical analysis shows that the proposed targeted learning is effective in reducing the variance of DR while maintaining double robustness. To further reduce the bias and variance during the training process, we propose a novel collaborative targeted learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
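    The two estimators contrasted above can be written down in a few lines. The sketch below uses the forms commonly given in the debiased-recommendation literature for the error-imputation-based (EIB) and doubly robust (DR) estimators of the average prediction error; it does not implement the proposed targeted learning method, and the function names, argument layout, and toy numbers are assumptions for illustration.

```python
def eib_error(imputed, observed, errors):
    """EIB: use the true error where a rating is observed, the imputed error elsewhere."""
    n = len(imputed)
    return sum(o * e + (1 - o) * g
               for o, e, g in zip(observed, errors, imputed)) / n


def dr_error(imputed, observed, errors, propensities):
    """DR: imputed error plus a propensity-weighted correction on observed entries."""
    n = len(imputed)
    return sum(g + o * (e - g) / p
               for o, e, g, p in zip(observed, errors, imputed, propensities)) / n


# Toy data: 4 user-item pairs, 2 observed. Unobserved true errors are unknown in
# practice; they are multiplied by o = 0, so any placeholder value works.
errors = [1.0, 0.0, 0.2, 0.0]
observed = [1, 0, 1, 0]
imputed = [0.9, 0.4, 0.3, 0.7]
propensities = [0.8, 0.3, 0.5, 0.2]  # hypothetical estimated observation probabilities

print(eib_error(imputed, observed, errors))
print(dr_error(imputed, observed, errors, propensities))
```

    The 1/p weighting in the DR correction term is the source of its larger variance when propensities are small, whereas EIB has no such weight but inherits any bias of the imputation model.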
    Predicting Berth Stay for Tanker Terminals: A Systematic and Dynamic Approach. (arXiv:2204.04085v2 [cs.CE] UPDATED)
    Given the trend of digitization and the growing volume of maritime transport, predicting vessel berth stay has become a requirement for operations research and scheduling optimization in the era of maritime big data, and it plays a significant part in improving port efficiency and maritime logistics. This study proposes a systematic and dynamic approach to predicting berth stay for tanker terminals. The approach covers three innovative aspects: 1) The data sources employed are multi-faceted, including cargo operation data from tanker terminals, time-series data from the automatic identification system (AIS), etc. 2) The berth stay process is decomposed into multiple blocks according to data analysis and information extraction, and practical operation scenarios are developed accordingly. 3) The predictive models of berth stay are developed on the basis of prior data analysis and information extraction under two methods, regression and decomposed distribution. The models are evaluated under four dynamic scenarios with certain designated cargoes at two different terminals. The evaluation results show that the proposed approach can predict berth stay with an accuracy of up to 98.81% validated by historical baselines, and also demonstrate that the proposed approach can predict berth stay dynamically across the scenarios. The model may potentially be applied to short-term pilot booking or scheduling optimization within a reasonable time frame to advance port intelligence and logistics efficiency.
    Decentral and Incentivized Federated Learning Frameworks: A Systematic Literature Review. (arXiv:2205.07855v2 [cs.LG] UPDATED)
    The advent of Federated Learning (FL) has ignited a new paradigm for parallel and confidential decentralized Machine Learning (ML) with the potential of utilizing the computational power of a vast number of IoT, mobile and edge devices without data leaving the respective device, ensuring privacy by design. Yet, in order to scale this new paradigm beyond small groups of already entrusted entities towards mass adoption, the Federated Learning Framework (FLF) has to become (i) truly decentralized and (ii) participants have to be incentivized. This is the first systematic literature review analyzing holistic FLFs in the domain of both decentralized and incentivized federated learning. 422 publications were retrieved by querying 12 major scientific databases. Finally, 40 articles remained after a systematic review and filtering process for in-depth examination. Although they have massive potential to direct the future of a more distributed and secure AI, none of the analyzed FLFs is production-ready. The approaches vary heavily in terms of use-cases, system design, solved issues and thoroughness. We are the first to provide a systematic approach to classify and quantify differences between FLFs, exposing limitations of current works and deriving future directions for research in this novel domain.
    You Only Cut Once: Boosting Data Augmentation with a Single Cut. (arXiv:2201.12078v2 [cs.CV] UPDATED)
    We present You Only Cut Once (YOCO) for performing data augmentations. YOCO cuts one image into two pieces and performs data augmentations individually within each piece. Applying YOCO improves the diversity of the augmentation per sample and encourages neural networks to recognize objects from partial information. YOCO is parameter-free, easy to use, and boosts almost all augmentations at no extra cost. Thorough experiments are conducted to evaluate its effectiveness. We first demonstrate that YOCO can be seamlessly applied to various data augmentations and neural network architectures, and that it brings performance gains on CIFAR and ImageNet classification tasks, sometimes surpassing conventional image-level augmentation by large margins. Moreover, we show YOCO benefits contrastive pre-training toward a more powerful representation that can be better transferred to multiple downstream tasks. Finally, we study a number of variants of YOCO and empirically analyze the performance for respective settings. Code is available at GitHub.
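    The cut-then-augment idea can be sketched on a tiny image stored as a list of pixel rows. This is only an illustration of the description in the abstract, not the authors' released code: the cut direction (along the width), the midpoint cut location, the helper names, and the choice of horizontal flip as the example augmentation are all assumptions.

```python
import random


def hflip(piece):
    """Flip a piece left to right."""
    return [row[::-1] for row in piece]


def identity(piece):
    """Return an unchanged copy of the piece."""
    return [row[:] for row in piece]


def yoco_width(image, augment_left, augment_right):
    """Cut the image into left and right halves, augment each, and rejoin them."""
    mid = len(image[0]) // 2
    left = augment_left([row[:mid] for row in image])
    right = augment_right([row[mid:] for row in image])
    return [l + r for l, r in zip(left, right)]


def random_yoco(image, rng):
    """Draw an independent augmentation for each half, as a training loop might."""
    return yoco_width(image, rng.choice([hflip, identity]), rng.choice([hflip, identity]))


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
print(yoco_width(image, hflip, identity))  # [[2, 1, 3, 4], [6, 5, 7, 8]]
print(random_yoco(image, random.Random(0)))
```

    Because each half draws its augmentation independently, one image can yield combinations that a single image-level augmentation cannot produce.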
    Exploring Deep Reinforcement Learning-Assisted Federated Learning for Online Resource Allocation in Privacy-Persevering EdgeIoT. (arXiv:2202.07391v2 [cs.LG] UPDATED)
    Federated learning (FL) has been increasingly considered to preserve data training privacy from eavesdropping attacks in mobile edge computing-based Internet of Things (EdgeIoT). On the one hand, the learning accuracy of FL can be improved by selecting the IoT devices with large datasets for training, which gives rise to a higher energy consumption. On the other hand, the energy consumption can be reduced by selecting the IoT devices with small datasets for FL, resulting in a falling learning accuracy. In this paper, we formulate a new resource allocation problem for privacy-persevering EdgeIoT to balance the learning accuracy of FL and the energy consumption of the IoT device. We propose a new federated learning-enabled twin-delayed deep deterministic policy gradient (FL-DLT3) framework to achieve the optimal accuracy and energy balance in a continuous domain. Furthermore, long short-term memory (LSTM) is leveraged in FL-DLT3 to predict the time-varying network state while FL-DLT3 is trained to select the IoT devices and allocate the transmit power. Numerical results demonstrate that the proposed FL-DLT3 achieves fast convergence (less than 100 iterations) while the FL accuracy-to-energy consumption ratio is improved by 51.8% compared to the existing state-of-the-art benchmark.
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v2 [q-bio.NC] UPDATED)
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.
    CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. (arXiv:2103.06874v4 [cs.CL] UPDATED)
    Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
    Describing Differences between Text Distributions with Natural Language. (arXiv:2201.12323v2 [cs.CL] UPDATED)
    How do two distributions of texts differ? Humans are slow at answering this, since discovering patterns might require tediously reading through hundreds of samples. We propose to automatically summarize the differences by "learning a natural language hypothesis": given two distributions $D_{0}$ and $D_{1}$, we search for a description that is more often true for $D_{1}$, e.g., "is military-related." To tackle this problem, we fine-tune GPT-3 to propose descriptions with the prompt: "[samples of $D_{0}$] + [samples of $D_{1}$] + the difference between them is_____." We then re-rank the descriptions by checking how often they hold on a larger set of samples with a learned verifier. On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and re-ranking, and our best system using GPT-3 Davinci (175B) reaches 76%. We apply our system to describe distribution shifts, debug dataset shortcuts, summarize unknown tasks, and label text clusters, and present analyses based on automatically generated descriptions.
    PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (arXiv:2202.06658v5 [cs.LG] UPDATED)
    Ensemble methods have been widely used to improve the generalization performance of machine learning methods, but they are hard to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) incurs an extremely high computational overhead. Recently, advanced techniques such as fast geometric ensembling (FGE) and snapshot ensemble have been proposed. These methods can train the model ensembles in the same time as a single model, thus getting around the hurdle of training time. However, their memory overhead for test-time inference remains much higher than that of single-model methods. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs, generated by successively-performed stochastic weight averaging procedures. Experimental results across different advanced DNN architectures on benchmark datasets CIFAR-$\{10,100\}$ and ImageNet demonstrate that PFGE matches the state-of-the-art FGE method in terms of the generalization error, yet requires only 20% memory overhead for test-time inference. Our code is available at https://github.com/ZJLAB-AMMI/PFGE.
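    The basic operation behind stochastic weight averaging is a running mean over weight snapshots collected along the training trajectory. The sketch below shows only that generic running average on flat lists of floats; it is not the PFGE procedure or its schedule, and the function name and the toy snapshots are assumptions for illustration.

```python
def swa_update(avg_weights, new_weights, n_averaged):
    """Fold one more snapshot into a running average of n_averaged snapshots."""
    return [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(avg_weights, new_weights)]


# Hypothetical weight snapshots taken at three points during training.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

avg, count = list(snapshots[0]), 1
for snap in snapshots[1:]:
    avg = swa_update(avg, snap, count)
    count += 1

print(avg)  # [3.0, 4.0], the element-wise mean of the three snapshots
```

    Keeping a single averaged weight vector instead of every snapshot is what makes an averaged model cheaper to store than an ensemble of the same snapshots.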
    Finite-Sum Coupled Compositional Stochastic Optimization: Theory and Applications. (arXiv:2202.12396v3 [math.OC] UPDATED)
    This paper studies stochastic optimization for a sum of compositional functions, where the inner-level function of each summand is coupled with the corresponding summation index. We refer to this family of problems as finite-sum coupled compositional optimization (FCCO). It has broad applications in machine learning for optimizing non-convex or convex compositional measures/objectives such as average precision (AP), p-norm push, listwise ranking losses, neighborhood component analysis (NCA), deep survival analysis, and deep latent variable models, which deserves finer analysis. Yet, existing algorithms and analyses are restricted in one or other aspects. The contribution of this paper is to provide a comprehensive analysis of a simple stochastic algorithm for both non-convex and convex objectives. The key results are improved oracle complexities with the parallel speed-up by the moving-average based stochastic estimator with mini-batching. Our theoretical analysis also exhibits new insights for improving the practical implementation by sampling the batches of equal size for the outer and inner levels. Numerical experiments on AP maximization, NCA, and p-norm push optimization corroborate some aspects of the theory.
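    One way to write the problem class described above, using notation assumed here for illustration rather than taken from the paper (outer functions $f_i$, inner functions $g_i$, per-index data distributions $\mathcal{D}_i$, moving-average weight $\gamma$, mini-batch $\mathcal{B}_i^{t}$), is:

```latex
% Objective: a finite sum in which each inner function depends on the index i.
\min_{\mathbf{w}} \; F(\mathbf{w})
  = \frac{1}{n} \sum_{i=1}^{n} f_i\bigl(g_i(\mathbf{w})\bigr),
\qquad
g_i(\mathbf{w}) = \mathbb{E}_{\xi \sim \mathcal{D}_i}\bigl[\, g_i(\mathbf{w}; \xi) \,\bigr].

% A moving-average estimate u_i of the inner value g_i(w), refreshed from a mini-batch.
u_i^{t+1} = (1 - \gamma)\, u_i^{t}
  + \gamma \, \frac{1}{|\mathcal{B}_i^{t}|} \sum_{\xi \in \mathcal{B}_i^{t}} g_i(\mathbf{w}_t; \xi),
\qquad \gamma \in (0, 1].
```

    The coupling is visible in the subscript: because $g_i$ changes with $i$, a separate running estimate $u_i$ is tracked for each summand instead of a single shared inner estimate.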
    A simple yet effective baseline for non-attributed graph classification. (arXiv:1811.03508v3 [cs.LG] UPDATED)
    Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective. As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves performance similar to that of state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with graph neural networks. Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited number of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
    A Scalable AutoML Approach Based on Graph Neural Networks. (arXiv:2111.00083v3 [cs.LG] UPDATED)
    AutoML systems build machine learning models automatically by performing a search over valid data transformations and learners, along with hyper-parameter optimization for each learner. Many AutoML systems use meta-learning to guide search for optimal pipelines. In this work, we present a novel meta-learning system called KGpip which, (1) builds a database of datasets and corresponding pipelines by mining thousands of scripts with program analysis, (2) uses dataset embeddings to find similar datasets in the database based on its content instead of metadata-based features, (3) models AutoML pipeline creation as a graph generation problem, to succinctly characterize the diverse pipelines seen for a single dataset. KGpip's meta-learning is a sub-component for AutoML systems. We demonstrate this by integrating KGpip with two AutoML systems. Our comprehensive evaluation using 126 datasets, including those used by the state-of-the-art systems, shows that KGpip significantly outperforms these systems.
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v2 [stat.ML] UPDATED)
    Traditional ways for handling missing values are not designed for the clustering purpose and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available at https://github.com/AudeSportisse/Clustering-MNAR. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that the statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations for the proposed sub-models on synthetic data and we illustrate the relevance of our method on a medical register, the TraumaBase® dataset.
    Shape complexity in cluster analysis. (arXiv:2205.08046v2 [cs.LG] UPDATED)
    In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
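    The preprocessing step that the abstract calls the workhorse, dividing each dimension by its standard deviation, can be sketched as follows. This shows only that conventional baseline, not the shape-complexity method; the function name, the use of the population standard deviation, and the toy data are assumptions.

```python
import statistics


def scale_by_std(rows):
    """Divide every column by its (population) standard deviation.

    Columns with zero spread are left unchanged to avoid division by zero.
    """
    columns = list(zip(*rows))
    stds = [statistics.pstdev(col) for col in columns]
    return [[x / s if s > 0 else x for x, s in zip(row, stds)] for row in rows]


# Two features on very different scales: without scaling, the second one
# would dominate any Euclidean distance used by a method such as k-means.
rows = [[1.0, 100.0],
        [2.0, 300.0],
        [3.0, 500.0]]
scaled = scale_by_std(rows)
for col in zip(*scaled):
    print(statistics.pstdev(col))  # each column now has unit spread (up to rounding)
```

    Any alternative set of per-dimension scaling factors would slot into the same place as `stds` here, which is the sense in which such factors are applied "prior to clustering".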
    Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. (arXiv:2107.08661v5 [cs.CL] UPDATED)
    We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them together. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches that of cascade systems. In addition, we propose a simple method for preserving speakers' voices from the source speech to the translation speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice on speaker turns without requiring speaker segmentation. Furthermore, compared to existing approaches, it better preserves speakers' privacy and mitigates potential misuse of voice cloning for creating spoofing audio artifacts.
    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. (arXiv:2112.06868v2 [cs.LG] UPDATED)
    Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders, the conjecture is true (that is, VAE training does recover a generator with support equal to the ground truth manifold), and that it does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v3 [stat.ME] UPDATED)
    Recent advances in probabilistic deep learning enable amortized Bayesian inference in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations represent reality somewhat inaccurately? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of SNPE-C (APT) and the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the learned latent data summary space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference undermining the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase when the distance between the latent summary distributions of the true data-generating process and the training simulations grows. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized simulation-based Bayesian inference.
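As a rough illustration of the maximum mean discrepancy (MMD) used above as a misspecification signal, the sketch below computes a biased estimate of squared MMD between two sample sets. The Gaussian kernel, the bandwidth of 1.0, and the toy Gaussian data are assumptions made for this example; the abstract applies MMD to learned latent summaries, not to raw samples as done here.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Toy data (assumption): a reference sample, a sample from the same
# distribution, and a sample from a shifted distribution.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))
well_specified = rng.normal(size=(500, 2))
misspecified = rng.normal(loc=1.5, size=(500, 2))

mmd_same = mmd_squared(reference, well_specified)
mmd_shift = mmd_squared(reference, misspecified)
```

In this toy setup the estimate for the shifted sample is much larger than for the matching sample, which is the kind of gap a detection threshold would exploit.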
    A label efficient two-sample test. (arXiv:2111.08861v3 [cs.LG] UPDATED)
    Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
    Leveraging Global Binary Masks for Structure Segmentation in Medical Images. (arXiv:2205.09107v1 [eess.IV])
    Deep learning (DL) models for medical image segmentation are highly influenced by intensity variations of input images and lack generalization due to primarily utilizing pixels' intensity information for inference. Acquiring sufficient training data is another challenge limiting models' applications. We proposed to leverage the consistency of organs' anatomical shape and position information in medical images. We introduced a framework leveraging recurring anatomical patterns through global binary masks for organ segmentation. Two scenarios were studied. 1) Global binary masks were the model's (i.e. U-Net's) only input, forcing it to exclusively encode organs' position and shape information for segmentation/localization. 2) Global binary masks were incorporated as an additional channel functioning as position/shape clues to mitigate training data scarcity. Two datasets of brain and heart CT images with their ground truth were split into (26:10:10) and (12:3:5) for training, validation, and test respectively. Training exclusively on global binary masks led to Dice scores of 0.77(0.06) and 0.85(0.04), with average Euclidean distances of 3.12(1.43)mm and 2.5(0.93)mm relative to the center of mass of the ground truth for the brain and heart structures respectively. The outcomes indicate that a surprising degree of position and shape information is encoded through global binary masks. Incorporating global binary masks led to significantly higher accuracy relative to the model trained on only CT images in small subsets of training data; the performance improved by 4.3-125.3% and 1.3-48.1% for 1-8 training cases of the brain and heart datasets respectively. The findings imply the advantages of utilizing global binary masks for building generalizable models and compensating for training data scarcity.
    Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks. (arXiv:2004.12551v2 [cs.LG] UPDATED)
    Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.
    Detecting micro fractures: A comprehensive comparison of conventional and machine-learning based segmentation methods. (arXiv:2103.12821v2 [cs.LG] UPDATED)
    Studying porous rock materials with X-Ray Computed Tomography (XRCT) has been established as a standard procedure for the non-destructive visualization of flow and transport in opaque porous media. Despite the recent advances in the field of XRCT, some challenges still remain due to the inherent noise and imaging artefacts in the produced data. These issues become even more profound when the objective is the identification of fractures, and/or fracture networks. The challenge is the limited contrast between the regions of interest and the neighboring areas. This limited contrast can mostly be attributed to the minute aperture of the fractures. In order to overcome this challenge, it has been a common approach to apply digital image processing, such as filtering, to enhance the signal-to-noise ratio. Additionally, segmentation methods based on threshold-/morphology schemes can be employed to obtain enhanced information from the features of interest. However, this workflow needs a skillful operator to fine-tune its input parameters, and the required computation time significantly increases due to the complexity of the available methods, and the large volume of the data-set. In this study, based on a data-set produced by the successful visualization of a fracture network in Carrara marble with XRCT, we present the segmentation results from a number of segmentation methods. Three conventional and two machine-learning-based methods are evaluated. The segmentation results from all five methods are compared to each other in terms of segmentation quality and time efficiency. Due to memory limitations, and in order to accomplish a fair comparison, all the methods are employed in a 2D scheme. The output of the 2D U-net model, which is one of the adopted machine-learning-based segmentation methods, shows the best performance regarding the quality of segmentation and the required processing time.
    Finite-Bit Quantization For Distributed Algorithms With Linear Convergence. (arXiv:2107.11304v3 [math.OC] UPDATED)
    This paper studies distributed algorithms for (strongly convex) composite optimization problems over mesh networks, subject to quantized communications. Instead of focusing on a specific algorithmic design, a black-box model is proposed, casting linearly convergent distributed algorithms in the form of fixed-point iterates. The algorithmic model is equipped with a novel random or deterministic Biased Compression (BC) rule on the quantizer design, and a new Adaptive encoding Nonuniform Quantizer (ANQ) coupled with a communication-efficient encoding scheme, which implements the BC-rule using a finite number of bits (below machine precision). This fills a gap existing in most state-of-the-art quantization schemes, such as those based on the popular compression rule, which rely on communication of some scalar signals with negligible quantization error (in practice quantized at the machine precision). A unified communication complexity analysis is developed for the black-box model, determining the average number of bits required to reach a solution of the optimization problem within a target accuracy. It is shown that the proposed BC-rule preserves linear convergence of the unquantized algorithms, and a trade-off between convergence rate and communication cost under ANQ-based quantization is characterized. Numerical results validate our theoretical findings and show that distributed algorithms equipped with the proposed ANQ have more favorable communication cost than algorithms using state-of-the-art quantization rules.
    Optimizing Operating Points for High Performance Lesion Detection and Segmentation Using Lesion Size Reweighting. (arXiv:2107.12978v2 [eess.IV] UPDATED)
    There are many clinical contexts which require accurate detection and segmentation of all focal pathologies (e.g. lesions, tumours) in patient images. In cases where there is a mix of small and large lesions, standard binary cross entropy loss will result in better segmentation of large lesions at the expense of missing small ones. Adjusting the operating point to accurately detect all lesions generally leads to oversegmentation of large lesions. In this work, we propose a novel reweighting strategy to eliminate this performance gap, increasing small pathology detection performance while maintaining segmentation accuracy. We show that our reweighting strategy vastly outperforms competing strategies based on experiments on a large scale, multi-scanner, multi-center dataset of Multiple Sclerosis patient images.
    Testing the Robustness of a BiLSTM-based Structural Story Classifier. (arXiv:2201.02733v2 [cs.CL] UPDATED)
    The growing prevalence of counterfeit stories on the internet has fostered significant interest towards fast and scalable detection of fake news in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe that there is a need to evaluate the impact of noise on these techniques' performance, where noise consists of news articles being mistakenly labeled as fake (or real). This work takes a step in that direction, where we examine the impact of noise on a state-of-the-art, structural model based on BiLSTM (Bidirectional Long Short-Term Memory) for fake news detection, Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (Reference no. 9).
    Linear Speedup in Personalized Collaborative Learning. (arXiv:2111.05968v3 [cs.LG] UPDATED)
    Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as a stochastic optimization of a task $0$ while given access to $N$ related but different tasks $1,\dots, N$. We give convergence guarantees for two algorithms in this setting -- a popular collaboration method known as \emph{weighted gradient averaging}, and a novel \emph{bias correction} method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks $N$. Further, we also empirically study their performance confirming our theoretical insights.
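To make the bias/variance trade-off in weighted gradient averaging concrete, here is a minimal sketch on toy quadratic tasks. The mixing weight `alpha`, the quadratic losses, and the noise-free gradients are assumptions made for illustration; the paper's exact weighting and stochastic setting may differ.

```python
import numpy as np

def weighted_gradient(grad_target, grads_aux, alpha):
    """Mix the target-task gradient with the average auxiliary-task gradient.

    alpha = 0 uses only task 0 (no bias from other tasks);
    alpha = 1 uses only the auxiliary tasks (maximal bias).
    """
    return (1.0 - alpha) * grad_target + alpha * np.mean(grads_aux, axis=0)

# Toy tasks (assumption): f_i(x) = 0.5 * ||x - centers[i]||^2, so grad f_i(x) = x - centers[i].
# Row 0 is the target task; rows 1..N are the auxiliary tasks.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha, lr = 0.5, 0.1

x = np.array([5.0, -5.0])
for _ in range(300):
    grads = x - centers  # row i holds the gradient of task i at x
    x = x - lr * weighted_gradient(grads[0], grads[1:], alpha)

# With exact gradients the iterates converge to the alpha-weighted mix of optima,
# which is biased away from the task-0 optimum centers[0] whenever alpha > 0.
expected = (1.0 - alpha) * centers[0] + alpha * centers[1:].mean(axis=0)
```

With noisy gradients, averaging over the auxiliary tasks would reduce variance at the cost of this bias, which is the trade-off the abstract describes.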
    Learning Selective Sensor Fusion for States Estimation. (arXiv:1912.13077v2 [cs.CV] UPDATED)
    Autonomous vehicles and mobile robotic systems are typically equipped with multiple sensors to provide redundancy. By integrating the observations from different sensors, these mobile agents are able to perceive the environment and estimate system states, e.g. locations and orientations. Although deep learning approaches for multimodal odometry estimation and localization have gained traction, they rarely focus on the issue of robust sensor fusion - a necessary consideration to deal with noisy or incomplete sensor observations in the real world. Moreover, current deep odometry models suffer from a lack of interpretability. To this end, we propose SelectFusion, an end-to-end selective sensor fusion module which can be applied to useful pairs of sensor modalities such as monocular images and inertial measurements, depth images and LIDAR point clouds. Our model is a uniform framework that is not restricted to a specific modality or task. During prediction, the network is able to assess the reliability of the latent features from different sensor modalities and estimate trajectory both at scale and global pose. In particular, we propose two fusion modules - a deterministic soft fusion and a stochastic hard fusion, and offer a comprehensive study of the new strategies compared to trivial direct fusion. We extensively evaluate all fusion strategies on both public datasets and progressively degraded datasets that present synthetic occlusions, noisy and missing data and time misalignment between sensors, and we investigate the effectiveness of the different fusion strategies in attending to the most reliable features, which in itself provides insights into the operation of the various models.
    Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. (arXiv:2005.09147v8 [cs.CV] UPDATED)
    Deep neural networks (DNNs) are vulnerable to adversarial noises. By adding adversarial noises to training samples, adversarial training can improve the model's robustness against adversarial noises. However, adversarial training samples with excessive noises can harm standard accuracy, which may be unacceptable for many medical image analysis applications. This issue has been termed the trade-off between standard accuracy and adversarial robustness. In this paper, we hypothesize that this issue may be alleviated if the adversarial samples for training are placed right on the decision boundaries. Based on this hypothesis, we design an adaptive adversarial training method, named IMA. For each individual training sample, IMA makes a sample-wise estimation of the upper bound of the adversarial perturbation. In the training process, each of the sample-wise adversarial perturbations is gradually increased to match the margin. Once an equilibrium state is reached, the adversarial perturbations will stop increasing. IMA is evaluated on publicly available datasets under two popular adversarial attacks, PGD and IFGSM. The results show that: (1) IMA significantly improves adversarial robustness of DNN classifiers, which achieves the state-of-the-art performance; (2) IMA has a minimal reduction in clean accuracy among all competing defense methods; (3) IMA can be applied to pretrained models to reduce time cost; (4) IMA can be applied to the state-of-the-art medical image segmentation networks, with outstanding performance. We hope our work may help to lift the trade-off between adversarial robustness and clean accuracy and facilitate the development of robust applications in the medical field. The source code will be released when this paper is published.
    Physics-informed Guided Disentanglement in Generative Networks. (arXiv:2107.14229v2 [cs.CV] UPDATED)
    Image-to-image translation (i2i) networks suffer from entanglement effects in the presence of physics-related phenomena in the target domain (such as occlusions, fog, etc.), lowering the translation quality, controllability and variability altogether. In this paper, we build upon a collection of simple physics models and present a comprehensive method for disentangling visual traits in target images, guiding the process with a physical model that renders some of the target traits, and learning the remaining ones. Because they allow explicit and interpretable outputs, our physical models (optimally regressed on target) allow generating unseen scenarios in a controllable manner. We also extend our framework, showing versatility to neural-guided disentanglement. The results show our disentanglement strategies dramatically increase performance qualitatively and quantitatively in several challenging scenarios for image translation.
    Link Scheduling using Graph Neural Networks. (arXiv:2109.05536v2 [eess.SP] UPDATED)
    Efficient scheduling of transmissions is a key problem in wireless networks. The main challenge stems from the fact that optimal link scheduling involves solving a maximum weighted independent set (MWIS) problem, which is known to be NP-hard. In practical schedulers, centralized and distributed greedy heuristics are commonly used to approximately solve the MWIS problem. However, these greedy heuristics mostly ignore important topological information of the wireless network. To overcome this limitation, we propose fast heuristics based on graph convolutional networks (GCNs) that can be implemented in centralized and distributed manners. Our centralized heuristic is based on tree search guided by a GCN and 1-step rollout. In our distributed MWIS solver, a GCN generates topology-aware node embeddings that are combined with per-link utilities before invoking a distributed greedy solver. Moreover, a novel reinforcement learning scheme is developed to train the GCN in a non-differentiable pipeline. Test results on medium-sized wireless networks show that our centralized heuristic can reach a near-optimal solution quickly, and our distributed heuristic based on a shallow GCN can reduce by nearly half the suboptimality gap of the distributed greedy solver with minimal increase in complexity. The proposed schedulers also exhibit good generalizability across graph and weight distributions.
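For context, the kind of centralized greedy heuristic that the abstract takes as a baseline can be sketched as follows: repeatedly pick the heaviest remaining link and discard its conflicting neighbours. The graph encoding (a dict of neighbour sets), the tie-breaking rule, and the toy conflict graph are assumptions for this example, and the GCN-based components of the paper are not modelled here.

```python
def greedy_mwis(weights, adjacency):
    """Greedy heuristic for the maximum weighted independent set problem.

    weights:   dict mapping node -> non-negative weight (e.g. per-link utility)
    adjacency: dict mapping node -> set of conflicting (neighbouring) nodes
    Returns the selected nodes in the order they were picked.
    """
    remaining = set(weights)
    selected = []
    while remaining:
        # Heaviest remaining node; ties are broken in favour of the smallest node id.
        best = max(sorted(remaining), key=weights.get)
        selected.append(best)
        # The chosen node and all of its neighbours can no longer be scheduled.
        remaining -= adjacency[best] | {best}
    return selected

# Toy conflict graph (assumption): a path 0 - 1 - 2 - 3.
weights = {0: 3, 1: 4, 2: 3, 3: 1}
adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

schedule = greedy_mwis(weights, adjacency)
total_weight = sum(weights[v] for v in schedule)
```

On this path graph the greedy rule picks nodes 1 and 3 for a total weight of 5, while the independent set {0, 2} has weight 6, a small instance of the suboptimality gap that topology-aware heuristics aim to reduce.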
    PocketNet: A Smaller Neural Network for Medical Image Analysis. (arXiv:2104.10745v3 [eess.IV] UPDATED)
    Medical imaging deep learning models are often large and complex, requiring specialized hardware to train and evaluate. To address such issues, we propose the PocketNet paradigm to reduce the size of deep learning models by throttling the growth of the number of channels in convolutional neural networks. We demonstrate that, for a range of segmentation and classification tasks, PocketNet architectures produce results comparable to those of conventional neural networks while reducing the number of parameters by multiple orders of magnitude, using up to 90% less GPU memory, and speeding up training times by up to 40%, thereby allowing such models to be trained and deployed in resource-constrained settings.
    Cohort Bias Adaptation in Aggregated Datasets for Lesion Segmentation. (arXiv:2108.00713v2 [eess.IV] UPDATED)
    Many automatic machine learning models developed for focal pathology (e.g. lesions, tumours) detection and segmentation perform well, but do not generalize as well to new patient cohorts, impeding their widespread adoption into real clinical contexts. One strategy to create a more diverse, generalizable training set is to naively pool datasets from different cohorts. Surprisingly, training on this \emph{big data} does not necessarily increase, and may even reduce, overall performance and model generalizability, due to the existence of cohort biases that affect label distributions. In this paper, we propose a generalized affine conditioning framework to learn and account for cohort biases across multi-source datasets, which we call Source-Conditioned Instance Normalization (SCIN). Through extensive experimentation on three different, large scale, multi-scanner, multi-centre Multiple Sclerosis (MS) clinical trial MRI datasets, we show that our cohort bias adaptation method (1) improves performance of the network on pooled datasets relative to naively pooling datasets and (2) can quickly adapt to a new cohort by fine-tuning the instance normalization parameters, thus learning the new cohort bias with only 10 labelled samples.
    Assisted Learning for Organizations with Limited Data. (arXiv:2109.09307v3 [cs.LG] UPDATED)
    We develop an assisted learning framework that helps organization-level learners improve their learning performance with limited and imbalanced data. In particular, learners at the organization level usually have sufficient computational resources, but are subject to stringent collaboration policies and information privacy requirements. Their limited, imbalanced data often cause biased inference and sub-optimal decision-making. In our assisted learning framework, an organizational learner purchases assistance service from a service provider and aims to enhance its model performance within a few assistance rounds. We develop effective stochastic training algorithms for assisted deep learning and assisted reinforcement learning. Different from existing distributed algorithms that need to frequently transmit gradients or models, our framework allows the learner to only occasionally share information with the service provider, and still achieve a near-oracle model as if all the data were centralized.
    Masked Autoencoders As Spatiotemporal Learners. (arXiv:2205.09113v1 [cs.CV])
    This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
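A minimal sketch of spacetime-agnostic random masking at a 90% ratio is given below: all patch tokens of a clip are shuffled together, with no special treatment of the time axis, and only a small subset is kept for the encoder. The token grid of 8 x 14 x 14 and the function name are assumptions chosen for illustration and do not come from the abstract.

```python
import numpy as np

def random_spacetime_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Sample a spacetime-agnostic random mask over video patch tokens.

    Returns (keep_idx, mask) where keep_idx holds the flat indices of visible
    tokens and mask is a boolean array of shape (num_frames, grid_h, grid_w)
    that is True for masked tokens.
    """
    rng = np.random.default_rng(seed)
    num_tokens = num_frames * grid_h * grid_w
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    # One permutation over all tokens: space and time are treated alike.
    keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask.reshape(num_frames, grid_h, grid_w)

# Hypothetical token grid: 8 temporal positions, 14 x 14 spatial patches.
keep_idx, mask = random_spacetime_mask(num_frames=8, grid_h=14, grid_w=14)
visible_fraction = keep_idx.size / mask.size
```

Because the encoder only processes the visible tokens (about 10% here), the compute per clip drops sharply, which is consistent with the speedup the abstract reports for high masking ratios.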
    Efficient PAC Reinforcement Learning in Regular Decision Processes. (arXiv:2105.06784v3 [cs.AI] UPDATED)
    Recently regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and it reasonably captures the difficulty of a regular decision process.
    Phy-Q: A Testbed for Physical Reasoning. (arXiv:2108.13696v2 [cs.AI] UPDATED)
    Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. For each scenario, we create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human level Phy-Q score. Website: https://github.com/phy-q/benchmark
    GSN: A Graph Neural Network Inspired by Spring Network. (arXiv:2201.12994v3 [cs.LG] UPDATED)
    The design of Graph Neural Networks (GNNs) that operate on both homophilous and heterophilous graphs has received research attention in recent years. Existing heterophilous GNNs, particularly those designed in the spatial domain, lack a convincing theoretical or physical motivation. Inspired by an old-fashioned spring network model, we propose the Graph Spring Network (GSN), a universal GNN model that works for homophilous and heterophilous graphs. We show that the GSN framework can interpret many GNN models from the perspective of potential energy minimization of a spring network with respect to various metrics, which entrusts strong physical motivations to these models. We also conduct experiments to demonstrate the performance of our GSN model on real-world datasets.
    Single-Shot Optical Neural Network. (arXiv:2205.09103v1 [cs.ET])
    As deep neural networks (DNNs) grow to solve increasingly complex problems, they are becoming limited by the latency and power consumption of existing digital processors. 'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by DNNs by eliminating expensive weight updates; however, with scalability limited to an input vector length $K$ of hundreds of elements. Here, we present a scalable, single-shot-per-layer weight-stationary optical processor that leverages the advantages of free-space optics for passive optical copying and large-scale distribution of an input vector and integrated optoelectronics for static, reconfigurable weighting and the nonlinearity. We propose an optimized near-term CMOS-compatible system with $K = 1,000$ and beyond, and we calculate its theoretical total latency ($\sim$10 ns), energy consumption ($\sim$10 fJ/MAC) and throughput ($\sim$petaMAC/s) per layer. We also experimentally test DNN classification accuracy with single-shot analog optical encoding, copying and weighting of the MNIST handwritten digit dataset in a proof-of-concept system, achieving 94.7% (similar to the ground truth accuracy of 96.3%) without retraining on the hardware or data preprocessing. Lastly, we determine the upper bound on throughput of our system ($\sim$0.9 exaMAC/s), set by the maximum optical bandwidth before significant loss of accuracy. This joint use of wide spectral and spatial bandwidths enables highly efficient computing for next-generation DNNs.
    Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement. (arXiv:1810.09103v3 [cs.LG] UPDATED)
    Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy, that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, that uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy-regularization.
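The top-percentile idea can be illustrated with a plain cross-entropy method update for a single state, shown below. This is not the conditional CEM of the abstract: there is no separate proposal policy and no state conditioning, and the Gaussian policy, the toy action-value function `toy_q`, and all hyperparameters are assumptions.

```python
import numpy as np

def cem_step(mean, std, q_values, rng, num_samples=64, top_frac=0.2):
    """One cross-entropy method update of a 1-D Gaussian policy for a fixed state.

    Samples actions, keeps the top fraction by action-value, and refits the
    Gaussian to those elite actions by maximum likelihood.
    """
    actions = rng.normal(mean, std, size=num_samples)
    num_elite = max(1, int(num_samples * top_frac))
    elite = actions[np.argsort(q_values(actions))[-num_elite:]]
    # A small floor on the standard deviation keeps sampling well defined.
    return float(elite.mean()), float(elite.std()) + 1e-3

def toy_q(actions):
    """Hypothetical action-value function with its maximum at action = 2."""
    return -(actions - 2.0) ** 2

rng = np.random.default_rng(0)
mean, std = 0.0, 2.0  # start from a broad policy
for _ in range(30):
    mean, std = cem_step(mean, std, toy_q, rng)
```

Starting from a broad policy, repeated updates concentrate the Gaussian around the maximizing action, mirroring the "start broad, slowly concentrate" behaviour described above.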
    On the Efficiency of Entropic Regularized Algorithms for Optimal Transport. (arXiv:1906.01437v9 [cs.DS] UPDATED)
    We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ previously proved for APDAGD is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn by appealing to estimate sequences and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.
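For readers unfamiliar with the baseline, the standard Sinkhorn iteration for entropic regularized OT can be sketched as below. This is the classical alternating row/column scaling only, not Greenkhorn, APDAMD, or the accelerated variant from the abstract; the regularization strength `reg`, the iteration count, and the toy marginals are assumptions.

```python
import numpy as np

def sinkhorn(cost, r, c, reg=0.5, num_iters=200):
    """Entropic regularized OT via Sinkhorn's alternating scaling.

    cost: (n, m) cost matrix; r: (n,) source marginal; c: (m,) target marginal.
    Returns a transport plan whose marginals approximately match r and c.
    """
    kernel = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(num_iters):
        v = c / (kernel.T @ u)            # rescale to match column marginals
        u = r / (kernel @ v)              # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

# Toy problem (assumption): three atoms on a line with squared-distance cost.
points = np.array([0.0, 0.5, 1.0])
cost = (points[:, None] - points[None, :]) ** 2
r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.3, 0.5])

plan = sinkhorn(cost, r, c)
transport_cost = float(np.sum(plan * cost))
```

Each iteration updates all rows and then all columns; Greenkhorn instead greedily updates a single row or column at a time, which is where the per-update comparison in the abstract comes from.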
    Representation Learning for Content-Sensitive Anomaly Detection in Industrial Networks. (arXiv:2205.08953v1 [cs.LG])
    Using a convGRU-based autoencoder, this thesis proposes a framework to learn spatial-temporal aspects of raw network traffic in an unsupervised and protocol-agnostic manner. The learned representations are used to measure the effect on the results of a subsequent anomaly detection and are compared to the application without the extracted features. The evaluation showed that the anomaly detection could not effectively be enhanced when applied on compressed traffic fragments for the context of network intrusion detection. Yet, the trained autoencoder successfully generates a compressed representation (code) of the network traffic, which holds spatial and temporal information. Based on the model's residual loss, the autoencoder is also capable of detecting anomalies by itself. Lastly, an approach for a kind of model interpretability (LRP) was investigated in order to identify relevant areas within the raw input data, which is used to enrich alerts generated by an anomaly detection method.
    Meta-Learning Sparse Compression Networks. (arXiv:2205.08957v1 [stat.ML])
    Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks, this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing for sparsification to be employed in the inner loop of commonly used Meta-Learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
    A weakly supervised framework for high-resolution crop yield forecasts. (arXiv:2205.09016v1 [cs.LG])
    Predictor inputs and label data for crop yield forecasting are not always available at the same spatial resolution. We propose a deep learning framework that uses high resolution inputs and low resolution labels to produce crop yield forecasts for both spatial levels. The forecasting model is calibrated by weak supervision from low resolution crop area and yield statistics. We evaluated the framework by disaggregating regional yields in Europe from parent statistical regions to sub-regions for five countries (Germany, Spain, France, Hungary, Italy) and two crops (soft wheat and potatoes). Performance of weakly supervised models was compared with linear trend models and Gradient-Boosted Decision Trees (GBDT). Higher resolution crop yield forecasts are useful to policymakers and other stakeholders. Weakly supervised deep learning methods provide a way to produce such forecasts even in the absence of high resolution yield data.
    Medical Deep Learning -- A systematic Meta-Review. (arXiv:2010.14881v5 [eess.IV] UPDATED)
    Deep learning (DL) has remarkably impacted several different scientific disciplines over the last few years. E.g., in image processing and analysis, DL algorithms were able to outperform other cutting-edge methods. Additionally, DL has delivered state-of-the-art results in tasks like autonomous driving, outclassing previous attempts. There are even instances where DL outperformed humans, for example with object recognition and gaming. DL is also showing vast potential in the medical domain. With the collection of large quantities of patient records and data, and a trend towards personalized treatments, there is a great need for automated and reliable processing and analysis of health information. Patient data is not only collected in clinical centers, like hospitals and private practices, but also by mobile healthcare apps or online websites. The abundance of collected patient data and the recent growth in the DL field has resulted in a large increase in research efforts. In Q2/2020, the search engine PubMed returned already over 11,000 results for the search term 'deep learning', and around 90% of these publications are from the last three years. However, even though PubMed represents the largest search engine in the medical field, it does not cover all medical-related publications. Hence, a complete overview of the field of 'medical deep learning' is almost impossible to obtain and acquiring a full overview of medical sub-fields is becoming increasingly more difficult. Nevertheless, several review and survey articles about medical DL have been published within the last few years. They focus, in general, on specific medical scenarios, like the analysis of medical images containing specific pathologies. With these surveys as a foundation, the aim of this article is to provide the first high-level, systematic meta-review of medical DL surveys.
    The Kernelized Taylor Diagram. (arXiv:2205.08864v1 [stat.ML])
    This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. The kernelized Taylor diagram builds on the widely used Taylor diagram, which is used to visualize similarities between populations. However, the Taylor diagram has several limitations, such as not capturing non-linear relationships and sensitivity to outliers. To address such limitations, we propose the kernelized Taylor diagram. Our proposed kernelized Taylor diagram is capable of visualizing similarities between populations with minimal assumptions on the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.
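    The maximum mean discrepancy mentioned above is the distance between the kernel mean embeddings of two samples. A minimal sketch of the standard biased (V-statistic) estimate of the squared MMD with a Gaussian kernel follows; it does not reproduce the paper's diagram construction, and the names `rbf`, `mmd2`, and the bandwidth `sigma` are illustrative assumptions.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y.

    Equals the squared RKHS distance between the empirical kernel
    mean embeddings: mean k(x,x') + mean k(y,y') - 2 * mean k(x,y).
    """
    kxx = sum(rbf(a, b, sigma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, sigma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, sigma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy
```

    The estimate is zero when both samples are identical and grows as the populations separate, which is the kind of similarity a diagram built on it would display.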
    FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting. (arXiv:2205.08897v1 [cs.LG])
    Recent studies have shown the promising performance of deep learning models (e.g., RNN and Transformer) for long-term time series forecasting. These studies mostly focus on designing deep models to effectively combine historical information for long-term forecasting. However, the question of how to effectively represent historical information for long-term forecasting has not received enough attention, limiting our capacity to exploit powerful deep learning models. The main challenge in time series representation is how to handle the dilemma between accurately preserving historical information and reducing the impact of noisy signals in the past. To this end, we design a Frequency improved Legendre Memory model, or FiLM for short: it introduces Legendre Polynomial projections to preserve historical information accurately and Fourier projections plus low-rank approximation to remove noisy signals. Our empirical studies show that the proposed FiLM improves the accuracy of state-of-the-art models by a significant margin (19.2%, 22.6%) in multivariate and univariate long-term forecasting, respectively. In addition, dimensionality reduction introduced by low-rank approximation leads to a dramatic improvement in computational efficiency. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the performance of most deep learning modules for long-term forecasting. Code will be released soon.
    Fast Neural Network based Solving of Partial Differential Equations. (arXiv:2205.08978v1 [cs.LG])
    We present a novel method for using Neural Networks (NNs) to find solutions to a class of Partial Differential Equations (PDEs). Our method builds on recent advances in Neural Radiance Field (NeRF) research and allows an NN to converge to a PDE solution much faster than classic Physics-Informed Neural Network (PINN) approaches.
    Learning Shared Kernel Models: the Shared Kernel EM algorithm. (arXiv:2205.09041v1 [cs.LG])
    Expectation maximisation (EM) is an unsupervised learning method for estimating the parameters of a finite mixture distribution. It works by introducing "hidden" or "latent" variables via Baum's auxiliary function $Q$ that allow the joint data likelihood to be expressed as a product of simple factors. The relevance of EM has increased since the introduction of the variational lower bound (VLB): the VLB differs from Baum's auxiliary function only by the entropy of the PDF of the latent variables $Z$. We first present a rederivation of the standard EM algorithm using data association ideas from the field of multiple target tracking, using $K$-valued scalar data association hypotheses rather than the usual binary indicator vectors. The same method is then applied to a little known but much more general type of supervised EM algorithm for shared kernel models, related to probabilistic radial basis function networks. We address a number of shortcomings in the derivations that have been published previously in this area. In particular, we give theoretically rigorous derivations of (i) the complete data likelihood; (ii) Baum's auxiliary function (the E-step) and (iii) the maximisation (M-step) in the case of Gaussian shared kernel models. The subsequent algorithm, called shared kernel EM (SKEM), is then applied to a digit recognition problem using a novel 7-segment digit representation. Variants of the algorithm that use different numbers of features and different EM algorithm dimensions are compared in terms of mean accuracy and mean IoU. A simplified classifier is proposed that decomposes the joint data PDF as a product of lower order PDFs over non-overlapping subsets of variables. The effect of different numbers of assumed mixture components $K$ is also investigated. High-level source code for the data generation and SKEM algorithm is provided.
    Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data. (arXiv:2205.09060v1 [cs.LG])
    Not all real-world data are labeled, and when labels are not available, it is often costly to obtain them. Moreover, as many algorithms suffer from the curse of dimensionality, reducing the features in the data to a smaller set is often of great utility. Unsupervised feature selection aims to reduce the number of features, often using feature importance scores to quantify the relevancy of single features to the task at hand. These scores can be based only on the distribution of variables and the quantification of their interactions. The previous literature, mainly investigating anomaly detection and clusters, fails to address the redundancy-elimination issue. We propose an evaluation of correlations among features to compute feature importance scores representing the contribution of single features in explaining the dataset's structure. Based on Coalitional Game Theory, our feature importance scores include a notion of redundancy awareness, making them a tool to achieve redundancy-free feature selection. We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate while maximizing the information contained in the data. We also introduce an approximated version of the algorithm to reduce the complexity of the Shapley value computations.
    Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. (arXiv:2205.09029v1 [stat.ML])
    Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
    Moderately Supervised Learning: Definition, Framework and Generality. (arXiv:2008.11945v3 [cs.CV] UPDATED)
    Learning with supervision has achieved remarkable success in numerous artificial intelligence (AI) applications. In the current literature, by referring to the properties of the labels prepared for the training data set, learning with supervision is categorized as supervised learning (SL) and weakly supervised learning (WSL). SL concerns the situation where the training data set is assigned ideal labels, while WSL concerns the situation where the training data set is assigned non-ideal labels. However, without considering the properties of the transformation from the given labels to learnable targets, the definition of SL is relatively abstract, which conceals some details that can be critical to building appropriate solutions for specific SL tasks. Thus, it is desirable to reveal these details more concretely. This article attempts to achieve this goal by expanding the categorization of SL and investigating the sub-type that plays the central role in SL. More specifically, taking into consideration the properties of the transformation from the given labels to learnable targets, we first categorize SL into three narrower sub-types. Then we focus on the moderately supervised learning (MSL) sub-type, which concerns the situation where the given labels are ideal but, due to the simplicity of the annotation, careful designs are required to transform the given labels into learnable targets. From the perspectives of definition, framework and generality, we comprehensively illustrate MSL and reveal what details are concealed by the abstractness of the definition of SL. Meanwhile, the presentation of this paper as a whole also serves as a tutorial for AI application engineers on viewing a problem to be solved from a mathematician's perspective.
    FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning. (arXiv:2107.04271v5 [cs.DC] UPDATED)
    Applying Federated Learning (FL) on Internet-of-Things devices is necessitated by the large volumes of data they produce and growing concerns of data privacy. However, there are three challenges that need to be addressed to make FL efficient: (i) execution on devices with limited computational capabilities, (ii) accounting for stragglers due to computational heterogeneity of devices, and (iii) adaptation to the changing network bandwidths. This paper presents FedAdapt, an adaptive offloading FL framework to mitigate the aforementioned challenges. FedAdapt accelerates local training in computationally constrained devices by leveraging layer offloading of deep neural networks (DNNs) to servers. Further, FedAdapt adopts reinforcement learning based optimization and clustering to adaptively identify which layers of the DNN should be offloaded for each individual device on to a server to tackle the challenges of computational heterogeneity and changing network bandwidth. Experimental studies are carried out on a lab-based testbed and it is demonstrated that by offloading a DNN from the device to the server FedAdapt reduces the training time of a typical IoT device by over half compared to classic FL. The training time of extreme stragglers and the overall training time can be reduced by up to 57%. Furthermore, with changing network bandwidth, FedAdapt is demonstrated to reduce the training time by up to 40% when compared to classic FL, without sacrificing accuracy.
    SOUL: An Energy-Efficient Unsupervised Online Learning Seizure Detection Classifier. (arXiv:2110.02169v2 [eess.SP] UPDATED)
    Implantable devices that record neural activity and detect seizures have been adopted to issue warnings or trigger neurostimulation to suppress epileptic seizures. Typical seizure detection systems rely on high-accuracy offline-trained machine learning classifiers that require manual retraining when seizure patterns change over long periods of time. For an implantable seizure detection system, a low power, at-the-edge, online learning algorithm can be employed to dynamically adapt to the neural signal drifts, thereby maintaining high accuracy without external intervention. This work proposes SOUL: Stochastic-gradient-descent-based Online Unsupervised Logistic regression classifier. After an initial offline training phase, continuous online unsupervised classifier updates are applied in situ, which improves sensitivity in patients with drifting seizure features. SOUL was tested on two human electroencephalography (EEG) datasets: the CHB-MIT scalp EEG dataset, and a long (>100 hours) NeuroVista intracranial EEG dataset. It was able to achieve an average sensitivity of 97.5% and 97.9% for the two datasets respectively, at >95% specificity. Sensitivity improved by at most 8.2% on long-term data when compared to a typical seizure detection classifier. SOUL was fabricated in TSMC's 28 nm process occupying 0.1 mm2 and achieves 1.5 nJ/classification energy efficiency, which is at least 24x more efficient than state-of-the-art.
    SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals. (arXiv:2004.09557v3 [cs.LG] UPDATED)
    Clinical settings are often characterized by abundant unlabelled data and limited labelled data. This is typically driven by the high burden placed on oracles (e.g., physicians) to provide annotations. One way to mitigate this burden is via active learning (AL), which involves the (a) acquisition and (b) annotation of informative unlabelled instances. Whereas previous work addresses either one of these elements independently, we propose an AL framework that addresses both. For acquisition, we propose Bayesian Active Learning by Consistency (BALC), a sub-framework which perturbs both instances and network parameters and quantifies changes in the network output probability distribution. For annotation, we propose SoQal, a sub-framework that dynamically determines whether, for each acquired unlabelled instance, to request a label from an oracle or to pseudo-label it instead. We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
    SoK: The Impact of Unlabelled Data in Cyberthreat Detection. (arXiv:2205.08944v1 [cs.CR])
    Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in recent years. A substantial research effort has been invested in the development of specialized algorithms for CTD tasks. From the operational perspective, however, the progress of ML-based CTD is hindered by the difficulty in obtaining the large sets of labelled data needed to train ML detectors. A potential solution to this problem is semisupervised learning (SsL) methods, which combine small labelled datasets with large amounts of unlabelled data. This paper aims to systematize existing work on SsL for CTD and, in particular, to understand the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for the evaluation of SsL methods, which elucidates the contribution of unlabelled data. We review the state of the art and observe that no previous work meets such requirements. To address this problem, we propose a framework for assessing the benefits of unlabelled data in SsL. We showcase an application of this framework by performing the first benchmark evaluation that highlights the tradeoffs of 9 existing SsL methods on 9 public datasets. Our findings verify that, in some cases, unlabelled data provides a small, but statistically significant, performance gain. This paper highlights that SsL in CTD has a lot of room for improvement, which should stimulate future research in this field.
    Entity Alignment with Reliable Path Reasoning and Relation-Aware Heterogeneous Graph Transformer. (arXiv:2205.08806v1 [cs.CL])
    Entity Alignment (EA), which aims to find entities with the same meaning across different Knowledge Graphs (KGs), has attracted widespread attention in both academia and industry. There are substantial multi-step relation paths between entities in KGs, indicating the semantic relations of entities. However, existing methods rarely consider path information because not all natural paths facilitate EA judgment. In this paper, we propose a more effective entity alignment framework, RPR-RHGT, which integrates relation and path structure information, as well as the heterogeneous information in KGs. Impressively, an initial reliable path reasoning algorithm is developed to generate paths favorable for the EA task from the relation structures of KGs, which is the first algorithm in the literature to successfully use unrestricted path information. In addition, to efficiently capture heterogeneous features in entity neighborhoods, a relation-aware heterogeneous graph transformer is designed to model the relation and path structures of KGs. Extensive experiments on three well-known datasets show that RPR-RHGT significantly outperforms 11 state-of-the-art methods, exceeding the best performing baseline by up to 8.62% on Hits@1. We also show that it performs better than the baselines on different training-set ratios and on harder datasets.
    Automating In-Network Machine Learning. (arXiv:2205.08824v1 [cs.NI])
    Using programmable network devices to aid in-network machine learning has been the focus of significant research. However, most of the research was of a limited scope, providing a proof of concept or describing a closed-source algorithm. To date, no general solution has been provided for mapping machine learning algorithms to programmable network devices. In this paper, we present Planter, an open-source, modular framework for mapping trained machine learning models to programmable devices. Planter supports a wide range of machine learning models, multiple targets and can be easily extended. The evaluation of Planter compares different mapping approaches, and demonstrates the feasibility, performance, and resource efficiency for applications such as anomaly detection, financial transactions, and quality of experience. The results show that Planter-based in-network machine learning algorithms can run at line rate, have a negligible effect on latency, coexist with standard switching functionality, and have no or minor accuracy trade-offs.
    DL4DS -- Deep Learning for empirical DownScaling. (arXiv:2205.08967v1 [cs.LG])
    A common task in Earth Sciences is to infer climate information at local and regional scales from global climate models. Dynamical downscaling requires running expensive numerical models at high resolution which can be prohibitive due to long model runtimes. On the other hand, statistical downscaling techniques present an alternative approach for learning links between the large- and local-scale climate in a more efficient way. A large number of deep neural network-based approaches for statistical downscaling have been proposed in recent years, mostly based on convolutional architectures developed for computer vision and super-resolution tasks. This paper presents DL4DS, Deep Learning for empirical DownScaling, a python library that implements a wide variety of state-of-the-art and novel algorithms for downscaling gridded Earth Science data with deep neural networks. DL4DS has been designed with the goal of providing a general framework for training convolutional neural networks with configurable architectures and learning strategies to facilitate the conduction of comparative and ablation studies in a robust way. We showcase the capabilities of DL4DS on air quality CAMS data over the western Mediterranean area. The DL4DS library can be found in this repository: https://github.com/carlos-gg/dl4ds
    One Explanation to Rule them All -- Ensemble Consistent Explanations. (arXiv:2205.08974v1 [cs.AI])
    Transparency is a major requirement of modern AI-based decision making systems deployed in the real world. A popular approach for achieving transparency is by means of explanations. A wide variety of different explanations have been proposed for single decision making systems. In practice, a set (i.e. ensemble) of decisions is often used instead of a single decision, in particular in complex systems. Unfortunately, explanation methods for single decision making systems are not easily applicable to ensembles -- i.e. they would yield an ensemble of individual explanations which are not necessarily consistent, hence less useful and more difficult to understand than a single consistent explanation of all observed phenomena. We propose a novel concept for consistently explaining an ensemble of decisions locally with a single explanation -- we introduce a formal concept, as well as a specific implementation using counterfactual explanations.
    Price Interpretability of Prediction Markets: A Convergence Analysis. (arXiv:2205.08913v1 [q-fin.TR])
    Prediction markets have long been known for their prediction accuracy. However, there is still a lack of systematic understanding of how prediction markets aggregate information and why they work so well. This work proposes a multivariate utility (MU)-based mechanism that unifies several existing prediction market-making schemes. Based on this mechanism, we derive convergence results for markets with myopic, risk-averse traders who repeatedly interact with the market maker. We show that the resulting limiting wealth distribution lies on the Pareto efficient frontier defined by all market participants' utilities. With the help of this result, we establish both analytical and numerical results for the limiting price for different market models. We show that the limiting price converges to the geometric mean of agents' beliefs for exponential utility-based markets. For risk measure-based markets, we construct a risk measure family that meets the convergence requirements and show that the limiting price can converge to a weighted power mean of agent beliefs. For markets based on hyperbolic absolute risk aversion (HARA) utilities, we show that the limiting price is also a risk-adjusted weighted power mean of agent beliefs, even though the trading order will affect the aggregation weights. We further propose an approximation scheme for the limiting price under the HARA utility family. We show through numerical experiments that our approximation scheme works well in predicting the convergent prices.
    POViT: Vision Transformer for Multi-objective Design and Characterization of Nanophotonic Devices. (arXiv:2205.09045v1 [cs.LG])
    We solve a fundamental challenge in semiconductor IC design: the fast and accurate characterization of nanoscale photonic devices. Much like the fusion between AI and EDA, many efforts have been made to apply DNNs such as convolutional neural networks (CNN) to prototype and characterize next-gen optoelectronic devices commonly found in photonic integrated circuits (PIC) and LiDAR. These prior works generally strive to predict the quality factor (Q) and modal volume (V) of, for instance, photonic crystals, with ultra-high accuracy and speed. However, state-of-the-art models are still far from being directly applicable in the real world: e.g. the correlation coefficient of V ($V_{coeff}$) is only about 80%, which is much lower than what it takes to generate reliable and reproducible nanophotonic designs. Recently, attention-based transformer models have attracted extensive interest and been widely used in CV and NLP. In this work, we propose the first-ever Transformer model (POViT) to efficiently design and simulate semiconductor photonic devices with multiple objectives. Unlike the standard Vision Transformer (ViT), we supplied photonic crystals as data input and changed the activation layer from GELU to an absolute-value function (ABS). Our experiments show that POViT exceeds results reported by previous models significantly. The correlation coefficient $V_{coeff}$ increases by over 12% (i.e., to 92.0%) and the prediction error of Q is reduced by an order of magnitude, among several other key metric improvements. Our work has the potential to drive the expansion of EDA to fully automated photonic design. The complete dataset and code will be released to aid researchers endeavoring in the interdisciplinary field of physics and computer science.
    Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs. (arXiv:2205.09056v1 [cs.LG])
    Reinforcement learning (RL) generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, any slowly changing adversarial bandit algorithm that enjoys near-optimal regret in adversarial bandits can achieve near-optimal (expected) regret in non-episodic discounted MDPs. The slowly changing property required by our generalization is mild; see e.g. (Even-Dar et al. 2009, Neu et al. 2010). We also show, for example, that EXP3 (Auer et al. 2002) is slowly changing and enjoys near-optimal regret in MDPs.
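    The example algorithm cited to Auer et al. (2002) above is taken here to be EXP3, the standard exponential-weights method for adversarial bandits. The following is a minimal sketch of textbook EXP3 in the bandit setting only; it is not the paper's MDP construction. The function name `exp3`, the callback `reward_fn`, and the fixed exploration rate `gamma` are illustrative assumptions.

```python
import math
import random

def exp3(n_arms, horizon, reward_fn, gamma=0.1, seed=0):
    """Textbook EXP3 (illustrative sketch, hypothetical helper).

    reward_fn(t, arm) must return a reward in [0, 1].
    Returns the total reward and the last sampling distribution.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total = 0.0
    probs = [1.0 / n_arms] * n_arms
    for t in range(horizon):
        wsum = sum(weights)
        # Mix exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total += reward
        # Importance-weighted estimate: unbiased for the pulled arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Renormalize to avoid floating-point overflow on long horizons.
        top = max(weights)
        weights = [w / top for w in weights]
    return total, probs
```

    Each round multiplies one weight by a factor of at most e, so the sampling distribution moves only a little between consecutive rounds, which gives some intuition for a "slowly changing" property.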
    Multilayer Perceptron Based Stress Evolution Analysis under DC Current Stressing for Multi-segment Wires. (arXiv:2205.09065v1 [cs.LG])
    Electromigration (EM) is one of the major concerns in the reliability analysis of very large scale integration (VLSI) systems due to continuous technology scaling. Accurately predicting the time-to-failure of integrated circuits (IC) is becoming increasingly important for modern IC design. However, traditional methods are often not sufficiently accurate, leading to undesirable over-design, especially in advanced technology nodes. In this paper, we propose an approach using multilayer perceptrons (MLP) to compute stress evolution in the interconnect trees during the void nucleation phase. The availability of a customized trial function for neural network training holds the promise of finding dynamic mesh-free stress evolution on complex interconnect trees under time-varying temperatures. Specifically, we formulate a new objective function considering the EM-induced coupled partial differential equations (PDEs), boundary conditions (BCs), and initial conditions to enforce the physics-based constraints in the spatial-temporal domain. The proposed model avoids meshing and reduces temporal iterations compared with conventional numerical approaches like FEM. Numerical results confirm its advantages in accuracy and computational performance.
    Multi-disciplinary fairness considerations in machine learning for clinical trials. (arXiv:2205.08875v1 [cs.LG])
    While interest in the application of machine learning to improve healthcare has grown tremendously in recent years, a number of barriers prevent deployment in medical practice. A notable concern is the potential to exacerbate entrenched biases and existing health disparities in society. The area of fairness in machine learning seeks to address these issues of equity; however, appropriate approaches are context-dependent, necessitating domain-specific consideration. We focus on clinical trials, i.e., research studies conducted on humans to evaluate medical treatments. Clinical trials are a relatively under-explored application in machine learning for healthcare, in part due to complex ethical, legal, and regulatory requirements and high costs. Our aim is to provide a multi-disciplinary assessment of how fairness for machine learning fits into the context of clinical trials research and practice. We start by reviewing the current ethical considerations and guidelines for clinical trials and examine their relationship with common definitions of fairness in machine learning. We examine potential sources of unfairness in clinical trials, providing concrete examples, and discuss the role machine learning might play in either mitigating potential biases or exacerbating them when applied without care. Particular focus is given to adaptive clinical trials, which may employ machine learning. Finally, we highlight concepts that require further investigation and development, and emphasize new approaches to fairness that may be relevant to the design of clinical trials.
    Deep Features for CBIR with Scarce Data using Hebbian Learning. (arXiv:2205.08935v1 [cs.CV])
    Features extracted from Deep Neural Networks (DNNs) have proven to be very effective in the context of Content Based Image Retrieval (CBIR). In recent work, biologically inspired Hebbian learning algorithms have shown promise for DNN training. In this contribution, we study the performance of such algorithms in the development of feature extractors for CBIR tasks. Specifically, we consider a semi-supervised learning strategy in two steps: first, an unsupervised pre-training stage is performed using Hebbian learning on the image dataset; second, the network is fine-tuned using supervised Stochastic Gradient Descent (SGD) training. For the unsupervised pre-training stage, we explore the nonlinear Hebbian Principal Component Analysis (HPCA) learning rule. For the supervised fine-tuning stage, we assume sample efficiency scenarios, in which the number of labeled samples is just a small fraction of the whole dataset. Our experimental analysis, conducted on the CIFAR10 and CIFAR100 datasets, shows that, when few labeled samples are available, our Hebbian approach provides relevant improvements compared to various alternative methods.
    One-way Explainability Isn't The Message. (arXiv:2205.08954v1 [cs.LG])
    Recent engineering developments in specialised computational hardware, data-acquisition and storage technology have seen the emergence of Machine Learning (ML) as a powerful form of data analysis with widespread applicability beyond its historical roots in the design of autonomous agents. However -- possibly because of its origins in the development of agents capable of self-discovery -- relatively little attention has been paid to the interaction between people and ML. In this paper we are concerned with the use of ML in automated or semi-automated tools that assist one or more human decision makers. We argue that requirements on both human and machine in this context are significantly different from those arising when ML is used either as part of autonomous agents for self-discovery or as part of statistical data analysis. Our principal position is that the design of such human-machine systems should be driven by repeated, two-way intelligibility of information rather than one-way explainability of the ML-system's recommendations. Iterated rounds of intelligible information exchange, we think, will characterise the kinds of collaboration that will be needed to understand complex phenomena for which neither man nor machine has complete answers. We propose operational principles -- we call them Intelligibility Axioms -- to guide the design of a collaborative decision-support system. The principles are concerned with: (a) what it means for information provided by the human to be intelligible to the ML system; and (b) what it means for an explanation provided by an ML system to be intelligible to a human. Using examples from the literature on the use of ML for drug-design and in medicine, we demonstrate cases where the conditions of the axioms are met. We describe some additional requirements needed for the design of a truly collaborative decision-support system.
    Predicting failure characteristics of structural materials via deep learning based on nondestructive void topology. (arXiv:2205.09075v1 [cond-mat.mtrl-sci])
    Accurate prediction of the failure progression of structural materials is critical for preventing failure-induced accidents. Despite considerable mechanics modeling-based efforts, accurate prediction remains a challenging task in real-world environments due to unexpected damage factors and defect evolutions. Here, we report a novel method for predicting material failure characteristics that uniquely combines nondestructive X-ray computed tomography (X-CT), persistent homology (PH), and deep multimodal learning (DML). The combined method takes the microstructural defect state at the time of material examination as input and outputs the failure-related properties. Our method is demonstrated to be effective using two types of fracture datasets (tensile and fatigue datasets) with ferritic low alloy steel as a representative structural material. The method achieves a mean absolute error (MAE) of 0.09 in predicting the local strain with the tensile dataset and an MAE of 0.14 in predicting the fracture progress with the fatigue dataset. These high accuracies are mainly due to PH processing of the X-CT images, which transforms complex and noisy three-dimensional X-CT images into compact two-dimensional persistence diagrams that preserve key topological features such as the internal void size, density, and distribution. The combined PH and DML processing of 3D X-CT data is our unique approach, enabling reliable failure predictions at the time of material examination based on void topology progressions, and the method can be extended to various nondestructive failure tests for practical use.
    On-device modeling of user's social context and familiar places from smartphone-embedded sensor data. (arXiv:2205.08790v1 [cs.LG])
    Context modeling and recognition represent complex tasks that allow mobile and ubiquitous computing applications to adapt to the user's situation. Current solutions mainly focus on limited context information generally processed on centralized architectures, potentially exposing users' personal data to privacy leakage, and missing personalization features. For these reasons, on-device context modeling and recognition represent the current research trend in this area. Among the different kinds of information characterizing the user's context in mobile environments, social interactions and visited locations contribute remarkably to the characterization of daily life scenarios. In this paper we propose a novel, unsupervised and lightweight approach to model the user's social context and her locations based on ego networks directly on the user's mobile device. Relying on this model, the system is able to extract high-level and semantic-rich context features from smartphone-embedded sensor data. Specifically, for the social context it exploits data related to both physical and cyber social interactions among users and their devices. As far as location context is concerned, we assume that it is more relevant to model the familiarity degree of a specific location for the user's context than the raw location data, both in terms of GPS coordinates and proximity devices. Using 5 real-world datasets, we assess the structure of the social and location ego networks, and we provide a semantic evaluation of the proposed models and a complexity evaluation in terms of mobile computing performance. Finally, we demonstrate the relevance of the extracted features by showing the performance of 3 machine learning algorithms in recognizing daily-life situations, obtaining an improvement of 3% in AUROC, 9% in Precision, and 5% in Recall compared with using only features related to physical context.
    Policy Distillation with Selective Input Gradient Regularization for Efficient Interpretability. (arXiv:2205.08685v1 [cs.LG])
    Although deep Reinforcement Learning (RL) has proven successful in a wide range of tasks, one challenge it faces is interpretability when applied to real-world problems. Saliency maps are frequently used to provide interpretability for deep neural networks. However, in the RL domain, existing saliency map approaches are either computationally expensive and thus cannot satisfy the real-time requirement of real-world scenarios or cannot produce interpretable saliency maps for RL policies. In this work, we propose an approach of Distillation with selective Input Gradient Regularization (DIGR) which uses policy distillation and input gradient regularization to produce new policies that achieve both high interpretability and computation efficiency in generating saliency maps. Our approach is also found to improve the robustness of RL policies to multiple adversarial attacks. We conduct experiments on three tasks, MiniGrid (Fetch Object), Atari (Breakout) and CARLA Autonomous Driving, to demonstrate the importance and effectiveness of our approach.
    Optimal Adaptive Prediction Intervals for Electricity Load Forecasting in Distribution Systems via Reinforcement Learning. (arXiv:2205.08698v1 [stat.AP])
    Prediction intervals offer an effective tool for quantifying the uncertainty of loads in distribution systems. The traditional central PIs cannot adapt well to skewed distributions, and their offline training fashion is vulnerable to unforeseen changes in future load patterns. Therefore, we propose an optimal PI estimation approach, which is online and adaptive to different data distributions by adaptively determining symmetric or asymmetric probability proportion pairs for quantiles. It relies on the online learning ability of reinforcement learning to integrate the two online tasks, i.e., the adaptive selection of probability proportion pairs and quantile predictions, both of which are modeled by neural networks. As such, the quality of quantiles-formed PI can guide the selection process of optimal probability proportion pairs, which forms a closed loop to improve the quality of PIs. Furthermore, to improve the learning efficiency of quantile forecasts, a prioritized experience replay strategy is proposed for online quantile regression processes. Case studies on both load and net load demonstrate that the proposed method can better adapt to data distribution compared with online central PIs method. Compared with offline-trained methods, it obtains PIs with better quality and is more robust against concept drift.
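The contrast between central and asymmetric probability proportion pairs can be illustrated with a small sketch. It is not the reinforcement learning method of the abstract: it simply grid-searches the lower quantile level on a synthetic right-skewed sample and compares the resulting interval width with the central pair. The function names, the exponential sample, and the grid of candidate levels are all illustrative assumptions.

```python
import random

def empirical_quantile(sorted_xs, p):
    """Empirical quantile of a pre-sorted sample for a level p in [0, 1]."""
    idx = min(int(p * len(sorted_xs)), len(sorted_xs) - 1)
    return sorted_xs[idx]

def interval_for_pair(sorted_xs, lower_p, coverage):
    """PI bounded by the quantiles at levels (lower_p, lower_p + coverage)."""
    upper_p = min(lower_p + coverage, 1.0)
    return (empirical_quantile(sorted_xs, lower_p),
            empirical_quantile(sorted_xs, upper_p))

def narrowest_pair(sorted_xs, coverage, n_candidates=11):
    """Grid-search the lower level; return (lower_p, low, high) of the narrowest PI."""
    alpha = 1.0 - coverage
    best = None
    for k in range(n_candidates):
        lower_p = alpha * k / (n_candidates - 1)
        low, high = interval_for_pair(sorted_xs, lower_p, coverage)
        if best is None or high - low < best[2] - best[1]:
            best = (lower_p, low, high)
    return best

# Synthetic, right-skewed stand-in for load data (an assumption, not real data).
rng = random.Random(0)
loads = sorted(rng.expovariate(1.0) for _ in range(5000))
coverage = 0.9

# Central pair: equal tail probability on both sides.
central = interval_for_pair(loads, (1.0 - coverage) * 5 / 10, coverage)
best_p, best_low, best_high = narrowest_pair(loads, coverage)

print("central width:  ", central[1] - central[0])
print("narrowest width:", best_high - best_low, "at lower level", best_p)
```

Because the central pair is one of the candidates, the selected interval is never wider than the central one at the same nominal coverage; on skewed data it is typically narrower.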
    Customizing ML Predictions for Online Algorithms. (arXiv:2205.08715v1 [cs.LG])
    A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a black-box, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the context of the classic rent-or-buy problem, and show that incorporating optimization benchmarks in ML loss functions leads to significantly better performance, while maintaining a worst-case adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations.
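For readers unfamiliar with the rent-or-buy (ski rental) problem, the following minimal sketch shows the classic break-even strategy without any ML advice, which is the textbook worst-case baseline and not the method of the paper. The function names and the example prices are illustrative assumptions.

```python
def break_even_buy_day(buy_price, rent_price):
    """Day on which to buy: the smallest day d with rent_price * d >= buy_price."""
    return (buy_price + rent_price - 1) // rent_price  # integer ceiling

def online_cost(buy_price, rent_price, days_needed, buy_day):
    """Cost of renting on days 1..buy_day-1 and buying on day buy_day if it is reached."""
    if days_needed < buy_day:
        return rent_price * days_needed
    return rent_price * (buy_day - 1) + buy_price

def offline_cost(buy_price, rent_price, days_needed):
    """Optimal cost when the number of days is known in advance."""
    return min(buy_price, rent_price * days_needed)

def worst_case_ratio(buy_price, rent_price, horizon):
    """Largest online/offline cost ratio over all season lengths up to `horizon`."""
    buy_day = break_even_buy_day(buy_price, rent_price)
    return max(
        online_cost(buy_price, rent_price, days, buy_day)
        / offline_cost(buy_price, rent_price, days)
        for days in range(1, horizon + 1)
    )

# With a buy price of 10 and a rent price of 1, the strategy rents for
# 9 days and buys on day 10, so it pays at most 19 against an optimum of 10.
print(worst_case_ratio(buy_price=10, rent_price=1, horizon=50))  # 1.9
```

The ratio stays below 2 for any season length, which is the kind of worst-case guarantee that a prediction-aware algorithm aims to keep when its advice turns out to be wrong.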
    No More Pesky Hyperparameters: Offline Hyperparameter Tuning for RL. (arXiv:2205.08716v1 [cs.LG])
    The performance of reinforcement learning (RL) agents is sensitive to the choice of hyperparameters. In real-world settings like robotics or industrial control systems, however, testing different hyperparameter configurations directly on the environment can be financially prohibitive, dangerous, or time consuming. We propose a new approach to tune hyperparameters from offline logs of data, to fully specify the hyperparameters for an RL agent that learns online in the real world. The approach is conceptually simple: we first learn a model of the environment from the offline data, which we call a calibration model, and then simulate learning in the calibration model to identify promising hyperparameters. We identify several criteria to make this strategy effective, and develop an approach that satisfies these criteria. We empirically investigate the method in a variety of settings to identify when it is effective and when it fails.  ( 2 min )
    Hyperparameter Optimization with Neural Network Pruning. (arXiv:2205.08695v1 [cs.CV])
    Because deep learning models depend heavily on their hyperparameters, hyperparameter optimization is essential in developing deep learning-based applications, even though it takes a long time. As service development using deep learning models has become increasingly competitive, developers demand rapid hyperparameter optimization algorithms. To meet this need, researchers are focusing on improving the speed of hyperparameter optimization algorithms. However, the large time cost of hyperparameter optimization that stems from the high computational cost of the deep learning model itself has not been addressed in depth. As with the surrogate model in Bayesian optimization, solving this problem requires a proxy model for the neural network (N_B) that is the target of hyperparameter optimization. Inspired by the main goal of neural network pruning, i.e., reducing computational cost while preserving performance, we presumed that the neural network (N_P) obtained through pruning would be a good proxy model of N_B. To verify this idea, we performed extensive experiments using the CIFAR10, CIFAR100, and TinyImageNet datasets, three commonly used neural networks, and three representative hyperparameter optimization methods. Through these experiments, we verified that N_P can be a good proxy model of N_B for rapid hyperparameter optimization. The proposed hyperparameter optimization framework can reduce the time required by up to 37%.
    Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling. (arXiv:2205.08766v1 [cs.LG])
    Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.  ( 2 min )
    Spatial-Temporal Interactive Dynamic Graph Convolution Network for Traffic Forecasting. (arXiv:2205.08689v1 [cs.LG])
    Accurate traffic forecasting is essential for smart cities to achieve traffic flow control, route planning, and detection. Although many spatial-temporal methods have been proposed, these methods are deficient in capturing the spatial-temporal dependence of traffic data synchronously. In addition, most of the methods ignore the dynamically changing correlations between road network nodes that arise as traffic data changes. To address these challenges, we propose a neural network-based Spatial-Temporal Interactive Dynamic Graph Convolutional Network (STIDGCN) for traffic forecasting. In STIDGCN, we propose an interactive dynamic graph convolution structure, which first divides the sequences at intervals and then captures the spatial-temporal dependence of the traffic data simultaneously through an interactive learning strategy for effective long-term prediction. We also propose a novel dynamic graph convolution module consisting of a graph generator and a fusion graph convolution. The module uses the input traffic data and the pre-defined graph structure to generate a graph structure and fuses it with the defined adaptive adjacency matrix, which fills in the pre-defined graph structure and simulates the generation of dynamic associations between nodes in the road network. Extensive experiments on four real-world traffic flow datasets demonstrate that STIDGCN outperforms the state-of-the-art baseline.
    Deep-learned orthogonal basis patterns for fast, noise-robust single-pixel imaging. (arXiv:2205.08736v1 [eess.IV])
    Single-pixel imaging (SPI) is a novel, unconventional method that goes beyond the notion of traditional cameras but can be computationally expensive and slow for real-time applications. Deep learning has been proposed as an alternative approach for solving the SPI reconstruction problem, but a detailed analysis of its performance and generated basis patterns when used for SPI is limited. We present a modified deep convolutional autoencoder network (DCAN) for SPI on 64x64 pixel images with up to 6.25% compression ratio and apply binary and orthogonality regularizers during training. Training a DCAN with these regularizers allows it to learn multiple measurement bases that have combinations of binary or non-binary, and orthogonal or non-orthogonal patterns. We compare the reconstruction quality, orthogonality of the patterns, and robustness to noise of the resulting DCAN models to traditional SPI reconstruction algorithms (such as Total Variation minimization and Fourier Transform). Our DCAN models can be trained to be robust to noise while still having fast enough reconstruction times (~3 ms per frame) to be viable for real-time imaging.  ( 2 min )
    Revisiting PINNs: Generative Adversarial Physics-informed Neural Networks and Point-weighting Method. (arXiv:2205.08754v1 [cs.LG])
    Physics-informed neural networks (PINNs) provide a deep learning framework for numerically solving partial differential equations (PDEs), and have been widely used in a variety of PDE problems. However, some challenges remain in the application of PINNs: 1) the mechanism of PINNs is unsuitable for (or at least cannot be directly applied to) exploiting a small number of (usually very few) extra informative samples to refine the networks; and 2) the efficiency of training PINNs often becomes low for some complicated PDEs. In this paper, we propose the generative adversarial physics-informed neural network (GA-PINN), which integrates the generative adversarial (GA) mechanism with the structure of PINNs, to improve the performance of PINNs by exploiting only a small number of exact solutions to the PDEs. Inspired by the weighting strategy of the Adaboost method, we then introduce a point-weighting (PW) method to improve the training efficiency of PINNs, where the weight of each sample point is adaptively updated at each training iteration. The numerical experiments show that GA-PINNs outperform PINNs on many well-known PDEs and that the PW method also improves the efficiency of training PINNs and GA-PINNs.
    CARNet: A Dynamic Autoencoder for Learning Latent Dynamics in Autonomous Driving Tasks. (arXiv:2205.08712v1 [cs.LG])
    Autonomous driving has received a lot of attention in the automotive industry and is often seen as the future of transportation. Passenger vehicles equipped with a wide array of sensors (e.g., cameras, front-facing radars, LiDARs, and IMUs) capable of continuous perception of the environment are becoming increasingly prevalent. These sensors provide a stream of high-dimensional, temporally correlated data that is essential for reliable autonomous driving. An autonomous driving system should effectively use the information collected from the various sensors in order to form an abstract description of the world and maintain situational awareness. Deep learning models, such as autoencoders, can be used for that purpose, as they can learn compact latent representations from a stream of incoming data. However, most autoencoder models process the data independently, without assuming any temporal interdependencies. Thus, there is a need for deep learning models that explicitly consider the temporal dependence of the data in their architecture. This work proposes CARNet, a Combined dynAmic autoencodeR NETwork architecture that utilizes an autoencoder combined with a recurrent neural network to learn the current latent representation and, in addition, also predict future latent representations in the context of autonomous driving. We demonstrate the efficacy of the proposed model in both imitation and reinforcement learning settings using both simulated and real datasets. Our results show that the proposed model outperforms the baseline state-of-the-art model, while having significantly fewer trainable parameters.  ( 2 min )
    QAPPA: Quantization-Aware Power, Performance, and Area Modeling of DNN Accelerators. (arXiv:2205.08648v1 [cs.AR])
    As the machine learning and systems community strives to achieve higher energy-efficiency through custom DNN accelerators and model compression techniques, there is a need for a design space exploration framework that incorporates quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QAPPA, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN workloads. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our proposed lightweight processing elements achieve up to 4.9x more performance per area and energy improvement when compared to an INT16-based implementation.
    Evaluation of Transfer Learning for Polish with a Text-to-Text Model. (arXiv:2205.08808v1 [cs.CL])
    We introduce a new benchmark for assessing the quality of text-to-text models for Polish. The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering. In particular, since summarization and question answering lack benchmark datasets for the Polish language, we describe their construction and make them publicly available. Additionally, we present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective. Unsupervised denoising pre-training is performed efficiently by initializing the model weights with a multi-lingual T5 (mT5) counterpart. We evaluate the performance of plT5, mT5, Polish BART (plBART), and Polish GPT-2 (papuGaPT2). The plT5 scores top on all of these tasks except summarization, where plBART is best. In general (except for summarization), the larger the model, the better the results. The encoder-decoder architectures prove to be better than the decoder-only equivalent.  ( 2 min )
    Probability trees and the value of a single intervention. (arXiv:2205.08779v1 [cs.LG])
    The most fundamental problem in statistical causality is determining causal relationships from limited data. Probability trees, which combine prior causal structures with Bayesian updates, have been suggested as a possible solution. In this work, we quantify the information gain from a single intervention and show that both the anticipated information gain, prior to making an intervention, and the expected gain from an intervention have simple expressions. This results in an active-learning method that simply selects the intervention with the highest anticipated gain, which we illustrate through several examples. Our work demonstrates how probability trees, and Bayesian estimation of their parameters, offer a simple yet viable approach to fast causal induction.  ( 2 min )
    Markov Chain Monte Carlo for Continuous-Time Switching Dynamical Systems. (arXiv:2205.08803v1 [cs.LG])
    Switching dynamical systems are an expressive model class for the analysis of time-series data. Because the systems under study in many fields of the natural and engineering sciences typically evolve continuously in time, it is natural to consider continuous-time model formulations consisting of switching stochastic differential equations governed by an underlying Markov jump process. Inference in these types of models is, however, notoriously difficult, and tractable computational schemes are rare. In this work, we propose a novel inference algorithm utilizing a Markov Chain Monte Carlo approach. The presented Gibbs sampler makes it possible to efficiently obtain samples from the exact continuous-time posterior processes. Our framework naturally enables Bayesian parameter estimation, and we also include an estimate for the diffusion covariance, which is often assumed fixed in stochastic differential equation models. We evaluate our framework under the modeling assumption and compare it against an existing variational inference approach.
    Accurate Fairness: Improving Individual Fairness without Trading Accuracy. (arXiv:2205.08704v1 [cs.LG])
    Accuracy and fairness are both crucial aspects of trustworthy machine learning. However, in practice, enhancing one aspect may inevitably sacrifice the other. We propose in this paper a new fairness criterion, accurate fairness, to assess whether an individual is treated both accurately and fairly regardless of protected attributes. We further propose new fairness metrics, fair-precision, fair-recall and fair-F1 score, to evaluate the reliability of a machine learning model from the perspective of accurate fairness. Thus, the side effects of enhancing just one of the two aspects, i.e., true bias and false fairness, can be effectively identified with our criterion. We then present a fair Siamese approach for accurate fairness training. To the best of our knowledge, this is the first time that a Siamese approach has been adapted for bias mitigation. Case studies with typical fairness benchmarks demonstrate that our fair Siamese approach can, on average, achieve 17.4% higher individual fairness, 11.5% higher fair-F1 score, and 4.7% higher accuracy for a machine learning model than state-of-the-art bias mitigation techniques. Finally, our approach is applied to mitigate possible service discrimination on a real Ctrip dataset, fairly serving on average 97.9% of customers with different consumption habits who pay the same prices for the same rooms (20.7% more than the original models).
    The Solvability of Interpretability Evaluation Metrics. (arXiv:2205.08696v1 [cs.LG])
    Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency, which are motivated by the principle that more important features -- as judged by the explanation -- should have larger impacts on model prediction. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric and solve it using beam search. This brings up the obvious question: given such solvability, why do we still develop other explainers and then evaluate them on the metric? We present a series of investigations showing that this beam search explainer is generally comparable or favorable to current choices such as LIME and SHAP, suggest rethinking the goals of model interpretability, and identify several directions towards better evaluations of new method proposals.  ( 2 min )
    A Regression Approach to Learning-Augmented Online Algorithms. (arXiv:2205.08717v1 [cs.LG])
    The emerging field of learning-augmented online algorithms uses ML techniques to predict future input parameters and thereby improve the performance of online algorithms. Since these parameters are, in general, real-valued functions, a natural approach is to use regression techniques to make these predictions. We introduce this approach in this paper, and explore it in the context of a general online search framework that captures classic problems like (generalized) ski rental, bin packing, minimum makespan scheduling, etc. We show nearly tight bounds on the sample complexity of this regression problem, and extend our results to the agnostic setting. From a technical standpoint, we show that the key is to incorporate online optimization benchmarks in the design of the loss function for the regression problem, thereby diverging from the use of off-the-shelf regression tools with standard bounds on statistical error.  ( 2 min )
    Property Unlearning: A Defense Strategy Against Property Inference Attacks. (arXiv:2205.08821v1 [cs.CR])
    During the training of machine learning models, they may store or "learn" more information about the training data than what is actually needed for the prediction or classification task. This is exploited by property inference attacks which aim at extracting statistical properties from the training data of a given model without having access to the training data itself. These properties may include the quality of pictures to identify the camera model, the age distribution to reveal the target audience of a product, or the included host types to refine a malware attack in computer networks. This attack is especially accurate when the attacker has access to all model parameters, i.e., in a white-box scenario. By defending against such attacks, model owners are able to ensure that their training data, associated properties, and thus their intellectual property stays private, even if they deliberately share their models, e.g., to train collaboratively, or if models are leaked. In this paper, we introduce property unlearning, an effective defense mechanism against white-box property inference attacks, independent of the training data type, model task, or number of properties. Property unlearning mitigates property inference attacks by systematically changing the trained weights and biases of a target model such that an adversary cannot extract chosen properties. We empirically evaluate property unlearning on three different data sets, including tabular and image data, and two types of artificial neural networks. Our results show that property unlearning is both efficient and reliable to protect machine learning models against property inference attacks, with a good privacy-utility trade-off. Furthermore, our approach indicates that this mechanism is also effective to unlearn multiple properties.  ( 2 min )
    TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision. (arXiv:2205.08731v1 [cs.LG])
    Nowadays, deep neural networks outperform humans in many tasks. However, if the input distribution drifts away from the one used in training, their performance drops significantly. Recently published research has shown that adapting the model parameters to the test sample can mitigate this performance degradation. In this paper, we therefore propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples. Using the provided prototypes of SwAV and our derived test-time loss, we align the representation of unseen test samples with the self-supervised learned prototypes. We show the success of our method on the common benchmark dataset CIFAR10-C.  ( 2 min )
    It Isn't Sh!tposting, It's My CAT Posting. (arXiv:2205.08710v1 [cs.CV])
    In this paper, we describe a novel architecture which can generate hilarious captions for a given input image. The architecture is split into two halves, i.e., image captioning and hilarious text conversion. The architecture starts with a pre-trained CNN model, VGG16 in this implementation, and applies an attention LSTM on it to generate a normal caption. These normal captions are then fed forward to our hilarious text conversion transformer, which converts this text into something hilarious while maintaining the context of the input image. The two halves can also be used separately: the seq2seq transformer alone can generate a hilarious caption from an input sentence. This paper aims to help everyday users be lazier and more hilarious at the same time by generating captions using CATNet.
    Exact Gaussian Processes for Massive Datasets via Non-Stationary Sparsity-Discovering Kernels. (arXiv:2205.09070v1 [stat.ML])
    A Gaussian Process (GP) is a prominent mathematical framework for stochastic function approximation in science and engineering applications. This success is largely attributed to the GP's analytical tractability, robustness, non-parametric structure, and natural inclusion of uncertainty quantification. Unfortunately, the use of exact GPs is prohibitively expensive for large datasets due to their unfavorable numerical complexity of $O(N^3)$ in computation and $O(N^2)$ in storage. All existing methods addressing this issue utilize some form of approximation -- usually considering subsets of the full dataset or finding representative pseudo-points that render the covariance matrix well-structured and sparse. These approximate methods can lead to inaccuracies in function approximations and often limit the user's flexibility in designing expressive kernels. Instead of inducing sparsity via data-point geometry and structure, we propose to take advantage of naturally-occurring sparsity by allowing the kernel to discover -- instead of induce -- sparse structure. The premise of this paper is that GPs, in their most native form, are often naturally sparse, but commonly-used kernels do not allow us to exploit this sparsity. The core concept of exact, and at the same time sparse GPs relies on kernel definitions that provide enough flexibility to learn and encode not only non-zero but also zero covariances. This principle of ultra-flexible, compactly-supported, and non-stationary kernels, combined with HPC and constrained optimization, lets us scale exact GPs well beyond 5 million data points.
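    The role of compact support in producing exact zero covariances can be illustrated with a small sketch. The Wendland function used here is a standard stationary, compactly supported kernel and is an assumption chosen for illustration; it is not the non-stationary, learned kernel the abstract describes, and the names `wendland_kernel` and `support` are hypothetical.

```python
def wendland_kernel(x1, x2, support=1.0):
    """Wendland kernel (1 - r)^4 (4r + 1) for r < 1, exactly 0 otherwise.

    Assumption for illustration: a fixed, stationary support radius.
    """
    r = abs(x1 - x2) / support
    if r >= 1.0:
        return 0.0  # points farther apart than `support` have zero covariance
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)


# Ten 1-D inputs spaced 0.5 apart; only neighbours within distance < 1 covary.
points = [0.5 * i for i in range(10)]
K = [[wendland_kernel(a, b) for b in points] for a in points]

# Count entries that are exactly zero: these never need to be stored.
zeros = sum(1 for row in K for v in row if v == 0.0)
print(f"{zeros} of {len(points) ** 2} covariance entries are exactly zero")
```

    Because the zeros are exact rather than thresholded, a sparse matrix format could store and factorize such a covariance without approximating the GP.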
    Bagged Polynomial Regression and Neural Networks. (arXiv:2205.08609v1 [stat.ML])
    Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of bagged polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset.  ( 2 min )
    Need is All You Need: Homeostatic Neural Networks Adapt to Concept Shift. (arXiv:2205.08645v1 [cs.LG])
    In living organisms, homeostasis is the natural regulation of internal states aimed at maintaining conditions compatible with life. Typical artificial systems are not equipped with comparable regulatory features. Here, we introduce an artificial neural network that incorporates homeostatic features. Its own computing substrate is placed in a needful and vulnerable relation to the very objects over which it computes. For example, artificial neurons performing classification of MNIST digits or Fashion-MNIST articles of clothing may receive excitatory or inhibitory effects, which alter their own learning rate as a direct result of perceiving and classifying the digits. In this scenario, accurate recognition is desirable to the agent itself because it guides decisions to regulate its vulnerable internal states and functionality. Counterintuitively, the addition of vulnerability to a learner does not necessarily impair its performance. On the contrary, self-regulation in response to vulnerability confers benefits under certain conditions. We show that homeostatic design confers increased adaptability under concept shift, in which the relationships between labels and data change over time, and that the greatest advantages are obtained under the highest rates of shift. This necessitates the rapid un-learning of past associations and the re-learning of new ones. We also demonstrate the superior abilities of homeostatic learners in environments with dynamically changing rates of concept shift. Our homeostatic design exposes the artificial neural network's thinking machinery to the consequences of its own "thoughts", illustrating the advantage of putting one's own "skin in the game" to improve fluid intelligence.  ( 2 min )
    Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging. (arXiv:2205.08576v1 [cs.CV])
    The curation of large-scale medical datasets from multiple institutions, necessary for training deep learning models, is challenged by the difficulty of sharing patient data while preserving privacy. Federated learning (FL), a paradigm that enables privacy-protected collaborative learning among different institutions, is a promising solution to this challenge. However, FL generally suffers from performance deterioration due to heterogeneous data distributions across institutions and the lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Specifically, we introduce a novel distributed self-supervised pre-training paradigm into the existing FL pipeline (i.e., pre-training the models directly on the decentralized target task datasets). Built upon the recent success of Vision Transformers, we employ masked image encoding tasks for self-supervised pre-training, to facilitate more effective knowledge transfer to downstream federated models. Extensive empirical results on simulated and real-world medical imaging federated datasets show that self-supervised pre-training largely benefits the robustness of federated models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared with the supervised baseline with ImageNet pre-training. Moreover, we show that our self-supervised FL algorithm generalizes well to out-of-distribution data and learns federated models more effectively in limited label scenarios, surpassing the supervised baseline by 10.36% and the semi-supervised FL method by 8.3% in test accuracy.
    Variational Quantum Compressed Sensing for Joint User and Channel State Acquisition in Grant-Free Device Access Systems. (arXiv:2205.08603v1 [eess.SP])
    This paper introduces a new quantum computing framework integrated with a two-step compressed sensing technique, applied to a joint channel estimation and user identification problem. We propose a variational quantum circuit (VQC) design as a new denoising solution. For a practical grant-free communications system having correlated device activities, variational quantum parameters for Pauli rotation gates in the proposed VQC system are optimized to facilitate the non-linear estimation. Numerical results show that the VQC method can outperform modern compressed sensing techniques using an element-wise denoiser.
    Learning Quantum Entanglement Distillation with Noisy Classical Communications. (arXiv:2205.08561v1 [quant-ph])
    Quantum networking relies on the management and exploitation of entanglement. Practical sources of entangled qubits are imperfect, producing mixed quantum states with reduced fidelity with respect to ideal Bell pairs. Therefore, an important primitive for quantum networking is entanglement distillation, whose goal is to enhance the fidelity of entangled qubits through local operations and classical communication (LOCC). Existing distillation protocols assume the availability of ideal, noiseless communication channels. In this paper, we study the case in which communication takes place over noisy binary symmetric channels. We propose to implement local processing through parameterized quantum circuits (PQCs) that are optimized to maximize the average fidelity, while accounting for communication errors. The introduced approach, Noise Aware-LOCCNet (NA-LOCCNet), is shown to have significant advantages over existing protocols designed for noiseless communications.
    Hierarchical Distribution-Aware Testing of Deep Learning. (arXiv:2205.08589v1 [cs.SE])
    With its growing use in safety/security-critical applications, Deep Learning (DL) has raised increasing concerns regarding its dependability. In particular, DL has a notorious problem of lacking robustness. Despite recent efforts made in detecting Adversarial Examples (AEs) via state-of-the-art attacking and testing methods, these methods are normally input-distribution agnostic and/or disregard the perception quality of AEs. Consequently, the detected AEs are either irrelevant inputs in the application context or unnatural/unrealistic inputs that can be easily noticed by humans. This may lead to a limited effect on improving the DL model's dependability, as the testing budget is likely to be wasted on detecting AEs that are encountered very rarely in its real-life operations. In this paper, we propose a new robustness testing approach for detecting AEs that considers both the input distribution and the perceptual quality of inputs. The two considerations are encoded by a novel hierarchical mechanism. First, at the feature level, the input data distribution is extracted and approximated by data compression techniques and probability density estimators. This quantified feature-level distribution, together with indicators that are highly correlated with local robustness, is considered in selecting test seeds. Given a test seed, we then develop a two-step genetic algorithm for local test case generation at the pixel level, in which two fitness functions work alternately to control the quality of detected AEs. Finally, extensive experiments confirm that our holistic approach considering hierarchical distributions at feature and pixel levels is superior to state-of-the-art methods that either disregard any input distribution or only consider a single (non-hierarchical) distribution, in terms of not only the quality of detected AEs but also improving the overall robustness of the DL model under testing.
    Frank Wolfe Meets Metric Entropy. (arXiv:2205.08634v1 [stat.ML])
    The Frank-Wolfe algorithm has seen a resurgence in popularity due to its ability to efficiently solve constrained optimization problems in machine learning and high-dimensional statistics. As such, there is much interest in establishing when the algorithm may possess a "linear" $O(\log(1/\epsilon))$ dimension-free iteration complexity comparable to projected gradient descent. In this paper, we provide a general technique for establishing domain specific and easy-to-estimate lower bounds for Frank-Wolfe and its variants using the metric entropy of the domain. Most notably, we show that a dimension-free linear upper bound must fail not only in the worst case, but in the \emph{average case}: for a Gaussian or spherical random polytope in $\mathbb{R}^d$ with $\mathrm{poly}(d)$ vertices, Frank-Wolfe requires up to $\tilde\Omega(d)$ iterations to achieve a $O(1/d)$ error bound, with high probability. We also establish this phenomenon for the nuclear norm ball. The link with metric entropy also has interesting positive implications for conditional gradient algorithms in statistics, such as gradient boosting and matching pursuit. In particular, we show that it is possible to extract fast-decaying upper bounds on the excess risk directly from an analysis of the underlying optimization procedure.  ( 2 min )
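    For readers unfamiliar with the algorithm under discussion, the following is a minimal sketch of classical Frank-Wolfe on the probability simplex. The objective (squared distance to a point `target`), the step size 2/(k+2), and the function name `frank_wolfe_simplex` are assumptions chosen for illustration; none of them come from the paper. With this standard step size the known guarantee is sublinear, O(1/k), which is the kind of rate the abstract contrasts with a "linear" one.

```python
def frank_wolfe_simplex(target, iterations=2000):
    """Minimise ||x - target||^2 over the probability simplex with Frank-Wolfe."""
    n = len(target)
    x = [1.0] + [0.0] * (n - 1)  # start at a vertex of the simplex
    for k in range(iterations):
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        # Linear minimisation oracle: over the simplex, the minimiser of
        # <grad, s> is the vertex with the smallest gradient coordinate.
        s = min(range(n), key=lambda i: grad[i])
        gamma = 2.0 / (k + 2.0)  # standard open-loop step size
        # Convex combination of the current iterate and the chosen vertex.
        x = [(1.0 - gamma) * xi for xi in x]
        x[s] += gamma
    return x


target = [0.2, 0.3, 0.5]  # already inside the simplex, so the optimum is 0
x = frank_wolfe_simplex(target)
gap = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
print(x, gap)
```

    Each iteration only calls a linear minimisation oracle and never projects, which is why the method is attractive on domains such as polytopes or the nuclear norm ball.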
    Quantum Transfer Learning for Wi-Fi Sensing. (arXiv:2205.08590v1 [cs.LG])
    Beyond data communications, commercial-off-the-shelf Wi-Fi devices can be used to monitor human activities, track device locomotion, and sense the ambient environment. In particular, spatial beam attributes that are inherently available in the 60-GHz IEEE 802.11ad/ay standards have been shown to be effective in terms of overhead and channel measurement granularity for these indoor sensing tasks. In this paper, we investigate transfer learning to mitigate domain shift in human monitoring tasks when Wi-Fi settings and environments change over time. As a proof-of-concept study, we consider quantum neural networks (QNN) as well as classical deep neural networks (DNN) for the future quantum-ready society. The effectiveness of both DNN and QNN is validated by an in-house experiment for human pose recognition, achieving greater than 90% accuracy with a limited data size.
    Classification as Direction Recovery: Improved Guarantees via Scale Invariance. (arXiv:2205.08633v1 [stat.ML])
    Modern algorithms for binary classification rely on an intermediate regression problem for computational tractability. In this paper, we establish a geometric distinction between classification and regression that allows risk in these two settings to be more precisely related. In particular, we note that classification risk depends only on the direction of the regressor, and we take advantage of this scale invariance to improve existing guarantees for how classification risk is bounded by the risk in the intermediate regression problem. Building on these guarantees, our analysis makes it possible to compare algorithms more accurately against each other and suggests viewing classification as unique from regression rather than a byproduct of it. While regression aims to converge toward the conditional expectation function in location, we propose that classification should instead aim to recover its direction.  ( 2 min )
    Strategizing against Learners in Bayesian Games. (arXiv:2205.08562v1 [cs.LG])
    We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.  ( 2 min )
    All-Photonic Artificial Neural Network Processor Via Non-linear Optics. (arXiv:2205.08608v1 [physics.optics])
    Optics and photonics have recently captured interest as a platform to accelerate linear matrix processing, which has been deemed a bottleneck in traditional digital electronic architectures. In this paper, we propose an all-photonic artificial neural network processor wherein information is encoded in the amplitudes of frequency modes that act as neurons. The weights among connected layers are encoded in the amplitude of controlled frequency modes that act as pumps. Interaction among these modes for information processing is enabled by non-linear optical processes. Both the matrix multiplication and element-wise activation functions are performed through coherent processes, enabling the direct representation of negative and complex numbers without the use of detectors or digital electronics. Via numerical simulations, we show that our design achieves a performance commensurate with present-day state-of-the-art computational networks on image-classification benchmarks. Our architecture is unique in providing a completely unitary, reversible mode of computation. Additionally, the computational speed increases with the power of the pumps to arbitrarily high rates, as long as the circuitry can sustain the higher optical power.
    Multibit Tries Packet Classification with Deep Reinforcement Learning. (arXiv:2205.08606v1 [cs.NI])
    High-performance packet classification is a key component to support scalable network applications like firewalls, intrusion detection, and differentiated services. With the ever-increasing line rate in core networks, it becomes a great challenge to design a scalable and high-performance packet classification solution using hand-tuned heuristic approaches. In this paper, we present a scalable learning-based packet classification engine and its performance evaluation. By exploiting the sparsity of the ruleset, our algorithm uses a few effective bits (EBs) to extract a large number of candidate rules with just a few memory accesses. These effective bits are learned with deep reinforcement learning and are used to create a bitmap that filters out the majority of rules which do not need to be fully matched, improving online system performance. Moreover, our learning-based EB selection method is independent of the ruleset and can be applied to varying rulesets. Our multibit tries classification engine improves lookup time in both the worst and average case by 55% and reduces memory footprint, compared to a traditional decision tree without EBs.
    A graph representation of molecular ensembles for polymer property prediction. (arXiv:2205.08619v1 [cs.LG])
    Synthetic polymers are versatile and widely used materials. Similar to small organic molecules, a large chemical space of such materials is hypothetically accessible. Computational property prediction and virtual screening can accelerate polymer design by prioritizing candidates expected to have favorable properties. However, in contrast to organic molecules, polymers are often not well-defined single structures but an ensemble of similar molecules, which poses unique challenges to traditional chemical representations and machine learning approaches. Here, we introduce a graph representation of molecular ensembles and an associated graph neural network architecture that is tailored to polymer property prediction. We demonstrate that this approach captures critical features of polymeric materials, like chain architecture, monomer stoichiometry, and degree of polymerization, and achieves superior accuracy to off-the-shelf cheminformatics methodologies. While doing so, we built a dataset of simulated electron affinity and ionization potential values for >40k polymers with varying monomer composition, stoichiometry, and chain architecture, which may be used in the development of other tailored machine learning approaches. The dataset and machine learning models presented in this work pave the path toward new classes of algorithms for polymer informatics and, more broadly, introduce a framework for the modeling of molecular ensembles.  ( 2 min )
    Generic and Trend-aware Curriculum Learning for Relation Extraction in Graph Neural Networks. (arXiv:2205.08625v1 [cs.CL])
    We present a generic and trend-aware curriculum learning approach for graph neural networks. It extends existing approaches by incorporating sample-level loss trends to better discriminate easier from harder samples and schedule them for training. The model effectively integrates textual and structural information for relation extraction in text graphs. Experimental results show that the model provides robust estimations of sample difficulty and shows sizable improvement over the state-of-the-art approaches across several datasets.
    Learning to Learn Quantum Turbo Detection. (arXiv:2205.08611v1 [eess.SP])
    This paper investigates a turbo receiver employing a variational quantum circuit (VQC). The VQC is configured with an ansatz of the quantum approximate optimization algorithm (QAOA). We propose a 'learning to learn' (L2L) framework to optimize the turbo VQC decoder such that high fidelity soft-decision output is generated. Besides demonstrating the proposed algorithm's computational complexity, we show that the L2L VQC turbo decoder can achieve an excellent performance close to the optimal maximum-likelihood performance in a multiple-input multiple-output system.
    Deep Neural Network Classifier for Multi-dimensional Functional Data. (arXiv:2205.08592v1 [stat.ML])
    We propose a new approach, called functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data and is then used to predict the class label of a future data function. Unlike the popular functional discriminant analysis approaches, which rely on a Gaussian assumption, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data. Moreover, when the log density ratio possesses a locally connected functional modular structure, we show that FDNN achieves minimax optimality. The superiority of our approach is demonstrated through both simulated and real-world datasets.
    CV4Code: Sourcecode Understanding via Visual Code Representations. (arXiv:2205.08585v1 [cs.SE])
    We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.  ( 2 min )
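    The idea of treating a snippet as a two-dimensional grid of ASCII codepoints can be sketched as follows. This is only an illustration of the general idea: the fixed grid size, the padding value 0, the handling of non-ASCII characters, and the name `snippet_to_codepoint_grid` are all assumptions, not details taken from the paper.

```python
def snippet_to_codepoint_grid(code, height=8, width=16, pad=0):
    """Map a code snippet to a fixed-size 2-D grid of ASCII codepoints.

    Row = source line, column = character position. Lines and columns
    beyond the grid are cropped; missing cells and non-ASCII characters
    are filled with `pad` (an assumed convention).
    """
    lines = code.splitlines()
    grid = []
    for row in range(height):
        line = lines[row] if row < len(lines) else ""
        cells = []
        for col in range(width):
            if col < len(line) and ord(line[col]) < 128:
                cells.append(ord(line[col]))
            else:
                cells.append(pad)
        grid.append(cells)
    return grid


snippet = "def f(x):\n    return x + 1"
grid = snippet_to_codepoint_grid(snippet, height=4, width=12)
for row in grid:
    print(row)
```

    No tokenizer or parser is involved, so the same function accepts syntactically invalid code and code in any language; indentation survives as runs of the space codepoint (32) at the start of a row.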
    OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval. (arXiv:2205.08605v1 [cs.CL])
    Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model is able to train on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by 8.0 points in accuracy while using less than 0.6% of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of the ones finetuned on all language pairs under the same data budget with less than 2.0 points decrease in accuracy. Furthermore, with the same setup, scaling up the number of rich-resource language pairs monotonically improves the performance, reaching a minimum of 0.4 points discrepancy in accuracy, making it less mandatory to collect any low-resource parallel data. Finally, we conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size, up to a certain size threshold, rather than on what language pairs are used for training or evaluation.
    The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation. (arXiv:2205.08579v1 [cs.SD])
    Symbolic Music Generation relies on the contextual representation capabilities of the generative model, where the most prevalent approach is the Transformer-based model. In addition, the learning of long-term context is related to the dynamic segmentation of musical structures, i.e. intro, verse and chorus, which is currently overlooked by the research community. In this paper, we propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model the contexts at the global and section level, respectively. Concretely, we designed a Fragment Scope Localization layer to syncopate the music into sections, which were later used to pre-train the fine decoders. After that, we designed a Music Style Normalization layer to transfer the style information from the original sections to the generated sections to achieve consistency in music style. The generated sections are combined in the aggregation layer and fine-tuned by the coarse decoder. Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models. More excitingly, visual evaluation shows that our model is superior in melody reuse, resulting in more realistic music.
    Universal characteristics of deep neural network loss surfaces from random matrix theory. (arXiv:2205.08601v1 [math-ph])
    This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.
  • Open

    A Unified Linear Speedup Analysis of Stochastic FedAvg and Nesterov Accelerated FedAvg. (arXiv:2007.05690v3 [cs.LG] UPDATED)
    Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges in understanding the convergence of FL algorithms, particularly with regards to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)--arguably the most popular and effective FL algorithm class in use today--and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, a systematic study of how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting is lacking--a crucial issue whose answer would shed light on the performance of FedAvg in large FL systems in practice. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg under strongly convex smooth, convex smooth problems, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results from distributed optimization that assumes full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in convex settings. Empirical studies of the algorithms in various settings have supported our theoretical results.
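    The basic FedAvg loop that the analysis concerns can be sketched on a toy one-parameter least-squares problem. This is a simplified illustration under stated assumptions: every device participates in every round, local updates use full-batch gradient descent rather than stochastic gradients, and all devices share the same optimum (w = 2), so it does not exercise the heterogeneous or partial-participation settings the paper analyzes. The names `local_sgd` and `fedavg` and all hyperparameters are hypothetical.

```python
def local_sgd(w, data, lr, steps):
    """Run a few local gradient steps on one device's data for y ~ w * x."""
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fedavg(device_data, rounds=50, local_steps=5, lr=0.1):
    """FedAvg with full participation: local training, then weighted averaging."""
    w = 0.0
    total = sum(len(d) for d in device_data)
    for _ in range(rounds):
        # Each device starts from the current global model; raw data stays local.
        local = [local_sgd(w, d, lr, local_steps) for d in device_data]
        # The server averages local models, weighted by local sample counts.
        w = sum(len(d) * wl for d, wl in zip(device_data, local)) / total
    return w


# Three devices with different amounts of (x, y) data generated by y = 2x.
devices = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(0.5, 1.0)],
    [(1.5, 3.0), (1.0, 2.0), (0.5, 1.0)],
]
w = fedavg(devices)
print(w)
```

    Only model parameters cross the network in this loop; the number of local steps per round controls the trade-off between communication cost and how far local models drift apart before averaging.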
    Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. (arXiv:2205.09029v1 [stat.ML])
    Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
    Dependent Latent Class Models. (arXiv:2205.08677v1 [stat.ML])
    Latent Class Models (LCMs) are used to cluster multivariate categorical data (e.g. group participants based on survey responses). Traditional LCMs assume a property called conditional independence. This assumption can be restrictive, leading to model misspecification and overparameterization. To combat this problem, we developed a novel Bayesian model called a Dependent Latent Class Model (DLCM), which permits conditional dependence. We verify identifiability of DLCMs. We also demonstrate the effectiveness of DLCMs in both simulations and real-world applications. Compared to traditional LCMs, DLCMs are effective in applications with time series, overlapping items, and structural zeroes.
    Bayesian Discrete Conditional Transformation Models. (arXiv:2205.08594v1 [stat.ME])
    We propose a novel Bayesian model framework for discrete ordinal and count data based on conditional transformations of the responses. The conditional transformation function is estimated from the data in conjunction with an a priori chosen reference distribution. For count responses, the resulting transformation model is novel in the sense that it is a Bayesian fully parametric yet distribution-free approach that can additionally account for excess zeros with additive transformation function specifications. For ordinal categoric responses, our cumulative link transformation model allows the inclusion of linear and nonlinear covariate effects that can additionally be made category-specific, resulting in (non-)proportional odds or hazards models and more, depending on the choice of the reference distribution. Inference is conducted by a generic modular Markov chain Monte Carlo algorithm where multivariate Gaussian priors enforce specific properties such as smoothness on the functional effects. To illustrate the versatility of Bayesian discrete conditional transformation models, applications to counts of patent citations in the presence of excess zeros and on treating forest health categories in a discrete partial proportional odds model are presented.
    Deep Neural Network Classifier for Multi-dimensional Functional Data. (arXiv:2205.08592v1 [stat.ML])
    We propose a new approach, called the functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data and is then used to predict the class label of a future data function. Unlike the popular functional discriminant analysis approaches, which rely on a Gaussian assumption, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data. Moreover, when the log density ratio possesses a locally connected functional modular structure, we show that FDNN achieves minimax optimality. The superiority of our approach is demonstrated through both simulated and real-world datasets.
    Frank Wolfe Meets Metric Entropy. (arXiv:2205.08634v1 [stat.ML])
    The Frank-Wolfe algorithm has seen a resurgence in popularity due to its ability to efficiently solve constrained optimization problems in machine learning and high-dimensional statistics. As such, there is much interest in establishing when the algorithm may possess a "linear" $O(\log(1/\epsilon))$ dimension-free iteration complexity comparable to projected gradient descent. In this paper, we provide a general technique for establishing domain specific and easy-to-estimate lower bounds for Frank-Wolfe and its variants using the metric entropy of the domain. Most notably, we show that a dimension-free linear upper bound must fail not only in the worst case, but in the \emph{average case}: for a Gaussian or spherical random polytope in $\mathbb{R}^d$ with $\mathrm{poly}(d)$ vertices, Frank-Wolfe requires up to $\tilde\Omega(d)$ iterations to achieve an $O(1/d)$ error bound, with high probability. We also establish this phenomenon for the nuclear norm ball. The link with metric entropy also has interesting positive implications for conditional gradient algorithms in statistics, such as gradient boosting and matching pursuit. In particular, we show that it is possible to extract fast-decaying upper bounds on the excess risk directly from an analysis of the underlying optimization procedure.
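    For readers unfamiliar with the algorithm named in the abstract above, the following is a minimal sketch of the basic Frank-Wolfe iteration on the probability simplex. The objective (a simple quadratic), the step size $2/(t+2)$, and the names `frank_wolfe_simplex` and `objective` are illustrative assumptions and are not taken from the paper; the sketch shows only the standard sublinear-rate method, not the lower-bound constructions.

```python
def frank_wolfe_simplex(b, num_iters=200):
    """Minimize 0.5 * ||x - b||^2 over the probability simplex (illustrative objective)."""
    d = len(b)
    x = [1.0 / d] * d  # start at the barycenter of the simplex
    for t in range(num_iters):
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Linear minimization oracle: over the simplex, <grad, s> is minimized at a vertex.
        i_star = min(range(d), key=lambda i: grad[i])
        gamma = 2.0 / (t + 2.0)  # standard open-loop step size
        # Convex combination of the current iterate and the chosen vertex.
        x = [(1.0 - gamma) * xi for xi in x]
        x[i_star] += gamma
    return x


def objective(x, b):
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


b = [0.2, 0.5, 0.3]  # already inside the simplex, so the optimum is b itself
x_hat = frank_wolfe_simplex(b)
```

    Each iterate stays feasible without any projection step, which is the property that makes the method attractive for constrained problems.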
    Conformalized Online Learning: Online Calibration Without a Holdout Set. (arXiv:2205.09095v1 [cs.LG])
    We develop a framework for constructing uncertainty sets with a valid coverage guarantee in an online setting, in which the underlying data distribution can drastically -- and even adversarially -- shift over time. The technique we propose is highly flexible as it can be integrated with any online learning algorithm, requiring minimal implementation effort and computational cost. A key advantage of our method over existing alternatives -- which also build on conformal inference -- is that we do not need to split the data into training and holdout calibration sets. This allows us to fit the predictive model in a fully online manner, utilizing the most recent observation for constructing calibrated uncertainty sets. Consequently, and in contrast with existing techniques, (i) the sets we build can quickly adapt to new changes in the distribution; and (ii) our procedure does not require refitting the model at each time step. Using synthetic and real-world benchmark data sets, we demonstrate the validity of our theory and the improved performance of our proposal over existing techniques. To demonstrate the greater flexibility of the proposed method, we show how to construct valid intervals for a multiple-output regression problem that previous sequential calibration methods cannot handle due to impractical computational and memory requirements.
    Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks. (arXiv:2004.12551v2 [cs.LG] UPDATED)
    Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.
    A simple yet effective baseline for non-attributed graph classification. (arXiv:1811.03508v3 [cs.LG] UPDATED)
    Graphs are complex objects that do not lend themselves easily to typical learning tasks. Recently, a range of approaches based on graph kernels or graph neural networks have been developed for graph classification and for representation learning on graphs in general. As the developed methodologies become more sophisticated, it is important to understand which components of the increasingly complex methods are necessary or most effective. As a first step, we develop a simple yet meaningful graph representation, and explore its effectiveness in graph classification. We test our baseline representation for the graph classification task on a range of graph datasets. Interestingly, this simple representation achieves performance similar to that of the state-of-the-art graph kernels and graph neural networks for non-attributed graph classification. Its performance on classifying attributed graphs is slightly weaker as it does not incorporate attributes. However, given its simplicity and efficiency, we believe that it still serves as an effective baseline for attributed graph classification. Our graph representation is efficient (linear-time) to compute. We also provide a simple connection with graph neural networks. Note that these observations are only for the task of graph classification while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited number of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. (arXiv:2112.06868v2 [cs.LG] UPDATED)
    Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders, the conjecture is true (that is, VAE training does recover a generator with support equal to the ground truth manifold) and that this happens due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.
    Doubly Robust Collaborative Targeted Learning for Debiased Recommendations. (arXiv:2203.10258v2 [cs.IR] UPDATED)
    In recommender systems, the collected data always contains various biases and leads to the challenge of accurate predictions. To address selection bias and confounding bias, the doubly robust (DR) method and its variants show superior performance due to the double robustness property and smaller bias under inaccurate propensity and error imputation models. However, we theoretically show that the variance of the error imputation-based (EIB) method is much smaller than that of DR, although EIB may suffer from a much larger bias. In this paper, we propose a doubly robust targeted learning method that effectively combines the small-bias property of DR and the small-variance property of EIB, by leveraging the targeted maximum likelihood estimation technique. Theoretical analysis shows that the proposed targeted learning is effective in reducing the variance of DR while maintaining double robustness. To further reduce the bias and variance during the training process, we propose a novel collaborative targeted learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
    Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling. (arXiv:2205.08766v1 [cs.LG])
    Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v1 [cs.LG])
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
    A Central Limit Theorem, Loss Aversion and Multi-Armed Bandits. (arXiv:2106.05472v2 [math.PR] UPDATED)
    This paper studies a multi-armed bandit problem where the decision-maker is loss averse, in particular she is risk averse in the domain of gains and risk loving in the domain of losses. The focus is on large horizons. Consequences of loss aversion for asymptotic (large horizon) properties are derived in a number of analytical results. The analysis is based on a new central limit theorem for a set of measures under which conditional variances can vary in a largely unstructured history-dependent way subject only to the restriction that they lie in a fixed interval.
    GPU-accelerated partially linear multiuser detection for 5G and beyond URLLC systems. (arXiv:2201.05024v3 [eess.SP] UPDATED)
    In this feasibility study, we have implemented a recently proposed partially linear multiuser detection algorithm in reproducing kernel Hilbert spaces (RKHSs) on a GPU-accelerated platform. Partially linear multiuser detection, which combines the robustness of linear detection with the power of nonlinear methods, has been proposed for a massive connectivity scenario with the non-orthogonal multiple access (NOMA). This is a promising approach, but detecting payloads within a received orthogonal frequency division multiplexing (OFDM) radio frame requires the execution of a large number of inner product operations, which are the main computational burden of the algorithm. Although inner-product operations consist of simple kernel evaluations, their vast number poses a challenge in ultra-low latency (ULL) applications, because the time needed for computing the inner products might exceed the sub-millisecond latency requirement. To address this problem, this study demonstrates the acceleration of the inner-product operations through massive parallelization. The result is a GPU-accelerated real-time OFDM receiver that enables sub-millisecond latency detection to meet the requirements of 5th generation (5G) and beyond ultra-reliable and low latency communications (URLLC) systems. Moreover, the parallelization and acceleration techniques explored and demonstrated in this study can be extended to many other signal processing algorithms in Hilbert spaces, such as those based on projection onto convex sets (POCS) and adaptive projected subgradient method (APSM) algorithms. Experimental results and comparisons with the state of the art confirm the effectiveness of our techniques.
    New Lower Bounds for Private Estimation and a Generalized Fingerprinting Lemma. (arXiv:2205.08532v2 [cs.DS] UPDATED)
    We prove new lower bounds for statistical estimation tasks under the constraint of $(\varepsilon, \delta)$-differential privacy. First, we provide tight lower bounds for private covariance estimation of Gaussian distributions. We show that estimating the covariance matrix in Frobenius norm requires $\Omega(d^2)$ samples, and in spectral norm requires $\Omega(d^{3/2})$ samples, both matching upper bounds up to logarithmic factors. We prove these bounds via our main technical contribution, a broad generalization of the fingerprinting method to exponential families. Additionally, using the private Assouad method of Acharya, Sun, and Zhang, we show a tight $\Omega(d/(\alpha^2 \varepsilon))$ lower bound for estimating the mean of a distribution with bounded covariance to $\alpha$-error in $\ell_2$-distance. Prior known lower bounds for all these problems were either polynomially weaker or held under the stricter condition of $(\varepsilon,0)$-differential privacy.
    Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization. (arXiv:2205.08835v1 [cs.LG])
    There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.
    Meta-Learning Sparse Compression Networks. (arXiv:2205.08957v1 [stat.ML])
    Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing for sparsification to be employed in the inner-loop of commonly used Meta-Learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v2 [q-bio.NC] UPDATED)
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.
    On the Efficiency of Entropic Regularized Algorithms for Optimal Transport. (arXiv:1906.01437v9 [cs.DS] UPDATED)
    We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as \textit{Greenkhorn}, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by~\citet{Altschuler-2017-Near}. Second, we propose a new algorithm, which we refer to as \textit{APDAMD} and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm~\citep{Dvurechensky-2018-Computational} with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a \textit{deterministic} accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM)~\citep{Guminov-2021-Combination} in terms of $n$. Finally, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.
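    As background for the abstract above, here is a minimal sketch of the plain Sinkhorn iteration for entropic regularized optimal transport. It is the baseline method only, not Greenkhorn, APDAMD, or the accelerated variant; the cost matrix, marginals, regularization value `eta`, and the function name `sinkhorn` are illustrative assumptions.

```python
import math


def sinkhorn(cost, r, c, eta=0.5, num_iters=500):
    """Alternately rescale rows and columns of K = exp(-cost / eta) to match marginals r and c."""
    n, m = len(r), len(c)
    K = [[math.exp(-cost[i][j] / eta) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Row update: make the row sums of diag(u) K diag(v) equal r.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column update: make the column sums equal c.
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
r = [0.5, 0.3, 0.2]  # source marginal
c = [0.2, 0.3, 0.5]  # target marginal
plan = sinkhorn(cost, r, c)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
```

    Each sweep costs $O(n^2)$ operations for $n$ atoms, which is why the bounds in the abstract all carry a factor of at least $n^2$.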
    Classification as Direction Recovery: Improved Guarantees via Scale Invariance. (arXiv:2205.08633v1 [stat.ML])
    Modern algorithms for binary classification rely on an intermediate regression problem for computational tractability. In this paper, we establish a geometric distinction between classification and regression that allows risk in these two settings to be more precisely related. In particular, we note that classification risk depends only on the direction of the regressor, and we take advantage of this scale invariance to improve existing guarantees for how classification risk is bounded by the risk in the intermediate regression problem. Building on these guarantees, our analysis makes it possible to compare algorithms more accurately against each other and suggests viewing classification as unique from regression rather than a byproduct of it. While regression aims to converge toward the conditional expectation function in location, we propose that classification should instead aim to recover its direction.
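    The scale invariance noted in the abstract above can be checked numerically with a small sketch: rescaling a linear regressor by a positive constant leaves its classification (0-1) risk unchanged while its squared-loss regression risk changes. The toy data, the weight vector, and the function names are illustrative assumptions, not material from the paper.

```python
def zero_one_risk(w, xs, ys):
    """Fraction of points whose label differs from sign(<w, x>)."""
    errors = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score >= 0 else -1
        errors += pred != y
    return errors / len(xs)


def squared_risk(w, xs, ys):
    """Mean squared error of the linear score <w, x> against the +/-1 labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += (score - y) ** 2
    return total / len(xs)


xs = [(1.0, 2.0), (-1.0, 0.5), (2.0, -1.0), (-2.0, -1.0)]
ys = [1, -1, -1, -1]  # the third point is deliberately misclassified by w
w = (1.0, 0.5)
w_scaled = tuple(10.0 * wi for wi in w)  # same direction, different scale

risk_01 = zero_one_risk(w, xs, ys)
risk_01_scaled = zero_one_risk(w_scaled, xs, ys)
risk_sq = squared_risk(w, xs, ys)
risk_sq_scaled = squared_risk(w_scaled, xs, ys)
```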
    The Kernelized Taylor Diagram. (arXiv:2205.08864v1 [stat.ML])
    This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. The kernelized Taylor diagram builds on the widely used Taylor diagram, which is used to visualize similarities between populations. However, the Taylor diagram has several limitations such as not capturing non-linear relationships and sensitivity to outliers. To address such limitations, we propose the kernelized Taylor diagram. Our proposed kernelized Taylor diagram is capable of visualizing similarities between populations with minimal assumptions of the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.
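    The maximum mean discrepancy (MMD) mentioned in the abstract above can be estimated directly from two samples. The sketch below computes the biased (V-statistic) estimate of the squared MMD with a Gaussian kernel on scalar data; the kernel choice, bandwidth, sample values, and function names are assumptions made for illustration and say nothing about how the paper constructs its diagram.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: E k(x, x') + E k(y, y') - 2 E k(x, y)."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


sample_a = [0.0, 0.1, -0.1, 0.2]
sample_b = [3.0, 3.1, 2.9, 3.2]  # shifted well away from sample_a
mmd_same = mmd_squared(sample_a, sample_a)
mmd_diff = mmd_squared(sample_a, sample_b)
```

    The estimate is zero when a sample is compared with itself and grows as the two samples separate.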
    Exact Gaussian Processes for Massive Datasets via Non-Stationary Sparsity-Discovering Kernels. (arXiv:2205.09070v1 [stat.ML])
    A Gaussian Process (GP) is a prominent mathematical framework for stochastic function approximation in science and engineering applications. This success is largely attributed to the GP's analytical tractability, robustness, non-parametric structure, and natural inclusion of uncertainty quantification. Unfortunately, the use of exact GPs is prohibitively expensive for large datasets due to their unfavorable numerical complexity of $O(N^3)$ in computation and $O(N^2)$ in storage. All existing methods addressing this issue utilize some form of approximation -- usually considering subsets of the full dataset or finding representative pseudo-points that render the covariance matrix well-structured and sparse. These approximate methods can lead to inaccuracies in function approximations and often limit the user's flexibility in designing expressive kernels. Instead of inducing sparsity via data-point geometry and structure, we propose to take advantage of naturally-occurring sparsity by allowing the kernel to discover -- instead of induce -- sparse structure. The premise of this paper is that GPs, in their most native form, are often naturally sparse, but commonly-used kernels do not allow us to exploit this sparsity. The core concept of exact, and at the same time sparse GPs relies on kernel definitions that provide enough flexibility to learn and encode not only non-zero but also zero covariances. This principle of ultra-flexible, compactly-supported, and non-stationary kernels, combined with HPC and constrained optimization, lets us scale exact GPs well beyond 5 million data points.
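    To illustrate how a compactly supported kernel produces exact zeros in a covariance matrix, the sketch below uses a stationary Wendland function as a stand-in. This is an assumption chosen for brevity: the paper's kernels are non-stationary and learn their sparsity, which this example does not attempt. The function name, the support radius, and the input grid are hypothetical.

```python
def wendland_kernel(x1, x2, radius):
    """Wendland function (1 - r)^4 (4r + 1) for r < 1, and exactly 0 for r >= 1."""
    r = abs(x1 - x2) / radius
    if r >= 1.0:
        return 0.0
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)


points = [float(i) for i in range(10)]  # ten inputs on a unit-spaced grid
cov = [[wendland_kernel(a, b, radius=2.5) for b in points] for a in points]

# Entries for inputs farther apart than the support radius are exactly zero,
# so the covariance matrix can be stored and factorized in sparse form.
num_zeros = sum(1 for row in cov for value in row if value == 0.0)
```

    On this grid only pairs at distance 0, 1, or 2 have nonzero covariance, giving a banded matrix.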
    Distribution-free Prediction Sets Adaptive to Unknown Covariate Shift. (arXiv:2203.06126v3 [stat.ME] UPDATED)
    Predicting sets of outcomes -- instead of unique outcomes -- is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift -- a prevalent issue in practice -- poses a serious challenge and has yet to be fully solved. In this paper, we propose a novel flexible distribution-free method, PredSet-1Step, to construct prediction sets that can efficiently adapt to unknown covariate shift. We formally show that our method is \textit{asymptotically probably approximately correct}, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators. This is a technical tool of independent interest.
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v2 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). Then, we approximate the resultant GP by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). ICK is flexible and can be used to include prior information in neural networks in many applications. We demonstrate the strength of our framework by showing its superior performance and flexibility on both synthetic and real-world data sets. The code is available at: https://anonymous.4open.science/r/ICK_NNGP-17C5/.
    Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement. (arXiv:1810.09103v3 [cs.LG] UPDATED)
    Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy-regularization.
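    The cross entropy method that the abstract above builds on can be sketched for a single fixed state: sample actions from a broad Gaussian, keep the top percentile by action-value, and refit the Gaussian to those elites by maximum likelihood. This is plain CEM, not the paper's conditional CCEM or its proposal policy; the quadratic `action_value`, the hyperparameters, and the standard-deviation floor are assumptions made for illustration.

```python
import random


def action_value(action):
    """Hypothetical action-value for one state, maximized at action = 2."""
    return -(action - 2.0) ** 2


def cross_entropy_method(num_iters=20, num_samples=100, elite_frac=0.2, seed=0):
    rng = random.Random(seed)
    mean, std = 0.0, 3.0  # start with a broad policy
    num_elite = int(num_samples * elite_frac)
    for _ in range(num_iters):
        actions = [rng.gauss(mean, std) for _ in range(num_samples)]
        # Keep the actions in the top percentile by action-value.
        elites = sorted(actions, key=action_value, reverse=True)[:num_elite]
        # Maximum likelihood refit of the Gaussian to the elite actions.
        mean = sum(elites) / num_elite
        variance = sum((a - mean) ** 2 for a in elites) / num_elite
        std = max(variance ** 0.5, 1e-3)  # floor keeps sampling well defined
    return mean, std


final_mean, final_std = cross_entropy_method()
```

    The policy narrows with every refit, which is the concentration behaviour the abstract describes controlling through a slower-moving proposal policy.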
    A label efficient two-sample test. (arXiv:2111.08861v3 [cs.LG] UPDATED)
    Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes, and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at \url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
    Model-based Clustering with Missing Not At Random Data. (arXiv:2112.10425v2 [stat.ML] UPDATED)
    Traditional ways for handling missing values are not designed for the clustering purpose and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available on \url{https://github.com/AudeSportisse/Clustering-MNAR}. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that the statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations for the proposed sub-models on synthetic data and we illustrate the relevance of our method on a medical register, the TraumaBase® dataset.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v3 [stat.ME] UPDATED)
    Recent advances in probabilistic deep learning enable amortized Bayesian inference in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations represent reality somewhat inaccurately? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of SNPE-C (APT) and the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the learned latent data summary space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference undermining the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase when the distance between the latent summary distributions of the true data-generating process and the training simulations grows. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized simulation-based Bayesian inference.
    FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting. (arXiv:2205.08897v1 [cs.LG])
    Recent studies have shown the promising performance of deep learning models (e.g., RNN and Transformer) for long-term time series forecasting. These studies mostly focus on designing deep models to effectively combine historical information for long-term forecasting. However, the question of how to effectively represent historical information for long-term forecasting has not received enough attention, limiting our capacity to exploit powerful deep learning models. The main challenge in time series representation is how to handle the dilemma between accurately preserving historical information and reducing the impact of noisy signals in the past. To this end, we design a \textbf{F}requency \textbf{i}mproved \textbf{L}egendre \textbf{M}emory model, or {\bf FiLM} for short: it introduces Legendre Polynomial projections to preserve historical information accurately and Fourier projections plus low-rank approximation to remove noisy signals. Our empirical studies show that the proposed FiLM improves the accuracy of state-of-the-art models by a significant margin (\textbf{19.2\%}, \textbf{22.6\%}) in multivariate and univariate long-term forecasting, respectively. In addition, dimensionality reduction introduced by low-rank approximation leads to a dramatic improvement in computational efficiency. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the performance of most deep learning modules for long-term forecasting. Code will be released soon
    SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals. (arXiv:2004.09557v3 [cs.LG] UPDATED)
    Clinical settings are often characterized by abundant unlabelled data and limited labelled data. This is typically driven by the high burden placed on oracles (e.g., physicians) to provide annotations. One way to mitigate this burden is via active learning (AL) which involves the (a) acquisition and (b) annotation of informative unlabelled instances. Whereas previous work addresses either one of these elements independently, we propose an AL framework that addresses both. For acquisition, we propose Bayesian Active Learning by Consistency (BALC), a sub-framework which perturbs both instances and network parameters and quantifies changes in the network output probability distribution. For annotation, we propose SoQal, a sub-framework that dynamically determines whether, for each acquired unlabelled instance, to request a label from an oracle or to pseudo-label it instead. We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
    Sharp asymptotics on the compression of two-layer neural networks. (arXiv:2205.08199v2 [cs.IT] UPDATED)
    In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. For a ReLU activation function, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
    Bagged Polynomial Regression and Neural Networks. (arXiv:2205.08609v1 [stat.ML])
    Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of bagged polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v2 [cs.IT] UPDATED)
    Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to an uncorrelated learning. This finding leads to design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.

  • Open

    [D] Adversarial testing for Fairness and Biases
    Is it worthwhile to expose ML models to the public for adversarial testing of whether they meet fairness conditions, given how many edge cases there are to test for when deploying an ML model? (Similar to the concept of bug bounties in cybersecurity.) submitted by /u/blitzkreig3 [link] [comments]
    [P] Keras Launches a Computer Vision Extension Package
    Keras has launched a computer vision extension package. ​ Links: - https://keras.io/keras_cv/ - https://github.com/keras-team/keras-cv/ submitted by /u/puppet_pals [link] [comments]
    [N] Introducing Accelerated PyTorch Training on Mac
    https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/ submitted by /u/eigenlaplace [link] [comments]  ( 1 min )
    [Discussion] Are there any better Topic Modelling algorithms/models other than LDA?
    As the title suggests, could someone point me to some new Topic Modelling algorithms that have come up recently that are in some way better? My use-case is towards modelling tweets, if that helps! TIA EDIT: Approach is location based, meaning I will be increasing the document length by merging nearby tweets. submitted by /u/mrnerdy59 [link] [comments]  ( 1 min )
    [R] How is F1-score a better metric for unbalanced data sets?
    Alright, so bear with me: I have an unbalanced data set, let's say a +1 class accounting for 80% of my data set and a -1 class representing the remaining 20%. I fit some kind of model and report both accuracy and F1-score. F1-score is supposed to counteract the skewed data distribution, but since F1 is computed on the +1 class, which is the majority class, it's still going to be biased, right? To me, the answer would be taking the average F1-score for both classes, which sklearn calls the "macro" F1 score (I'm aware there are 2 very different takes on how "macro" metrics are computed), but people on the internet seem to get very angry when we use macro F1 for binary classification, although it sounds like a very reasonable choice to me. What do you guys think? Note: I'm doing a thesis in quantitative finance and a lot of papers use accuracy to artificially boost their models' predictive power. I'm trying to prove why this shouldn't be done and why macro F1 would yield much more realistic results. submitted by /u/delta9_ [link] [comments]  ( 2 min )
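    To make the comparison in this question concrete, here is a minimal sketch in plain Python. The toy labels and the helper name `f1_for_class` are assumptions made for illustration, not the poster's data or code. It uses the 80/20 split from the post and a degenerate model that always predicts the majority class.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 = 2*TP / (2*TP + FP + FN), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical toy data: 80% of labels are +1, and the model always predicts +1.
y_true = [1] * 8 + [-1] * 2
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_pos = f1_for_class(y_true, y_pred, positive=1)    # F1 on the majority class
f1_neg = f1_for_class(y_true, y_pred, positive=-1)   # F1 on the minority class
macro_f1 = (f1_pos + f1_neg) / 2                     # unweighted mean over classes

print(accuracy, round(f1_pos, 3), f1_neg, round(macro_f1, 3))
```

    On this toy example, accuracy and the majority-class F1 both look respectable, while the macro average is pulled down by the minority class that the model never predicts. That is the effect the post describes.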
    [D] Those at FAANG, do you use AutoML?
    Do you use AutoML? I see Microsoft, Google, Amazon and etc... all have their own AutoML SDK/Services. submitted by /u/stevofolife [link] [comments]  ( 1 min )
    [D] KDD Notifications Thread
    I don't think I usually see the data mining conference much here. So anyway, hoping to hear that KDD might love me submitted by /u/EdwardRaff [link] [comments]
    [N] Flower Summit 2022
    On May 31, 2022, the Flower Community will come together for the second Flower Summit 2022. Join experts in the field of federated learning and find out how Flower accelerates the development of systems in both research and production scenarios. All speakers and the corresponding time schedule are final now. You can expect speakers from Intel, Google/MLCommons, Brave, University of Cambridge, AI Sweden, and many more. Block your calendar and register now: https://flower.dev/conf/flower-summit-2022/ https://preview.redd.it/1hcnl3f678091.png?width=1200&format=png&auto=webp&s=b1cc18b40435e8c9ae21d825b43a5b2143fafc2e submitted by /u/burnai [link] [comments]  ( 1 min )
    [D] MNIST equivalent dataset for RNN/LSTM/Transformer?
    Hi, I'm developing a course material and looking for a nice introductory task/dataset for RNN/LSTM/Transformers. I can use recurrent networks for MNIST too, but I'm looking for a more "classic" example such as sequence prediction or classification. Is there any? Preferably it's relatively simple, clean and well-known dataset like MNIST. Thank you very much. submitted by /u/euske [link] [comments]  ( 2 min )
    [R] Handcrafted localized phase features for human action recognition
    This paper https://paperswithcode.com/paper/handcrafted-localized-phase-features-for claims to achieve 98% top-1 accuracy on Kinetics-400 and 96.35 on Kinetics-700. From their description, they compute phase correlation on large patches between consecutive frames and then use that in a k-NN classifier. I didn't find any extra info in the paper about the method, and frankly I find it hard to believe this beats all of the recent state-of-the-art methods. What do you think? Maybe a (possibly unintended) foul in the evaluation method? submitted by /u/AmirRosenfeld [link] [comments]  ( 1 min )
    [R] Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks
    submitted by /u/martenlienen [link] [comments]  ( 1 min )
    [D] Paper for Object Detection
    Recently, my team and I created a dataset with annotated labels for object detection. We plan to make a paper out of it. However, object detection is quite trivial and by that I mean that we try 3-4 approaches (SSD, Faster R-CNN, etc) with some optimization (e.g. same parameters on all approaches and fine-tune best performed?). Is it good material for a paper? I think if we present the findings as a dataset and benchmark experiments it would be sufficient. Shall we try more complex pipelines? submitted by /u/giakou4 [link] [comments]  ( 2 min )
    [D] Any recommended do's/don'ts for the rebuttal phase?
    Rebuttal is the stage where preliminary reviews have been released to the authors, and the authors now have the chance to address the reviewers' concerns. Do you have some essential do's and don'ts which you follow? It may be about writing responses which disagree with a reviewer, or perhaps about presenting results of additional experiments which were requested. submitted by /u/PaganPasta [link] [comments]  ( 1 min )
    [N] Apple Executive Who Left Over Return-to-Office Policy Joins Google AI Unit: Ian Goodfellow, a former director of machine learning at Apple, is joining DeepMind.
    According to an article published in Bloomberg, An Apple Inc. executive who left over the company’s stringent return-to-office policy is joining Alphabet Inc.’s DeepMind unit, according to people with knowledge of the matter. Ian Goodfellow, who oversaw machine learning and artificial intelligence at Apple, left the iPhone maker in recent weeks, citing the lack of flexibility in its work policies. The company had been planning to require corporate employees to work from the office on Mondays, Tuesdays and Thursdays, starting this month. That deadline was put on hold Tuesday, though. https://www.bloomberg.com/news/articles/2022-05-17/ian-goodfellow-former-apple-director-of-machine-learning-to-join-deepmind submitted by /u/hardmaru [link] [comments]  ( 2 min )
    [D] Are there any tracking approaches using DeepSORT + a detector that isn't YOLO?
    Hey, I've been wondering: all the repositories I could find on the internet are YOLO detectors combined with DeepSORT. I couldn't find others. What I'm most interested in would be EfficientDet or CenterNet combined with DeepSORT; are there any repositories you know of? Over the last few days I've become fascinated with the TensorFlow 2 Object Detection API, and I was wondering if I could use my weights from there combined with DeepSORT. Best regards submitted by /u/HolidayLobster1355 [link] [comments]  ( 1 min )
  • Open

    Google AI explains how Assistant answers your follow-up questions
    submitted by /u/Zirius_Sadfaces [link] [comments]
    The dream of destruction (made with starryai)
    submitted by /u/akhlys98 [link] [comments]
    What are your opinions on The Vital Intelligence, the body vital scanning technology?
    submitted by /u/BossBossZR [link] [comments]
    AI Dream 48 - Space Odyssey HAL 9000
    submitted by /u/LordPewPew777 [link] [comments]
    Researchers use artificial intelligence to help autonomous vehicles avoid idling at red lights.
    submitted by /u/qptbook [link] [comments]
    Glass painting decor.
    submitted by /u/cookingandcraft [link] [comments]
    This article showcases how to annotate semi-structured texts, whether PDFs or scanned images, using UBIAI’s annotation tool
    submitted by /u/UBIAI [link] [comments]
    Yikes, the subbot has gone a little racist
    submitted by /u/orgeezuz [link] [comments]
    Using Deep Learning for Programming Language Translation
    submitted by /u/VikasOjha666 [link] [comments]
    Google has found a new use case for large AI language models: Job interviews.
    submitted by /u/much_successes [link] [comments]
    Purple Sunset (made with starryai)
    submitted by /u/Losthel [link] [comments]
    Your AI Weekly Digest (newsletter)! I just shared a new iteration covering BlobGAN
    submitted by /u/OnlyProggingForFun [link] [comments]
    Twitter Art Club
    submitted by /u/VIRUS-AOTOXIN [link] [comments]
    AIRS in the AIR: Modular Self-reconfigurable Robot (Session 2) Speakers: Michael Rubenstein and Tin Lun Lam
    submitted by /u/nousetest [link] [comments]  ( 1 min )
    Exploring the potential of MIDI sequence generation using Machine Learning - Survey
    I am a student currently studying Software Development at MCAST. For my degree thesis, I am exploring the potential of MIDI sequence generation using Machine Learning techniques. To evaluate the implemented algorithm, I created a questionnaire which asks respondents to rate 10 different samples. No personally identifiable information will be collected in this questionnaire. I would greatly appreciate if you can spare around 5 minutes to take part in this questionnaire. https://www.survio.com/survey/d/Y1W7D8P1X3J7F8U7Y submitted by /u/drinu98 [link] [comments]  ( 1 min )
    9 Best Artificial Intelligence books for beginners to expert to read in 2022
    submitted by /u/maneesh123456 [link] [comments]
  • Open

    Logging in Python
    Logging is a way to store information about your script and track events that occur. When writing any complex script in Python, logging is essential for debugging software as you develop it. Without logging, finding the source of a problem in your code may be extremely time consuming. After completing this tutorial, you will know: […] The post Logging in Python appeared first on Machine Learning Mastery.  ( 22 min )
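    As a small illustration of the idea in this teaser, the sketch below uses only Python's standard `logging` module. The logger name `demo` and the messages are made up for the example, and the handler writes to an in-memory buffer only so the result is easy to inspect. A real script would more likely log to the console or a file.

```python
import io
import logging

# In-memory destination so the example is self-contained; a script would
# typically use logging.FileHandler or the default stderr stream instead.
stream = io.StringIO()

logger = logging.getLogger("demo")      # hypothetical logger name
logger.setLevel(logging.DEBUG)          # the logger itself lets everything through
logger.propagate = False                # keep records away from the root logger

handler = logging.StreamHandler(stream)
handler.setLevel(logging.INFO)          # the handler drops anything below INFO
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)

logger.debug("loaded config")           # filtered out by the handler level
logger.info("started")
logger.warning("disk space low")

output = stream.getvalue()
print(output, end="")
```

    Separating the logger level from the handler level is one common way to keep verbose debug records available for one destination while showing only the important events in another.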
  • Open

    DALL·E 2 Research Preview Update
    Early users have created over 3 million images to date and helped us improve our safety processes. We're excited to begin adding up to 1,000 new users from our waitlist each week.  ( 1 min )
  • Open

    How to concatenate feature vectors of different length?
    I have a very basic question. In an architecture like the one in the attached image, how do you concretely concatenate the feature vector coming from the conv nets and the attention WITH the feature vector coming from the scalar observation (direction, position, etc.)? One of them might be something like a 1x10000 vector, while the other two might be way shorter, say 1x4 for the direction and 1x2 for the position ​ https://arxiv.org/abs/2104.07750 submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
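    One common answer to this question is that concatenation along the feature axis does not require equal lengths: each input is flattened to a 1-D vector and the vectors are joined end to end. The sketch below shows this with plain Python lists. The sizes come from the post, the names are hypothetical, and it is not claimed to be what the linked paper does.

```python
def concat_features(*vectors):
    """Join any number of 1-D feature vectors end to end."""
    joint = []
    for vec in vectors:
        joint.extend(vec)
    return joint

# Sizes taken from the question; the values are placeholders.
conv_features = [0.0] * 10000        # flattened conv/attention output, 1x10000
direction = [1.0, 0.0, 0.0, 0.0]     # scalar observation, 1x4
position = [0.25, 0.75]              # scalar observation, 1x2

joint = concat_features(conv_features, direction, position)
print(len(joint))
```

    In a tensor library the same operation is a concatenation along the feature dimension (for example `torch.cat` with `dim=1` on batched inputs). Because six scalar inputs can be swamped by 10000 image features, a frequent design choice is to pass the short vectors through a small learned layer first, or to reduce the image features, before concatenating.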
    Generative Trajectory Modelling : a "complete shift" in the Reinforcement Learning paradigm.
    submitted by /u/moschles [link] [comments]  ( 1 min )
    Installing & Using MuJoCo 2.1.5 with OpenAI Gym
    Hi all, I finally got my environment set up with MuJoCo and now I would like to use it through OpenAI Gym to train some agents. I read that before DeepMind took it over, the installation process was very annoying. Is it still that way, or is there an easier solution in the meantime? I am following the instructions at https://github.com/openai/mujoco-py, just with version 2.1.5, and I get the following error when I try to import mujoco_py. A more general question: is there a better alternative to OpenAI Gym for working with the newer MuJoCo versions? I'd be very thankful for any tips :) https://preview.redd.it/i47fibtpw7091.png?width=1113&format=png&auto=webp&s=7f6300a24b75e2c9d7938859b6de35982a502801 submitted by /u/disdisinform [link] [comments]  ( 2 min )
    10 Best Deep Reinforcement Learning Courses
    submitted by /u/MlTut [link] [comments]
    Replacing Model Based Predictive Control with RL, does anyone have any resources that I can look into for such an activity... for now I'm just thinking about using an active MPC for "training".
    submitted by /u/veezion123 [link] [comments]  ( 1 min )
    Double DQN algorithms converge on only one action.
    I have taken some reference implementations of the DDQN algorithm and am trying to create an agent which can trade in the forex market. Unfortunately, from the 2nd trial onwards (after training the DDQN for the first time), the probability distribution of the actions converges on only one action, and both the loss and the reward loss fluctuate. Dataset: 13k; Batch_size: 64; Update_rl: 6; Learning rate: 0.001; Gamma: 0.99; Reward: -1 to 1 (depending on profit and loss) submitted by /u/laxuu [link] [comments]  ( 1 min )
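    For readers comparing their own implementation against the algorithm named in this post, here is a minimal sketch of the standard Double DQN target for a single transition: the online network picks the next action and the target network evaluates it. The function name and the example Q-values are assumptions for illustration, and this does not diagnose the poster's collapse to one action.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double DQN target for one transition.

    q_online_next / q_target_next: Q-values of the NEXT state for every action,
    from the online and target networks respectively.
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # Action SELECTION uses the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... action EVALUATION uses the target network.
    return reward + gamma * q_target_next[best_action]

# Hypothetical Q-values for three actions (e.g. buy / hold / sell).
target = double_dqn_target(
    reward=0.5, done=False, gamma=0.99,
    q_online_next=[0.1, 0.9, 0.3],
    q_target_next=[0.2, 0.4, 0.8],
)
print(target)
```

    Note that the target uses the target network's value for the action chosen by the online network (index 1 here), not the target network's own maximum. Mixing these up turns the update back into plain DQN.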
    How to Leverage Reinforcement Learning • Phil Winder & Rebecca Nugent
    submitted by /u/goto-con [link] [comments]
    Are there any text-based game environments for RL agents to train on similar to Textworld or Jericho?
    submitted by /u/cz1xrnvz [link] [comments]  ( 1 min )
  • Open

    Use Amazon Lex to capture street addresses
    Amazon Lex provides automatic speech recognition (ASR) and natural language understanding (NLU) technologies to transcribe user input, identify the nature of their request, and efficiently manage conversations. Lex lets you create sophisticated conversations, streamline your user experience to improve customer satisfaction (CSAT) scores, and increase containment in your contact centers. Natural, effective customer interactions require […]  ( 10 min )
  • Open

    Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps
    Wow, my blog “Is Data Mesh Fool’s Gold? Creating a Business-centric Data Strategy” created quite a stir.  And that was my intention. I actually believe that the Data Mesh is an important data management and governance framework (yes, the Data Mesh is more of a framework than a technology) for helping organizations deliver a business-driven… Read More »Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps The post Part 2: Is Data Mesh Fool’s Gold? Not if You Avoid the Traps appeared first on Data Science Central.  ( 5 min )
  • Open

    Vector-Quantized Image Modeling with Improved VQGAN
    Posted by Jiahui Yu, Senior Research Scientist, and Jing Yu Koh, Research Software Engineer, Google Research In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora. This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer …  ( 7 min )
  • Open

    How to Make Oracle AIs Safe
    The Counterfactual Oracle AI Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 4 min )
  • Open

    Living better with algorithms
    Graduate student Sarah Cen explores the interplay between humans and artificial intelligence systems, to help build accountability and trust.  ( 7 min )

  • Open

    [N] Gradio Blocks + Hugging Face event is starting this week. A hackathon type event from May 17th to May 31st with prizes in which we will create interactive web demos for state-of-the-art machine learning models
    We are happy to invite you to the Gradio Blocks Party - a community event in which we will create interactive demos for state-of-the-art machine learning models. Demos are powerful because they allow anyone — not just ML engineers — to try out models in the browser, give feedback on predictions, identify trustworthy models. The event will take place from May 17th to 31st. We will be organizing this event on Huggingface: https://huggingface.co/Gradio-Blocks and the Hugging Face discord channel. Prizes will be given at the end of the event, see the Prizes section We will be building demos using the new Gradio Blocks API. Blocks allows you to build web-based demos in a flexible way using the Gradio library. Gradio is a popular choice for building demos for machine learning models, as it allows you to create web-based UIs all in Python. For example, here is a UI for Dall-E Mini using Gradio Blocks: ​ https://reddit.com/link/ury6a9/video/p8m2arag24091/player ​ submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    Are there any research groups that try to find quantitative laws of deep learning? [D]
    I’m a graduate student in a deep learning application area. How model development works is that you have a recipe of tricks and guidelines you play around with and cross-validate different versions of until you find one that works well. Most research trying to improve accuracy on proxy datasets seems a waste of time to me and I’m more interested in fundamental patterns in learning. There is this wealth of experiments the community is producing but the theoretical insight is nothing more than a bunch of rules of thumb. I feel like it’s time to split apart experimental and theoretical deep learning, like experimental and theoretical physics co-exist, and try to figure out quantitative laws of deep learning. I get that deep learning is a non-convex optimisation problem and that we can’t prove much, but what we can do is try to figure laws out. Newton’s law of gravity was also not proven from other axioms. While deep learning does not seem fully amenable to mathematical analysis, it is to physical analysis. I’m asking for some theories, not theorems. Does anyone know of any work on this or work in progress? I’m not talking about physics-informed deep learning, DL applied to physics or physical theory used to improve DL; just to be clear. Work I would find interesting would try to, from the wealth of experimental DL data, uncover fundamental, quantitative relationships between architecture, optimisation procedure, dataset and performance. submitted by /u/lemlo100 [link] [comments]  ( 2 min )
    [D] Practical correlation metric for a large number of vectors
    I am dealing with a time series consisting of input flow sampled every 5 minutes over 441 days. My aim is to find any possible correlation in data coming from (1) the same day of the week and (2) the same moment in time (e.g. 14:35, 14:40, etc.). I proceeded to sample according to weekdays and hours. Then I computed the 63x63 correlation matrix for each of the weekdays and a 441x441 matrix for each hour, which in the second case is pretty impractical. I feel like the dataset is too broad to categorise with a simple yes-or-no question, yet that's what I've been tasked with. So my question is whether I can try autocorrelation and see if it yields parameters p and q for an ARIMA model, or would you suggest another, more succinct approach that may give a broader picture of the data? submitted by /u/Ambitious-Donut1321 [link] [comments]  ( 3 min )
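    The autocorrelation idea raised in this post can be sketched in a few lines of plain Python. Instead of one large correlation matrix per weekday or per time slot, a single number per lag says how strongly the series resembles itself one day (288 five-minute samples) or one week (2016 samples) later. The synthetic daily sinusoid below is an assumption standing in for the poster's flow data.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

SAMPLES_PER_DAY = 24 * 60 // 5           # 288 samples at a 5-minute interval
SAMPLES_PER_WEEK = 7 * SAMPLES_PER_DAY   # 2016 samples, the lag to check for weekday effects

# Hypothetical stand-in data: a pure daily cycle over 14 days.
series = [math.sin(2 * math.pi * t / SAMPLES_PER_DAY)
          for t in range(14 * SAMPLES_PER_DAY)]

daily = autocorrelation(series, SAMPLES_PER_DAY)          # strongly positive
half_day = autocorrelation(series, SAMPLES_PER_DAY // 2)  # strongly negative
print(round(daily, 3), round(half_day, 3))
```

    On real data, pronounced peaks at lags 288 and 2016 would support the "same moment in time" and "same day of the week" hypotheses respectively, and the full autocorrelation curve is the usual starting point for choosing seasonal ARIMA orders.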
    [D] What are the best information retrieval methods today?
    Given the advent of powerful language models using transformers, what IR systems are most effective today? Would love to read recent papers and blogs about the same! submitted by /u/evilBotman [link] [comments]  ( 1 min )
    [D] Is representative training data always a good thing?
    https://preview.redd.it/65gejwx5q2091.png?width=1000&format=png&auto=webp&s=107c54464a27b005bf139eabd405134dafe94d15 More like this at: https://www.evilaicartoons.com/ submitted by /u/HAIL-9000 [link] [comments]
    [D] Could you technically use continuous numeric labels instead of one-hot labels for multiclass classification?
    This would work almost like an embedding for label classes, every class corresponds to a vector of continuous numeric values. You could identify the correct class based on a nearest neighbouring label vector for a given inference. The only problem I see is how you would initialize the label embedding before training. One advantage I see is that you could add classes without retraining the network. submitted by /u/mrwafflezzz [link] [comments]  ( 1 min )
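    The inference step described in this post, picking the class whose label vector is nearest to the network output, can be sketched as follows. The class names and 2-D label vectors are made-up placeholders, and how the vectors would be initialised or trained is left open, as it is in the post.

```python
import math

def nearest_class(output, label_vectors):
    """Return the class whose label vector is closest (Euclidean) to `output`."""
    return min(label_vectors, key=lambda name: math.dist(output, label_vectors[name]))

# Hypothetical label embeddings, one continuous vector per class.
label_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "bird": [-1.0, 0.0],
}

pred = nearest_class([0.8, 0.3], label_vectors)

# Adding a class only means adding a vector; the lookup itself is unchanged.
label_vectors["fish"] = [0.0, -1.0]
pred_new = nearest_class([0.1, -0.9], label_vectors)
print(pred, pred_new)
```

    Whether the network produces useful outputs for a class it never saw during training is a separate question; the sketch only shows that the decision rule itself does not need retraining.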
    [D] Which software packages are used for evaluation of automatic summarization besides ROUGE?
    Is there any easily findable software besides ROUGE for this purpose? submitted by /u/Goldback_Gorilla [link] [comments]
    [D] Understanding audio sampling function for a speech synthesis WGAN
    Hello, I've recently made a post on this subreddit asking for clarifications on a certain paper describing an implementation of WGAN-GP for speech synthesis from silent videos. Those answers were all really helpful in better understanding the learning process, however more have cropped up as I began digging deeper. I'm currently attempting training a hybrid model between the architectures described in these two papers, with a generator and objective function from the former and the critic and PASE blocks from the latter. After not seeing any signs of convergence after 12 hours of training (~20 epochs on 4 speakers from the GRID corpus, roughly 8000 seconds of data), I read further and found some info regarding the data sampling in both articles. The first one states: The audio clips …  ( 2 min )
    [D] Estimation of marginal likelihood - what's the SOTA
    Hi! I face a problem in which I should estimate the marginal likelihood. My differentiable model is of moderate dimensionality (circa 50) and I do have a decent sample from the posterior. The model is expensive to calculate/propagate and it's implemented in an AD framework. I checked some of the works in the field; many of them seem to be developed for cosmological data. I am aware of nested sampling, the subdomain methods of M. D. Weinberg and Delaunay triangulation approaches. Still, as I am completely new to the field, I don't know what to expect w.r.t. the accuracy of the methods. What is your experience? Do you know of any decent comparison studies or other techniques worth exploring? submitted by /u/msusik [link] [comments]  ( 1 min )
    [D] How to choose dimensions for latent space ?
    Hi, I want to make a clustering on bioinformatics data. I have a matrix of 129 samples and 3000 features (OTUs). I was thinking as a preprocessing step to reduce the dimensions of this Matrix and see if I can achieve better clustering. I thought of using Autoencoders to get the latent representation of the data and then apply probably k-means and see the results. How can I choose the latent space dimensions ? I have read some papers where they do the same but they don't explain how they choose the dimensions. submitted by /u/grisp98 [link] [comments]  ( 5 min )
  • Open

    Ensemble Reinforcement Learning
    Could you point me to some papers that use ensembling techniques to improve efficiency of RL algorithms please? Is this an area that is actively researched? If so, what is the SOTA for it? submitted by /u/SirRantcelot [link] [comments]  ( 1 min )
    How to teach agent not to take invalid actions?
    Hi, I'm working in an environment where there are 3 possible actions, but some of them are sometimes invalid/impossible, depending on the state. I am giving the key piece of information about the current state (binary 0/1) to the agent in the observation space (among other info), in hopes that it would figure out that when state=0, I cannot take action 1, but when state=1, I cannot take action 3. To do this, I have tried: (1) giving a huge penalty upon using an invalid move and ending the episode; (2) giving a huge penalty upon using an invalid move, not changing the state and allowing the episode to continue. Neither of these seemed to have any effect, as the agent continued making invalid moves. Any advice or insight would be greatly appreciated! I'm pretty new to RL. Thank you! submitted by /u/VladimirB-98 [link] [comments]  ( 2 min )
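    A commonly suggested alternative to penalties for this kind of problem is action masking: invalid actions are given zero probability before sampling, so the agent never has to learn to avoid them. The sketch below assumes a policy that outputs one logit per action. The function names and logits are hypothetical, and the validity rule follows the post (state 0 forbids action 1, state 1 forbids action 3, with actions stored at indices 0 to 2).

```python
import math

def valid_actions(state):
    """Validity mask for actions 1..3 (indices 0..2), following the rule in the post."""
    return [state != 0, True, state != 1]

def masked_softmax(logits, valid):
    """Softmax over logits with invalid actions forced to probability 0.

    Assumes at least one action is valid.
    """
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 1.0]                          # hypothetical policy outputs
probs_state0 = masked_softmax(logits, valid_actions(0))
probs_state1 = masked_softmax(logits, valid_actions(1))
print(probs_state0, probs_state1)
```

    For value-based agents the analogous trick is to take the argmax only over valid actions. In both cases the same mask should be applied during training and during evaluation.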
    Simulating random RGB images and observation space for RL model
    Hi, I'm a bit confused about the usage of RGB images in RL. I'm trying to create random tensors (that are meant to imitate a visual observation and an observation space) to feed in my model and see whether it works correctly. What I don't understand is: concretely, how would I randomly initialize a tensor for the observation space (for the init method) and one for the observation (for the forward method)? Would this be a realistic observation? And how would I initialize the obs space? obs = torch.randn(4, 64, 64, 3) Second question: as far as I understood, the first dimension with PyTorch is the batch size. What does it represent exactly? submitted by /u/No_Possibility_7588 [link] [comments]  ( 2 min )
    It is ok to roast me, but I need help about my project.
    I know that Connect 4 is solved, but I am trying to train value-based RL agents to play Connect 4 with MARL. I have trained both agents many times, and the policy always converges to a very poor one: they just commit and rush four in one column without trying to block the opponent, and after long training the agent who goes first always wins. Here is my code. Please give me suggestions for improvement. I know that MCTS is a good fit for this case, but I haven't learned it yet; I am currently focusing on value-based methods. Also, are eligibility traces dead? Note: I tried an ANN (flattening the state), and it converged to a stupid strategy faster; now I am using a CNN (which also ends up with a stupid policy). submitted by /u/Professional_Card176 [link] [comments]  ( 2 min )
    Observation vector consisting only of previous action and reward: Isn't that a multi-armed bandit problem?
    Hello redditors of RL, I am doing joint research on RL and wireless communications, and I am observing a trend in a lot of the problem formulations people use there: sometimes, the observation vector of the "MDP" is defined as simply containing the past action and reward (usually without any additional information). Given that all algorithms collect experience tuples of (s, a, r, s'), would you agree with the following statements? Assuming a discrete action space, if s_t contains only [a_{t-1}, r_{t-1}], isn't that the same as having no observations, since you already have this information in your experience tuple? Taking it a step further, isn't that a multi-armed bandit scenario? I.e., assuming the stochastic process that generates the rewards is stationary, the optimal "policy" essentially always selects one action. This is not an MDP (or rather, it is "trivially" an MDP), wouldn't you agree? Even if s_t includes other information, isn't the incorporation of [a_{t-1}, r_{t-1}] simply unnecessary? Assuming a continuous action space, couldn't this problem be treated similarly to the (discrete) multi-armed bandit problem, as long as you adopt a parametric model for learning the distributions of the rewards conditioned on the actions? submitted by /u/SomeParanoidAndroid [link] [comments]  ( 2 min )
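    The stationary case in the question can be illustrated with a plain epsilon-greedy bandit that keeps one running reward estimate per action and never looks at an observation. This is a sketch under assumed settings: the Gaussian reward model, the arm means, and the name `run_bandit` are illustrative and not taken from the post.

```python
import random


def run_bandit(arm_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a stationary bandit with unit-variance Gaussian rewards."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)    # number of pulls per arm
    values = [0.0] * len(arm_means)  # sample-average reward estimate per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_means))  # explore uniformly
        else:
            arm = max(range(len(arm_means)), key=lambda a: values[a])  # exploit
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        # Incremental mean: no state is used, only the action and its reward.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts
```

    If the reward process is stationary, the greedy choice settles on a single arm, which matches the "always selects one action" argument. If the rewards depend on history, the previous action and reward can carry real information and the bandit view no longer applies.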
    Sustainability applications
    Is there any recent paper or work done applying RL in the field of sustainability and sustainable development? submitted by /u/blitzkreig3 [link] [comments]
  • Open

    Contextual Rephrasing in Google Assistant
    Posted by Aurelien Boffy, Senior Staff Software Engineer, and Roberto Pieraccini, Engineering Director, Google Assistant When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by…  ( 7 min )
  • Open

    Can artificial intelligence overcome the challenges of the health care system?
    MIT and Mass General Brigham researchers and physicians connect in person to bring AI into mainstream health care.  ( 6 min )
    On the road to cleaner, greener, and faster driving
    Researchers use artificial intelligence to help autonomous vehicles avoid idling at red lights.  ( 7 min )
  • Open

    How much does one day of intensive OpenAI use cost?
    submitted by /u/huberpaul [link] [comments]  ( 1 min )
    Microsoft AI Team Introduces “Federated Learning Utilities and Tools for Experimentation” (FLUTE): A High-Performance Open-Source Platform For Federated Learning Research And Offline Simulations
    Distributed Training (DT), which focuses on scaling the model training process via model or data parallelism, has gotten much interest because of an increase in training datasets. On the other hand, DT makes some assumptions, particularly in terms of communication and network parameters. Furthermore, new data management restrictions are arising due to the growing requirement for personal data protection, making data more inaccessible due to storage behind firewalls or on users’ devices without the option of being shared for centralized training. Federated Learning is a decentralized machine learning approach that emphasizes collaborative training and data privacy for users. The central concept underlying federated learning is that these machine learning models are highly versatile when it comes to training sophisticated models over large amounts of data without having to share that data with a centralized body. Despite its popularity as a research topic, it is challenging to deploy since it differs significantly from typical machine learning pipelines. Local data variety, end-node hardware diversity, privacy concerns, and optimization limits are challenges in federated learning. Furthermore, federated learning applications frequently need to extend the learning process to millions of clients to imitate a real-world environment. These difficulties highlight the necessity for a simulation platform that allows researchers and developers to conduct proof-of-concept implementations and verify performance before creating and deploying their machine learning models. Quick Read Paper: https://arxiv.org/pdf/2203.13789.pdf Github: https://github.com/microsoft/msrflute submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Overview of XGBoost and Gradient Boosting
    submitted by /u/aidev2040 [link] [comments]
    "I used AI to generate a music video from song lyrics" by [DoodleChos]
    submitted by /u/VegiHarry [link] [comments]  ( 1 min )
    I've finally launched MagicNote - a new platform where you can write better in your second language with GPT-3
    Hello Guys, I'm happy to announce I've finally launched MagicNote - a new platform where you can write better on demand for your work or business in your second language using gpt-3. https://magicnote.ai It was a wild ride getting here, but I finally made it after 10-11 months of hard work. The majority of my time was spent validating the idea, not the product. I've done my best to make my AI writing assistant as affordable as possible – it's one of the cheapest on the market. For me, this is more about making a difference in people’s lives than making money. To be honest, I developed it primarily for myself and have been still using it. Communication with colleagues, clients, potential customers, or employees matters, especially if it is in your foreign language writing. MagicNote helps me improve the quality of my communication with such people while writing on demand. Just thought it could be beneficial to thousands of other people like me. Feel free to join the platform if you are one of them. Looking forward to your feedback. submitted by /u/data-gig [link] [comments]  ( 1 min )
    This video is part of my entry to the Future of Life Institute's Worldbuild AI project to envision a positive future with AGI in 2045. I'd love feedback!
    submitted by /u/Turil [link] [comments]  ( 1 min )
    [English Speakers] [Feedback] [Survey] [Populism] [AI] I would greatly appreciate your feedback on my survey
    Hello! For my Bachelor's thesis, I am examining the effects of populism on people's attitudes. Before I send out my survey for data collection I would like to ask for some of your feedback with the hopes of improving my questions and stimulus material. I would greatly appreciate it if you could take 5-7 minutes to fill out this survey and then let me know what you liked and disliked about the study. Thank you for your help! https://uvacommscience.eu.qualtrics.com/jfe/form/SV_3w9GzcvxQ7fWRzo submitted by /u/Psychological_Face69 [link] [comments]  ( 1 min )
    Javascript vs Python for variety of uses (AI, App Development, etc…)
    I currently know Java basics well (I have completed AP CS A), but the language doesn’t completely suit my needs. I am in the summer before starting my bachelor's in Computer Science. Over the next few months, I would like to begin learning artificial intelligence, and I would like to first learn the syntax of the language I’ll use. Along with AI, I’d like to have the freedom to program apps for both iOS and Android, and also maybe get into web development (maybe make basic websites where I can apply my knowledge of AI, create a personal portfolio, etc…). I would like my main language to be easy to use for app and web development as well as AI, so that I can apply what I learn in AI through personal projects. This is really important to me because I was extremely limited while learning Java for school. For example, Java app development kept me stuck with Android, and I couldn't approach AI with Java. I’m leaning more towards JavaScript, as it’s a language I’ve never tried, but I am not completely sure which choice would be most suitable. submitted by /u/obvslynot [link] [comments]  ( 2 min )
    26 Images Created by Dall-E 2
    submitted by /u/kbf_ [link] [comments]
    Nvidia Canvas is still really cool for anyone waiting for DALL E 2
    submitted by /u/BeginningRealistic49 [link] [comments]
    The shell plugin I wrote writes your git commands
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
  • Open

    Overview of XGBoost and Gradient Boosting
    submitted by /u/aidev2040 [link] [comments]
    Machine Learning with Harsh
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    Mission Made Possible: Real-Time Rendering Helps Studio Create Cinematic Battle Between Characters From ‘Diablo Immortal’
    Real-time rendering is helping one studio take virtual production to impossible heights. In their latest project, the creators at Los Angeles-based company Impossible Objects were tasked with depicting an epic battle between characters from the upcoming video game, Diablo Immortal. But the showdown had to take place on the surface of a Google Pixel phone, Read article > The post Mission Made Possible: Real-Time Rendering Helps Studio Create Cinematic Battle Between Characters From ‘Diablo Immortal’ appeared first on NVIDIA Blog.  ( 3 min )
    AI on the Ball: Startup Shoots Computer Vision to the Soccer Pitch
    Eyal Ben-Ari just took his first shot on a goal of bringing professional-class analytics to amateur soccer players. The CEO of startup Track160, in Tel Aviv, has seen his company’s AI-powered sports analytics software tested and used in the big leagues. Now he’s turning his attention to underserved amateurs in the clubs and community teams Read article > The post AI on the Ball: Startup Shoots Computer Vision to the Soccer Pitch appeared first on NVIDIA Blog.  ( 3 min )
    Concept Artist Pablo Muñoz Gómez Enlivens Fantasy Creatures ‘In the NVIDIA Studio’
    Concept artist Pablo Muñoz Gómez dives In the NVIDIA Studio this week, showcasing artwork that depicts a fantastical myth. Gómez, a creator based in Australia, is equally passionate about helping digital artists, teaching 3D classes and running the Zbrush guides website with his creative specialties: concept and character artistry. The post Concept Artist Pablo Muñoz Gómez Enlivens Fantasy Creatures ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Customize pronunciation using lexicons in Amazon Polly
    Amazon Polly is a text-to-speech service that uses advanced deep learning technologies to synthesize natural-sounding human speech. It is used in a variety of use cases, such as contact center systems, delivering conversational user experiences with human-like voices for automated real-time status check, automated account and billing inquiries, and by news agencies like The Washington […]  ( 7 min )
  • Open

    Chemical element abbreviation patterns
    I’ve wondered occasionally about the patterns in how chemical elements are abbreviated. If you don’t know the abbreviation for an element, is there a simple algorithm that would let you narrow the range of possibilities or improve your odds at guessing? Here’s a survey of how the elements are abbreviated. Latin and German The elements […] Chemical element abbreviation patterns first appeared on John D. Cook.  ( 3 min )
  • Open

    How to Use an AI Story Generator to Write Your Stories
    PS: This entire article was written by an AI story generator: Jasper AI.  ( 4 min )
  • Open

    AI Harms are Societal, Not Just Individual
    Not just Individual, but Societal Harms Case Study: Privacy and surveillance Case Study: Disinformation and erosion of trust Individual Harms, Individual Solutions Parallels with Environmental Harms Directions Forward Not just Individual, but Societal Harms When the USA government switched to facial identification service ID.me for unemployment benefits, the software failed to recognize Bill Baine’s face. While the app said that he could have a virtual appointment to be verified instead, he was unable to get through. The screen had a wait time of 2 hours and 47 minutes that never updated, even over the course of weeks. He tried calling various offices, his daughter drove in from out of town to spend a day helping him, and yet he was never able to get a useful human answer on what he was su…  ( 7 min )

  • Open

    [D] How to nominate NeurIPS 2022 reviewer?
    I think last year they had a link to nominate/self-nominate reviewers, but I couldn't find one this year. Do any of you know how to do it? submitted by /u/Deep_forgetting [link] [comments]
    [D] Anyone still using Stochastic Depth?
    I recently studied ways to improve the training time of big neural networks, especially ResNets. Along the way, I could not help but notice the big claims in the paper Deep Networks with Stochastic Depth. To summarize informally, their contribution consists of a new hyperparameter for ResBlocks, which is used to skip the inner part of the residual connection with the given probability (they use 0.5 in their experiments). Quoting from the paper: "Let b ∈ {0, 1} denote a Bernoulli random variable [...] If b = 1, eq. (2) reduces to the original ResNet update and this ResBlock remains unchanged. If b = 0, the ResBlock reduces to the identity function." They do not only claim a big improvement on training time: Following the calculations above, approximately 25% of training time …  ( 2 min )
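    The mechanism quoted above can be sketched on scalars. This is a toy illustration, not the paper's code: `res_block`, `inner`, and `survival_prob` are hypothetical names, and the ReLU that the paper applies after the sum is left out for brevity.

```python
import random


def res_block(x, inner, survival_prob, training, rng=random):
    """Residual block with stochastic depth (toy scalar version).

    inner: the residual branch f(x); survival_prob: probability that b = 1.
    """
    if training:
        # b ~ Bernoulli(survival_prob): keep the branch or fall back to identity.
        if rng.random() < survival_prob:
            return x + inner(x)  # b = 1: ordinary ResNet update
        return x                 # b = 0: identity, inner(x) is never computed
    # At test time every block is active; the branch is scaled by its
    # survival probability so the expected output matches training.
    return x + survival_prob * inner(x)
```

    The training-time saving comes from the `b = 0` case, where the residual branch is skipped entirely for that mini-batch.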
    [R] Survey on Misuse of NLP Research
    We at CopeNLU and the Digital Democracies Institute are currently running an online survey on the potential harms and misuses of Natural Language Processing technologies and research. We, therefore, ask researchers in the field of natural language processing to fill out the following survey to give us an insight into their concerns. We would really appreciate it if you could take a few minutes to fill out the survey. The survey takes about 20 minutes to complete and is available here: copenlu.limesurvey.net/987789 submitted by /u/frimelle [link] [comments]  ( 1 min )
    [P] A quick tip on DataFrame.apply
    I have been using pandas for years in many projects, but I still find it painful to apply a function to multiple columns. So I recently developed a tool for our project to make life easier for data scientists like me.

    The pandas.DataFrame we use every day:

```python
import pandas as pd
import towhee

# create a dataframe
df = pd.DataFrame({'a': range(5)})
#    a
# 0  0
# 1  1
# 2  2
# 3  3
# 4  4
```

    By wrapping the pandas.DataFrame with Towhee, we have runas_op as an alternative to DataFrame.apply:

```python
# data collection is a wrapper for dataframe
dc = towhee.from_df(df)
dc.runas_op['a', 'b'](func=lambda x: x+1)
#    a  b
# 0  0  1
# 1  1  2
# 2  2  3
# 3  3  4
# 4  4  5
```

    runas_op['a', 'b'](func=...) tells towhee to use column a as input and column b as output. For functions with multiple inputs, we can use a tuple to specify the input columns:

```python
dc.runas_op[('a', 'b'), 'c'](func=lambda x, y: x + y)
#    a  b  c
# 0  0  1  1
# 1  1  2  3
# 2  2  3  5
# 3  3  4  7
# 4  4  5  9
```

    We can also use a tuple for multiple outputs:

```python
# apply a multiple output function
dc.runas_op['c', ('d', 'e')](func=lambda x: (x+1, x-1))
#    a  b  c   d  e
# 0  0  1  1   2  0
# 1  1  2  3   4  2
# 2  2  3  5   6  4
# 3  3  4  7   8  6
# 4  4  5  9  10  8
```

    Towhee provides a method-chaining style API, making the code easy to follow:

```python
df = pd.DataFrame({'a': range(5)})
dc = towhee.from_df(df) \
    .runas_op['a', 'b'](func=lambda x: x+1) \
    .runas_op[('a', 'b'), 'c'](func=lambda x, y: x + y) \
    .runas_op['c', ('d', 'e')](func=lambda x: (x+1, x-1))
```

    To convert the data back (as pandas.DataFrame):

```python
new_df = dc.df
type(new_df)
# pandas.core.frame.DataFrame
```

    The project's homepage is https://github.com/towhee-io/towhee, and you can find more about towhee by going through the documents. Would appreciate some feedback and contribution 🙂 submitted by /u/ok-reiase [link] [comments]  ( 2 min )
    [D] Colab GPU Assignment Algorithm
    Does anyone have a sense of how Colab determines which GPU you get? I have a Pro+ subscription, and for a month or two I was regularly getting A100s. As a result I wrote some software that requires 40GB of memory. Now I can't seem to ever get an A100. Why doesn't Colab give any information on this? Additionally, I created a new Colab account hoping to get the 'new user' bump. After 24 hours on an A100, the new account can't even get a V100. submitted by /u/anon135797531 [link] [comments]  ( 1 min )
    [D] ML Datasets with instance interactions
    Hello, I'm looking for datasets that contain the interaction label of 2 instances, in order to test a DL model. Any type of dataset that has lines like (instance A, instance B, interaction label) would be a great fit! I was thinking that I could also "handcraft" such datasets from a classification dataset, where I set the interaction of 2 instances from the same class to 1 and the interaction of 2 instances from different classes to 0. However, I'd like to discover dedicated datasets with interactions instead. If you are aware of such datasets, could you please point me to them? Thank you in advance! submitted by /u/Yuu_Aky [link] [comments]  ( 1 min )
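    The "handcrafted" fallback described in the post takes only a few lines. A minimal sketch, where the name `make_pairs` and the tuple layout are assumptions chosen for illustration:

```python
from itertools import combinations


def make_pairs(instances, labels):
    """Build (instance_i, instance_j, interaction) triples from a classification set.

    interaction is 1 when both instances share a class label, else 0.
    """
    pairs = []
    for (i, xi), (j, xj) in combinations(enumerate(instances), 2):
        pairs.append((xi, xj, 1 if labels[i] == labels[j] else 0))
    return pairs
```

    Note that the number of pairs grows quadratically with the number of instances, and with many classes most pairs have interaction 0, so subsampling the negative pairs is usually worth considering.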
    [N] TorchRL: PyTorch pre-release RL library is here!
    The PyTorch ecosystem team has open-sourced TorchRL, the RL-dedicated PyTorch library. It's still WIP and it hasn't been officially released yet, but it's already good enough to be used in common research settings, including online / offline, on-policy / off-policy, meta-RL and such. It is quite efficient for a series of tasks: for model ensembling and meta-RL it leverages functorch's capabilities. Some functions are highly optimized to run efficiently on CUDA (e.g. TD(lambda) returns). Examples currently include SAC, DDPG, PPO, REDQ and DQN. Let us know what you think of it, issues and PRs are welcome! Buy it, use it, break it, fix it... Doc and tutorials to come soon! submitted by /u/AdCool8270 [link] [comments]  ( 1 min )
    [D] Why do top speech/audio conferences like ICASSP and Interspeech have very high acceptance rates like 46%-48% ?
    I have heard from my fellow Ph.D. students and post-docs that the lower the acceptance rate of a conference, the higher its reputation. I see that this is true for NeurIPS, ICLR, ICML, ACL etc. But ICASSP and Interspeech are considered top conferences in speech/audio applications. So why are their acceptance rates so high? submitted by /u/Far_Conversation_445 [link] [comments]  ( 3 min )
    [D] Using U-Net for 1D segmentation from 2D images?
    Hi all! I am working on building a scanner to digitize film strips using a rectangular area sensor. This means that by using a rectangular sensor, I am lacking the entire context of the film during capture, but still need to align the images properly to the sensor so the resolution can be maximized with automatic capture and film alignment. The issue at hand is then to segment the captured images between "image on the film" and "gap between images" areas. I have tried using traditional CV methods, but issues arise with poorly exposed films, since images can become very clear, pretty much as clear as the gaps between images, with only some sparse image elements visible. This is why I believe that the full context of the 2D image is necessary for segmentation. Since the images captured always have the same orientation, and gaps between images are always vertical, I believe that a 1D output would be sufficient, aka "iiiigggiiiiiii" where "g" is a gap and "i" is the image area. The segmented areas should *always* span the full height of the image and have vertical borders if the output of the segmentation is a 2D image, hence why I believe that a 1D output is enough. Is this something that can be done with a U-Net? I am very much a beginner with this, but seeing how it can be used for outputting 2Dx1mono images from 2Dx3 color inputs, I was wondering if this could be done. Thanks a lot in advance! submitted by /u/iAmTheAlchemist [link] [comments]  ( 2 min )
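    One simple way to get the 1D output described above is to keep a standard 2D segmentation output and collapse it along the height axis. The sketch below assumes a per-pixel "gap" probability map is already available (for example from a U-Net); the function name and the 0.5 threshold are illustrative assumptions. In a trained network, the same reduction could be done by a pooling layer over the height dimension before the loss.

```python
def columns_to_labels(prob_map, threshold=0.5):
    """Collapse a 2D gap-probability map (list of rows) into one label per column.

    Returns a string such as "iiggii": "g" for gap columns, "i" for image columns.
    """
    height = len(prob_map)
    width = len(prob_map[0])
    labels = []
    for col in range(width):
        # Average over the full height, since gaps always span the whole column.
        mean = sum(prob_map[row][col] for row in range(height)) / height
        labels.append("g" if mean >= threshold else "i")
    return "".join(labels)
```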
    [D] Is it possible to submit a pure math paper to NeurIPS?
    I have some theoretical results on a popular algorithm in ML (score-based generative models). But I am from a maths background. All I do is state these results and prove them. The proofs are not extremely deep or involved, but they are rigorous (so maybe a bit pedantic) and would require the reviewer to actually know some stochastic analysis. I do some small numerical experiments on toy data sets in 2 dimensions to illustrate the results. Are there any tips on how I could maximize my chances of such a paper getting published in NeurIPS? Or are my chances very low? submitted by /u/future_gcp_poweruser [link] [comments]  ( 3 min )
    [D] Using Inception and FID scores in training?
    Is it possible to use the Inception and FID scores in the training of a deep image generation model, i.e. to maximize the scores in a loss function, albeit this is "cheating"? If so, has anyone / any paper done it? Thanks for any pointer. submitted by /u/thanrl [link] [comments]  ( 1 min )
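    For reference, the FID compares Gaussian fits to Inception features of real and generated images, with means \mu_r, \mu_g and covariances \Sigma_r, \Sigma_g:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

    Each term is in principle differentiable with respect to the generated features, so it can be written as a loss. The practical catch is that \mu_g and \Sigma_g would have to be estimated from a mini-batch, which is likely to be very noisy for 2048-dimensional features.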
    [News] New Google tech - Geospatial API uses computer vision and machine learning to turn 15 years of street view imagery into a 3d canvas for augmented reality developers
    submitted by /u/imaginfinity [link] [comments]  ( 4 min )
  • Open

    Batch size in Spinning Up's PPO?
    Proximal Policy Optimization — Spinning Up documentation (openai.com) I'm trying to understand the code in SpinningUp's implementation of PPO. It seems that when training the policy and value function networks, they used a batch size of 4000, instead of a minibatch. Does anyone know why? Wouldn't a minibatch behave better? submitted by /u/Traditional-Brother9 [link] [comments]  ( 1 min )
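    For comparison, a minibatch variant would shuffle the collected batch every epoch and take one gradient step per slice. The sketch below shows only the slicing; `iterate_minibatches` is a hypothetical helper and not part of Spinning Up.

```python
import random


def iterate_minibatches(batch, minibatch_size, rng=random):
    """Yield shuffled minibatches that together cover the batch exactly once."""
    indices = list(range(len(batch)))
    rng.shuffle(indices)
    for start in range(0, len(indices), minibatch_size):
        # The last minibatch may be smaller than minibatch_size.
        yield [batch[i] for i in indices[start:start + minibatch_size]]
```

    With 4000 collected steps and a minibatch size of 64, this gives 63 updates per pass over the data instead of a single full-batch step.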
    Entity Gym: A new entity based API for reinforcement learning environments
    submitted by /u/Programmierer [link] [comments]  ( 1 min )
    Can n step learning deal with sparse reward?
    submitted by /u/Professional_Card176 [link] [comments]
    "Emergent bartering behaviour in multi-agent reinforcement learning", Johanson et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    How do you choose the size of your observation space when you create a custom environment?
    Perhaps naive question, but this is the first time that I create a custom env: how do you choose the size of, for example, the RGB image you're passing as an observation to the agent? It could be a 24x24x3 image, or it could be a 200x200x3 one, so based on what principle should I choose it? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    PPO - Log Std as trainable parameter?
    In many implementations of the PPO algorithm I see the STD of the policy distribution implemented as a learnable parameter. For example here: https://intellabs.github.io/coach/components/agents/policy_optimization/ppo.html However, it seems that this parameter does not change very much during training, independently of its start value. I would expect it to converge to some common value no matter how I set it initially. Otherwise it just depends on its initial value, which I find very hard to tune. How do you implement the STD of the policy distribution? And how do you set its initial value? I would be glad for any recommendations... https://preview.redd.it/xhss11tuquz81.png?width=354&format=png&auto=webp&s=1a9b0f9b4240ca61fd05fb130bd0ccc5da85f8b9 submitted by /u/flxh13 [link] [comments]  ( 1 min )
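    To make the role of the parameter concrete, here is a one-dimensional sketch of a Gaussian policy with a state-independent log standard deviation. The function names are illustrative and not taken from any particular library.

```python
import math


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a 1D Gaussian policy parameterized by mean and log_std."""
    std = math.exp(log_std)  # exponentiating keeps std positive for any log_std
    z = (action - mean) / std
    return -0.5 * z * z - log_std - 0.5 * math.log(2 * math.pi)


def gaussian_entropy(log_std):
    """Entropy of a 1D Gaussian; it depends on log_std only, not on the mean."""
    return 0.5 + 0.5 * math.log(2 * math.pi) + log_std
```

    Because the entropy is linear in `log_std`, an entropy bonus pushes the parameter up at a constant rate, while the policy-gradient term pushes it down only when narrower distributions yield higher advantages. If neither signal is strong, the value staying close to its initialization would not be surprising.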
    n-step TD vs λ-return? (performance)
    submitted by /u/Professional_Card176 [link] [comments]
    What's the point of the actor in actor critic
    Since the actor seems to be dictated by the critic, I'm not quite sure what the point of the actor is. Of course the actor carries out the action, but could you not just let the critic decide the action, for example by taking the highest Q value, or by converting the Q values to probabilities if you want a stochastic approach? submitted by /u/Jobdriaan [link] [comments]  ( 2 min )
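    The "convert Q values to probabilities" idea is usually called a Boltzmann (softmax) policy. A minimal sketch for a discrete action set follows; the name `boltzmann_policy` and the temperature parameter are illustrative choices.

```python
import math


def boltzmann_policy(q_values, temperature=1.0):
    """Map Q-values to action probabilities; low temperature approaches argmax."""
    top = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

    This works when the actions can be enumerated. With continuous actions, neither the argmax nor the normalizing sum over Q-values is tractable in general, which is one reason a separate actor is kept.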
    Training an RL model to fit a curve
    I'm attempting to train an RL model to fit a curve and would appreciate your input to improve the performance. TL;DR: If anyone has an open source environment that does something similar to curve fitting, I would appreciate having a look at it! My setup is as follows: We have a true curve y_true, evaluated at regular intervals on some domain x. We also have a method to generate a simulated curve y_sim on the same domain x. The method for simulating the curve depends on a set of parameters. The action space is a box of the same dimension as the number of parameters for the curve simulation method. The parameters are then updated with each action as: params += 0.01*action (the actions range between -1 and 1, so 0.01 makes it such that the actions don't modify the parameters too drastically …  ( 3 min )
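    The update rule above can be written as a single environment step. This is only a sketch of the described setup: the negative mean squared error reward, the `simulate` callback, and the function name `step` are assumptions, since the post is cut off before the reward is defined.

```python
def step(params, action, simulate, y_true, x, step_size=0.01):
    """Apply one action to the curve parameters and score the new fit.

    simulate(xi, params) -> simulated value at xi (hypothetical callback).
    Returns (new_params, reward), where reward is the negative MSE (an assumption).
    """
    clipped = [max(-1.0, min(1.0, a)) for a in action]  # keep actions in [-1, 1]
    new_params = [p + step_size * a for p, a in zip(params, clipped)]
    y_sim = [simulate(xi, new_params) for xi in x]
    mse = sum((ys - yt) ** 2 for ys, yt in zip(y_sim, y_true)) / len(x)
    return new_params, -mse
```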
    TorchRL: PyTorch pre-release RL library is alive!
    The PyTorch ecosystem team has open-sourced TorchRL, the RL-dedicated PyTorch library. It's still WIP and it hasn't been officially released yet, but it's already good enough to be used in common research settings, including online / offline, on-policy / off-policy, meta-RL and such. It is quite efficient for a series of tasks: for model ensembling and meta-RL it leverages functorch's capabilities. Some functions are highly optimized to run efficiently on CUDA (e.g. TD(lambda) returns). Examples currently include SAC, DDPG, PPO, REDQ and DQN. Let us know what you think of it, issues and PRs are welcome! Buy it, use it, break it, fix it... Doc and tutorials to come soon! submitted by /u/AdCool8270 [link] [comments]  ( 2 min )
    Are the unsupervised RL experiments carried out correctly?
    Hi. I recently studied unsupervised RL, which pre-trains agents with task-agnostic rewards and then fine-tunes them on downstream tasks. But every result I got on URLB, the unsupervised RL benchmark, was worse than training from scratch. I could not check all cases because of limited resources, so I wonder what your experience is. I could not get a clear answer from the repository owner and author. Walker run and stand, state-based, action repeat 1, pretrain 2M with batch size 1024, finetune with batch size 256: scratch >> apt-icm. Walker run and stand, pixel-based, action repeat 2, pretrain 2M with batch size 1024, finetune with batch size 256: scratch > apt-icm. Thank you for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
  • Open

    Lessons From Deploying Deep Learning To Production
    submitted by /u/regalalgorithm [link] [comments]
    Last Week in AI: Enzyme developed with AI to decompose plastic, Hugging Face reaches $2 billion valuation, new ambitious EU AI Act, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    Breakthrough Google Deepmind Gato General AI Does 600+ Tasks | AI Robot Arm To Disarm Bombs
    submitted by /u/SlightSituation [link] [comments]
    AI Evolution: The Historic Timeline of AI Milestones
    Most folks think Artificial Intelligence (AI) is a novel notion, although it's been around for a long time. We went back in history and curated a list of all key artificial intelligence breakthroughs that have enabled us to live our current lives. Read more submitted by /u/ridamughal110 [link] [comments]
    AI Accelerators - Hardware For Artificial Intelligence
    CPUs were not as powerful and efficient a few decades ago when it came to running large computations for machine learning. Hardware manufacturers have worked hard to create a processing unit capable of performing any AI operation. Read more submitted by /u/ridamughal110 [link] [comments]
    Inworld AI's Ilya Gelfenbeyn and Kylan Gibbs talked about the tech behind their new developer platform for building AI-driven virtual characters and shared how to integrate their solution into games.
    submitted by /u/80lv [link] [comments]  ( 1 min )
    Can large language models be democratized?
    submitted by /u/bendee983 [link] [comments]
    Is it possible to program AI to detect cringe?
    submitted by /u/SimonCZ2 [link] [comments]
    A short story written by the AI - " Henry The Mighty Time Traveler "
    submitted by /u/Alive_Ad_2882 [link] [comments]
    Is Gato Really the Future of AI?
    submitted by /u/bartturner [link] [comments]
    AI poetry in rhyme proving difficult
    Hi, has anyone managed to get an AI to produce poetry in rhyme? I have tried to teach GPT-3 and Jurassic-1 to rhyme in ABAB style, after giving it around 50 example verses. Here is my prompt: "Human: You are AI, a brilliant AI poet. You write deep, beautiful and meaningful poetry in ABAB style. That means that in each 4 line verse, the last word in line 1 rhymes with the last word in line 3, and the last word in line 2 rhymes with the last word in line 4. Here are some examples of ABAB poems. Study how they are structured. Afterwards I will as you to write some ABAB poems like these:" Then I gave all examples. Then I ended with: "Human: That was all the examples. Please write some nice poetry in ABAB style like those poems. Each verse has to be 4 lines. AI: " Tried various tweaks to the prompt, but it really struggles to comprehend rhyme. I have seen perhaps two lines rhyming here and there, but it might be chance. I am using the largest models. Has anyone had any luck getting the models to rhyme? submitted by /u/Dinosaur-Owl [link] [comments]  ( 1 min )
    AI Implementation roadmap
    Take a look at AI implementation roadmap with real case examples https://www.toolbox.com/tech/artificial-intelligence/guest-article/ai-implementation-what-does-it-take-to-adopt-artificial-intelligence-in-business/#SocialMediaInterests submitted by /u/lklimusheuskaja [link] [comments]
    Craft Album.
    submitted by /u/cookingandcraft [link] [comments]
    How to Leverage Conversational Chatbots to 10x Your E-commerce Sales?
    submitted by /u/mihircontra20 [link] [comments]
    Introduction to OpenCV and Image Processing with Python
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Google maps immersive view - uses AI and computer vision to fuse billions of images with real-time traffic and weather, creating a 3d simulation of the world that shows you the vibe of a place
    submitted by /u/imaginfinity [link] [comments]  ( 2 min )
    Cambridge AI Researchers Propose ‘MAGIC’: A Training-Free Framework That Plugs Visual Controls Into The Generation Of A Language Model
    The release of Generative Pretrained Transformer (GPT-2) drew considerable attention to generative language models (LMs), which are pre-trained on massive amounts of unstructured text and have produced strong results on a variety of NLP applications. LMs can continue a textual prompt by decoding with next-token prediction. Models such as CLIP and ALIGN, pre-trained image-text joint embedding approaches, have revived multimodal representation learning of text and images. Even so, it is challenging to integrate the benefits of pre-trained LMs and image-text embedding models to generate visually grounded text. The traditional approaches are generally limited by the object detectors trained with a fixed set of labels. Currently, the ZeroCap approach is ut…  ( 2 min )
    Are There Any Good Entirely Free Text-to-Image AI Generators that have API's?
    I have used wombo and it works really well, but there is no API for it. Are there any other Free Text-to-Image generators that have an API? submitted by /u/xbftw [link] [comments]
  • Open

    BREAKTHROUGH Google Deepmind Gato General AI Does 600+ Tasks With One Transformer Neural Network
    submitted by /u/getrich_or_diemining [link] [comments]
    Social Media networks as aggregate neural networks to explain political polarization and radicalization?
    With recent USA tragedies I find myself wondering if the advancing science of neural networks and machine learning, deep learning etc.. could be used to model or explain how social networks could function as a sort of aggregate neural network of a classical neural network "learning" and "training" that can also gain new nodes but are "trained" to conform to a certain pathway that literally breeds extremist views and polarization? I searched on JSTOR, etc.. but was wondering if anyone in the community knew someone or some research going on in this vein? submitted by /u/1nvent [link] [comments]  ( 1 min )
    Suggestion Regarding ML(Product Selector/Recommender) task
    Hello All, I am a student working on a B2B project called “Digital Product Selector based on questions”. So for example, if you go to a health care company’s website and you want to find out which product among them suits you best. You would have a set of questions on the website and based upon your answers, it would recommend you a product or products of that specific company. We have a static/rule based algorithm working fine when there are less products and less questions to answer for a specific company. However, for a company which has huge product list and more than 20 set of questions, the algorithm take significant amount of time to produce recommended products since being rule based and stuck in calculations. Now, I want to replace the rule based/static algorithm with any machine learning algorithm that I can train using my data. Please answer the following. 1) Can you please recommend me if there are any pre trained Neural Networks that I can use to address this use case? 2) Also, which problem statement does this use case belongs to? 3) we have recently started to work with AWS, if there are any AWS services available to address this use case, please recommend. submitted by /u/mubashir_ali93 [link] [comments]  ( 1 min )
    Breaking into the black box of artificial intelligence
    submitted by /u/nickb [link] [comments]
  • Open

    Mental secure hash function
    A few years ago I wrote about Manuel Blum’s proposed method for mentally computing a secure hash function. He proposed using this method as a password manager, using the hash of a web site’s name as the password for the site. I first wrote about Blum’s method on the Heidelberg Laureate Forum blog, then wrote […] Mental secure hash function first appeared on John D. Cook.  ( 5 min )
  • Open

    Personalize your machine translation results by using fuzzy matching with Amazon Translate
    A person’s vernacular is part of the characteristics that make them unique. There are often countless different ways to express one specific idea. When a firm communicates with their customers, it’s critical that the message is delivered in a way that best represents the information they’re trying to convey. This becomes even more important when […]  ( 8 min )
  • Open

    FLUTE: A scalable federated learning simulation platform
    Federated learning has become a major area of machine learning (ML) research in recent years due to its versatility in training complex models over massive amounts of data without the need to share that data with a centralized entity. However, despite this flexibility and the amount of research already conducted, it’s difficult to implement due […] The post FLUTE: A scalable federated learning simulation platform appeared first on Microsoft Research.  ( 6 min )
  • Open

    A robust approach for deep neural networks in presence of label noise: relabelling and filtering instances during training. (arXiv:2109.03748v2 [cs.LG] UPDATED)
    Deep learning has outperformed other machine learning algorithms in a variety of tasks, and as a result, it is widely used. However, like other machine learning algorithms, deep learning, and convolutional neural networks (CNNs) in particular, perform worse when the data sets present label noise. Therefore, it is important to develop algorithms that help the training of deep networks and their generalization to noise-free test sets. In this paper, we propose a robust training strategy against label noise, called RAFNI, that can be used with any CNN. This algorithm filters and relabels instances of the training set based on the predictions and their probabilities made by the backbone neural network during the training process. That way, this algorithm improves the generalization ability of the CNN on its own. RAFNI consists of three mechanisms: two mechanisms that filter instances and one mechanism that relabels instances. In addition, it does not suppose that the noise rate is known nor does it need to be estimated. We evaluated our algorithm using different data sets of several sizes and characteristics. We also compared it with state-of-the-art models using the CIFAR10 and CIFAR100 benchmarks under different types and rates of label noise and found that RAFNI achieves better results in most cases.
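    The abstract above describes filtering and relabelling training instances from the network's predicted probabilities, but it does not give the actual rules. The following is only a minimal sketch of the general idea, not RAFNI itself: the function name `filter_and_relabel` and both thresholds are hypothetical, and the real method uses three mechanisms that are not reproduced here.

```python
def filter_and_relabel(labels, probs, relabel_threshold=0.9, filter_threshold=0.1):
    """Return (index, label) pairs to keep for the next training epoch.

    labels: current (possibly noisy) integer labels, one per instance.
    probs:  per-instance class probabilities predicted by the network.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for idx, (label, p) in enumerate(zip(labels, probs)):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != label and p[predicted] >= relabel_threshold:
            # The network confidently disagrees with the label: relabel.
            kept.append((idx, predicted))
        elif p[label] < filter_threshold:
            # The given label looks very unlikely: drop the instance.
            continue
        else:
            # Otherwise keep the instance with its original label.
            kept.append((idx, label))
    return kept
```

    In a training loop, a rule like this would be applied between epochs, with the kept pairs forming the next epoch's training set.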
    Automatic Monitoring of Fruit Ripening Rooms by UHF RFID Sensor Network and Machine Learning. (arXiv:2204.12415v2 [eess.SY] UPDATED)
    Accelerated ripening through the exposure of fruits to controlled environmental conditions and gases is nowadays one of the most assessed food technologies, especially for climacteric and exotic products. However, a fine granularity control of the process and consequently of the quality of the goods is still missing, so the management of the ripening rooms is mainly based on qualitative estimations only. Following the modern paradigms of Industry 4.0, this contribution proposes a non-destructive RFID-based system for the automatic evaluation of the live ripening of avocados. The system, coupled with a properly trained automatic classification algorithm based on Support Vector Machines (SVMs), can discriminate the stage of ripening with an accuracy greater than 85%.
    Intrinsically Motivated Self-supervised Learning in Reinforcement Learning. (arXiv:2106.13970v2 [cs.LG] UPDATED)
    In vision-based reinforcement learning (RL) tasks, it is prevalent to assign auxiliary tasks with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in auxiliary tasks, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
    PerfectDou: Dominating DouDizhu with Perfect Information Distillation. (arXiv:2203.16406v4 [cs.AI] UPDATED)
    As a challenging multi-player card game, DouDizhu has recently drawn much attention for analyzing competition and collaboration in imperfect-information games. In this paper, we propose PerfectDou, a state-of-the-art DouDizhu AI system that dominates the game, in an actor-critic framework with a proposed technique named perfect information distillation. In detail, we adopt a perfect-training-imperfect-execution framework that allows the agents to utilize the global information to guide the training of the policies as if it is a perfect information game and the trained policies can be used to play the imperfect information game during the actual gameplay. To this end, we characterize card and game features for DouDizhu to represent the perfect and imperfect information. To train our system, we adopt proximal policy optimization with generalized advantage estimation in a parallel training paradigm. In experiments we show how and why PerfectDou beats all existing AI programs, and achieves state-of-the-art performance.
    Accelerating Part-Scale Simulation in Liquid Metal Jet Additive Manufacturing via Operator Learning. (arXiv:2202.03665v1 [physics.flu-dyn] CROSS LISTED)
    Predicting part quality for additive manufacturing (AM) processes requires high-fidelity numerical simulation of partial differential equations (PDEs) governing process multiphysics on a scale of minimum manufacturable features. This makes part-scale predictions computationally demanding, especially when they require many small-scale simulations. We consider drop-on-demand liquid metal jetting (LMJ) as an illustrative example of such computational complexity. A model describing droplet coalescence for LMJ may include coupled incompressible fluid flow, heat transfer, and phase change equations. Numerically solving these equations becomes prohibitively expensive when simulating the build process for a full part consisting of thousands to millions of droplets. Reduced-order models (ROMs) based on neural networks (NN) or k-nearest neighbor (kNN) algorithms have been built to replace the original physics-based solver and are computationally tractable for part-level simulations. However, their quick inference capabilities often come at the expense of accuracy, robustness, and generalizability. We apply an operator learning (OL) approach to learn a mapping between initial and final states of the droplet coalescence process for enabling rapid and accurate part-scale build simulation. Preliminary results suggest that OL requires order-of-magnitude fewer data points than a kNN approach and is generalizable beyond the training set while achieving similar prediction error.
    Emotion Intensity and its Control for Emotional Voice Conversion. (arXiv:2201.03967v2 [cs.SD] UPDATED)
    Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
    Uncertify: Attacks Against Neural Network Certification. (arXiv:2108.11299v3 [cs.LG] UPDATED)
    A key concept towards reliable, robust, and safe AI systems is the idea to implement fallback strategies when predictions of the AI cannot be trusted. Certifiers for neural networks have made great progress towards provable robustness guarantees against evasion attacks using adversarial examples. These methods guarantee for some predictions that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions without guarantees, the method abstains from making a prediction and a fallback strategy needs to be invoked, which is typically more costly, less accurate, or even involves a human operator. While this is a key concept towards safe and secure AI, we show for the first time that this strategy comes with its own security risks, as such fallback strategies can be deliberately triggered by an adversary. In particular, we conduct the first systematic analysis of training-time attacks against certifiers in practical application pipelines, identifying new threat vectors that can be exploited to degrade the overall system. Using these insights, we design two backdoor attacks against network certifiers, which can drastically reduce certified robustness. For example, adding 1% poisoned data during training is sufficient to reduce certified robustness by up to 95 percentage points, effectively rendering the certifier useless. We analyze how such novel attacks can compromise the overall system's integrity or availability. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound. (arXiv:2201.02639v4 [cs.CV] UPDATED)
    As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
    Neurochaos Feature Transformation and Classification for Imbalanced Learning. (arXiv:2205.06742v1 [cs.NE])
    Learning from limited and imbalanced data is a challenging problem in the Artificial Intelligence community. Real-time scenarios demand decision-making from rare events wherein the data are typically imbalanced. These situations commonly arise in medical applications, cybersecurity, catastrophic predictions, etc. This motivates the development of learning algorithms capable of learning from imbalanced data. The human brain effortlessly learns from imbalanced data. Inspired by the chaotic neuronal firing in the human brain, a novel learning algorithm, Neurochaos Learning (NL), was recently proposed. NL is organized in three blocks: Feature Transformation, Neurochaos Feature Extraction (CFX), and Classification. In this work, the efficacy of neurochaos feature transformation and extraction for classification in imbalanced learning is studied. We propose a unique combination of neurochaos based feature transformation and extraction with traditional ML algorithms. The explored datasets in this study revolve around medical diagnosis, banknote fraud detection, environmental applications and spoken-digit classification. In this study, experiments are performed in both high and low training sample regimes. In the former, five out of nine datasets have shown a performance boost in terms of macro F1-score after using CFX features. The highest performance boost obtained is 25.97% for the Statlog (Heart) dataset using CFX+Decision Tree. In the low training sample regime (from just one to nine training samples per class), the highest performance boost of 144.38% is obtained for the Haberman's Survival dataset using CFX+Random Forest. NL offers enormous flexibility of combining CFX with any ML classifier to boost its performance, especially for learning tasks with limited and imbalanced data.
    Scaling the weight parameters in Markov logic networks and relational logistic regression models. (arXiv:2103.15140v2 [cs.AI] UPDATED)
    We consider Markov logic networks and relational logistic regression as two fundamental representation formalisms in statistical relational artificial intelligence that use weighted formulas in their specification. However, Markov logic networks are based on undirected graphs, while relational logistic regression is based on directed acyclic graphs. We show that when scaling the weight parameters with the domain size, the asymptotic behaviour of a relational logistic regression model is transparently controlled by the parameters, and we supply an algorithm to compute asymptotic probabilities. We also show using two examples that this is not true for Markov logic networks. We also discuss using several examples, mainly from the literature, how the application context can help the user to decide when such scaling is appropriate and when using the raw unscaled parameters might be preferable. We highlight random sampling as a particularly promising area of application for scaled models and expound possible avenues for further research.
    No Weighted-Regret Learning in Adversarial Bandits with Delays. (arXiv:2103.04550v2 [cs.LG] UPDATED)
    Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
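    For readers unfamiliar with EXP3, the sketch below shows the standard exponential-weights bandit update with importance-weighted cost estimates. It is an illustration under simplifying assumptions: feedback arrives immediately (no delays $d_t$), the horizon is known, and the learning rate is supplied by the caller, so it does not implement the delayed-feedback analysis or the doubling trick described in the abstract. The names `exp3` and `cost_fn` are hypothetical.

```python
import math
import random


def exp3(num_arms, num_rounds, cost_fn, eta, rng):
    """Standard EXP3 for costs in [0, 1] with immediate bandit feedback.

    cost_fn(t, arm) returns the cost of the arm played in round t.
    Returns the total incurred cost and the last sampling distribution.
    """
    weights = [1.0] * num_arms
    total_cost = 0.0
    probs = [1.0 / num_arms] * num_arms
    for t in range(num_rounds):
        total_w = sum(weights)
        probs = [w / total_w for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        cost = cost_fn(t, arm)
        total_cost += cost
        # Importance-weighted estimate: unbiased for the full cost vector,
        # because only the played arm's cost is observed.
        estimate = cost / probs[arm]
        weights[arm] *= math.exp(-eta * estimate)
        # Rescale so the largest weight is 1, purely for numerical stability.
        top = max(weights)
        weights = [w / top for w in weights]
    return total_cost, probs
```

    With $K$ arms and $T$ rounds, a learning rate on the order of $\sqrt{\log K / (KT)}$ is a common textbook choice in the no-delay setting; the rates used in the paper to handle delays may differ.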
    secml: A Python Library for Secure and Explainable Machine Learning. (arXiv:1912.10013v2 [cs.LG] UPDATED)
    We present secml, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, secml provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. secml also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at https://github.com/pralab/secml.
    On the validity of pre-trained transformers for natural language processing in the software engineering domain. (arXiv:2109.04738v2 [cs.SE] UPDATED)
    Transformers are the current state-of-the-art of natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding regarding the validity of transformers within the software engineering domain, i.e., how good such models are at understanding words and sentences within a software engineering context and how this improves the state-of-the-art. Within this article, we shed light on this complex, but crucial issue. We compare BERT transformer models trained with software engineering data with transformers based on general domain data in multiple dimensions: their vocabulary, their ability to understand which words are missing, and their performance in classification tasks. Our results show that for tasks that require understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, also within the software engineering domain.
    Research on the correlation between text emotion mining and stock market based on deep learning. (arXiv:2205.06675v1 [q-fin.ST])
    This paper discusses how to crawl data from financial forums such as stock bars and conduct sentiment analysis with a deep learning model. The paper trains a BERT model on the financial corpus and uses it to predict the Shenzhen stock index. Through a comparative study of the maximal information coefficient (MIC), it is found that the emotional characteristics obtained by applying the BERT model to the financial corpus are reflected in the fluctuation of the stock market, which helps improve prediction accuracy. At the same time, this paper combines deep learning with financial texts to further explore the mechanism by which investor sentiment affects the stock market, which will help national regulatory authorities and policy departments formulate more reasonable policies and guidelines for maintaining the stability of the stock market.
    Artificial Intelligence-Assisted Optimization and Multiphase Analysis of Polygon PEM Fuel Cells. (arXiv:2205.06768v1 [cs.NE])
    This article presents new PEM fuel cell models with hexagonal and pentagonal designs. After observing cell performance improvement in these models, we optimized them. Inlet pressure and temperature were used as input parameters, and consumption and output power were the target parameters of the multi-objective optimization algorithm. Then we used artificial intelligence techniques, including deep neural networks and polynomial regression, to model the data. Next, we employed the Response Surface Method (RSM) to derive the target functions. Furthermore, we applied the NSGA-II multi-objective genetic algorithm to optimize the targets. Compared to the base model (Cubic), the optimized Pentagonal and Hexagonal models increase the output current density by 21.819% and 39.931% on average, respectively.
    Principal-Agent Hypothesis Testing. (arXiv:2205.06812v1 [cs.GT])
    Consider the relationship between the FDA (the principal) and a pharmaceutical company (the agent). The pharmaceutical company wishes to sell a product to make a profit, and the FDA wishes to ensure that only efficacious drugs are released to the public. The efficacy of the drug is not known to the FDA, so the pharmaceutical company must run a costly trial to prove efficacy to the FDA. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested pharmaceutical company; a lower standard of statistical evidence incentivizes the pharmaceutical company to run more trials for drugs that are less likely to be effective, since the drug may pass the trial by chance, resulting in large profits. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial to understanding this system and designing protocols with high social utility. In this work, we discuss how the principal and agent can enter into a contract with payoffs based on statistical evidence. When there is stronger evidence for the quality of the product, the principal allows the agent to make a larger profit. We show how to design contracts that are robust to an agent's strategic actions, and derive the optimal contract in the presence of strategic behavior.
    Kronecker Decomposition for Knowledge Graph Embeddings. (arXiv:2205.06560v1 [cs.LG])
    Knowledge graph embedding research has mainly focused on learning continuous representations of entities and relations tailored towards the link prediction problem. Recent results indicate an ever increasing predictive ability of current approaches on benchmark datasets. However, this effectiveness often comes at the cost of over-parameterization and increased computational complexity. The former induces extensive hyperparameter optimization to mitigate malicious overfitting. The latter magnifies the importance of winning the hardware lottery. Here, we investigate a remedy for the first problem. We propose a technique based on Kronecker decomposition to reduce the number of parameters in a knowledge graph embedding model, while retaining its expressiveness. Through Kronecker decomposition, large embedding matrices are split into smaller embedding matrices during the training process. Hence, embeddings of knowledge graphs are not plainly retrieved but reconstructed on the fly. The decomposition ensures that elementwise interactions between three embedding vectors are extended with interactions within each embedding vector. This implicitly reduces redundancy in embedding vectors and encourages feature reuse. To quantify the impact of applying Kronecker decomposition on embedding matrices, we conduct a series of experiments on benchmark datasets. Our experiments suggest that applying Kronecker decomposition on embedding matrices leads to an improved parameter efficiency on all benchmark datasets. Moreover, empirical evidence suggests that reconstructed embeddings entail robustness against noise in the input knowledge graph. To foster reproducible research, we provide an open-source implementation of our approach, including training and evaluation scripts as well as pre-trained models in our knowledge graph embedding framework (https://github.com/dice-group/dice-embeddings).
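    The parameter saving from a Kronecker factorization can be seen in a toy example. The sketch below assumes, purely for illustration, that a full embedding table is reconstructed as the Kronecker product of two small factor matrices; how the paper actually factorizes and trains its embeddings is not specified in the abstract, and the helper `kron` is a hypothetical name, not part of the dice-embeddings framework.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists.

    Entry [i * rows_b + k][j * cols_b + l] equals a[i][j] * b[k][l].
    """
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


# Two small factors (stored parameters): 2x2 and 3x2.
factor_a = [[1, 2], [3, 4]]
factor_b = [[0, 1], [1, 0], [2, 2]]

# Reconstructed on the fly: a 6x4 table that is never stored explicitly.
table = kron(factor_a, factor_b)

stored = sum(len(r) for r in factor_a) + sum(len(r) for r in factor_b)
full = sum(len(r) for r in table)
```

    Here the two factors hold 10 numbers while the reconstructed table has 24 entries; the gap widens quickly as the factor shapes grow, since the table size is the product of the factor sizes.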
    Graph Attention Networks for Channel Estimation in RIS-assisted Satellite IoT Communications. (arXiv:2104.00735v3 [cs.NI] UPDATED)
    Direct-to-satellite (DtS) communication has gained importance recently to support globally connected Internet of things (IoT) networks. However, relatively long distances of densely deployed satellite networks around the Earth cause a high path loss. In addition, since high complexity operations such as beamforming, tracking and equalization have to be performed partially in IoT devices, both the hardware complexity and the need for high-capacity batteries of IoT devices increase. Reconfigurable intelligent surfaces (RISs) have the potential to increase energy efficiency and to perform complex signal processing over the transmission environment instead of in IoT devices. However, RISs need information about the cascaded channel in order to change the phase of the incident signal. This study evaluates the pilot signal as a graph and incorporates this information into graph attention networks (GATs) to track the phase relation through pilot signaling. The proposed GAT-based channel estimation method examines the performance of DtS IoT networks for different RIS configurations to solve the challenging channel estimation problem. It is shown that the proposed GAT both demonstrates a higher performance with increased robustness under changing conditions and has lower computational complexity compared to conventional deep learning methods. Moreover, bit error rate performance is investigated for RIS designs with discrete and non-uniform phase shifts under channel estimation based on the proposed method. One of the findings in this study is that the channel models of the operating environment and the performance of the channel estimation method must be considered during RIS design to exploit performance improvement as far as possible.
    On the Importance of Architecture and Feature Selection in Differentially Private Machine Learning. (arXiv:2205.06720v1 [cs.CR])
    We study a pitfall in the typical workflow for differentially private machine learning. The use of differentially private learning algorithms in a "drop-in" fashion -- without accounting for the impact of differential privacy (DP) noise when choosing what feature engineering operations to use, what features to select, or what neural network architecture to use -- yields overly complex and poorly performing models. In other words, by anticipating the impact of DP noise, a simpler and more accurate alternative model could have been trained for the same privacy guarantee. We systematically study this phenomenon through theory and experiments. On the theory front, we provide an explanatory framework and prove that the phenomenon arises naturally from the addition of noise to satisfy differential privacy. On the experimental front, we demonstrate how the phenomenon manifests in practice using various datasets, types of models, tasks, and neural network architectures. We also analyze the factors that contribute to the problem and distill our experimental insights into concrete takeaways that practitioners can follow when training models with differential privacy. Finally, we propose privacy-aware algorithms for feature selection and neural network architecture search. We analyze their differential privacy properties and evaluate them empirically.
    Explaining by Removing: A Unified Framework for Model Explanation. (arXiv:2011.14878v2 [cs.LG] UPDATED)
    Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
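    The three dimensions named in the abstract above can be illustrated with a minimal sketch. It is not one of the 26 methods the paper unifies, only a simple leave-one-out variant written under stated assumptions: features are "removed" by substituting a baseline value, the explained behavior is a single prediction, and the summary is the resulting prediction difference. The names `removal_attributions`, `linear_model`, and the zero baseline are hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence


def removal_attributions(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Score each feature by the prediction change when it alone is removed."""
    full_prediction = model(x)
    scores = []
    for i in range(len(x)):
        # Dimension 1 (removal): replace feature i with its baseline value.
        # This substitution rule is an assumption; other rules are possible.
        x_removed = list(x)
        x_removed[i] = baseline[i]
        # Dimension 2 (behavior): the model's prediction for this one input.
        # Dimension 3 (summary): the difference caused by removing feature i.
        scores.append(full_prediction - model(x_removed))
    return scores


def linear_model(features: Sequence[float]) -> float:
    # Hypothetical toy model: the third feature has no influence at all.
    return 2.0 * features[0] - 1.0 * features[1] + 0.0 * features[2]


scores = removal_attributions(
    linear_model, x=[1.0, 3.0, 5.0], baseline=[0.0, 0.0, 0.0]
)
print(scores)
```

    For a linear model and a zero baseline, each score equals the feature's weight times its value, so the unused third feature receives a score of zero.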
    Variational Hyper-Encoding Networks. (arXiv:2005.08482v2 [stat.ML] UPDATED)
    We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters \theta are drawn from a distribution p(\theta) which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters \theta into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution q(\theta). HyperVAE can encode the parameters \theta in full, in contrast to common hyper-network practice, which generates only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
    Distributed Transmission Control for Wireless Networks using Multi-Agent Reinforcement Learning. (arXiv:2205.06800v1 [cs.LG])
    We examine the problem of transmission control, i.e., when to transmit, in distributed wireless communications networks through the lens of multi-agent reinforcement learning. Most other works using reinforcement learning to control or schedule transmissions use some centralized control mechanism, whereas our approach is fully distributed. Each transmitter node is an independent reinforcement learning agent and does not have direct knowledge of the actions taken by other agents. We consider the case where only a subset of agents can successfully transmit at a time, so each agent must learn to act cooperatively with other agents. An agent may decide to transmit a certain number of steps into the future, but this decision is not communicated to the other agents, so it is the task of the individual agents to attempt to transmit at appropriate times. We achieve this collaborative behavior through studying the effects of different action spaces. We are agnostic to the physical layer, which makes our approach applicable to many types of networks. We submit that approaches similar to ours may be useful in other domains that use multi-agent reinforcement learning with independent agents.
    Goal-Guided Neural Cellular Automata: Learning to Control Self-Organising Systems. (arXiv:2205.06806v1 [cs.NE])
    Inspired by cellular growth and self-organization, Neural Cellular Automata (NCAs) have been capable of "growing" artificial cells into images, 3D structures, and even functional machines. NCAs are flexible and robust computational systems but -- similarly to many other self-organizing systems -- inherently uncontrollable during and after their growth process. We present an approach to control these types of systems called Goal-Guided Neural Cellular Automata (GoalNCA), which leverages goal encodings to control cell behavior dynamically at every step of cellular growth. This approach enables the NCA to continually change behavior, and in some cases, generalize its behavior to unseen scenarios. We also demonstrate the robustness of the NCA with its ability to preserve task performance, even when only a portion of cells receive goal information.
    Provably Safe Reinforcement Learning: A Theoretical and Experimental Comparison. (arXiv:2205.06750v1 [cs.LG])
    Ensuring safety of reinforcement learning (RL) algorithms is crucial for many real-world tasks. However, vanilla RL does not guarantee safety for an agent. In recent years, several methods have been proposed to provide safety guarantees for RL. To the best of our knowledge, there is no comprehensive comparison of these provably safe RL methods. We therefore introduce a categorization for existing provably safe RL methods, and present the theoretical foundations for both continuous and discrete action spaces. Additionally, we evaluate provably safe RL on an inverted pendulum. In the experiments, it is shown that indeed only provably safe RL methods guarantee safety.
    Precise Change Point Detection using Spectral Drift Detection. (arXiv:2205.06507v1 [cs.LG])
    The notion of concept drift refers to the phenomenon that the data generating distribution changes over time; as a consequence machine learning models may become inaccurate and need adjustment. In this paper we consider the problem of detecting those change points in unsupervised learning. Many unsupervised approaches rely on the discrepancy between the sample distributions of two time windows. This procedure is noisy for small windows, hence prone to induce false positives and not able to deal with more than one drift event in a window. In this paper we rely on structural properties of drift induced signals, which use spectral properties of kernel embedding of distributions. Based thereon we derive a new unsupervised drift detection algorithm, investigate its mathematical properties, and demonstrate its usefulness in several experiments.
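    The two-window comparison that the abstract above describes as the common baseline can be sketched as follows. This is not the spectral method the paper proposes. It assumes the maximum mean discrepancy (MMD) with a Gaussian kernel as the discrepancy measure, which the abstract does not specify; the function names, the bandwidth, and the example windows are hypothetical.

```python
import math
from typing import Sequence


def rbf_kernel(a: float, b: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two scalar observations."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(
    window_a: Sequence[float],
    window_b: Sequence[float],
    bandwidth: float = 1.0,
) -> float:
    """Biased estimate of the squared MMD between two 1-D sample windows."""

    def mean_kernel(xs: Sequence[float], ys: Sequence[float]) -> float:
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(window_a, window_a)
        + mean_kernel(window_b, window_b)
        - 2.0 * mean_kernel(window_a, window_b)
    )


# Hypothetical stream: the second window is shifted, simulating a drift event.
before = [0.0, 0.1, 0.2]
after = [5.0, 5.1, 5.2]

no_drift_score = mmd_squared(before, before)
drift_score = mmd_squared(before, after)
print(no_drift_score, drift_score)
```

    A detector of this kind would flag drift when the score exceeds a threshold. With windows this small the estimate is noisy, which is the weakness the abstract points out.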
    Transformation-Interaction-Rational Representation for Symbolic Regression. (arXiv:2205.06807v1 [cs.NE])
    Symbolic Regression searches for a function form that approximates a dataset, often using Genetic Programming. Since there is usually no restriction on what form the function can have, Genetic Programming may return a hard-to-understand model due to non-linear function chaining or long expressions. A novel representation called Interaction-Transformation was recently proposed to alleviate this problem. In this representation, the function form is restricted to an affine combination of terms generated as the application of a single univariate function to the interaction of selected variables. This representation obtained competitive solutions on standard benchmarks. Despite the initial success, a broader set of benchmarking functions revealed the limitations of the constrained representation. In this paper we propose an extension to this representation, called Transformation-Interaction-Rational representation, that defines a new function form as the ratio of two Interaction-Transformation functions. Additionally, the target variable can also be transformed with a univariate function. The main goal is to improve the approximation power while still constraining the overall complexity of the expression. We tested this representation with a standard Genetic Programming with crossover and mutation. The results show a great improvement when compared to its predecessor and a state-of-the-art performance for a large benchmark.
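    The function forms described in the abstract above can be sketched in a few lines. The sketch rests on assumptions the abstract does not state: an "interaction" is taken to be a product of integer powers of the input variables, the denominator is written as 1 + q(x), and the target transformation is applied by passing the ratio through its inverse. All names (`interaction`, `it_expression`, `tir_expression`) are hypothetical, and nothing here covers the Genetic Programming search itself.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One term: (weight, univariate function, one integer exponent per variable).
Term = Tuple[float, Callable[[float], float], Sequence[int]]
# One Interaction-Transformation expression: (bias, list of terms).
ITExpression = Tuple[float, List[Term]]


def interaction(x: Sequence[float], exponents: Sequence[int]) -> float:
    """Product of the variables raised to the given powers (assumed form)."""
    result = 1.0
    for value, power in zip(x, exponents):
        result *= value ** power
    return result


def it_expression(x: Sequence[float], bias: float, terms: List[Term]) -> float:
    """Affine combination of univariate functions applied to interactions."""
    return bias + sum(w * t(interaction(x, exps)) for w, t, exps in terms)


def tir_expression(
    x: Sequence[float],
    numerator: ITExpression,
    denominator: ITExpression,
    inverse_target_transform: Callable[[float], float],
) -> float:
    """Ratio of two IT expressions, mapped back to the target scale."""
    p = it_expression(x, *numerator)
    q = it_expression(x, *denominator)
    # The "1 +" in the denominator is an assumption of this sketch.
    return inverse_target_transform(p / (1.0 + q))


# Hypothetical example: y = 2*x0*x1 / (1 + sqrt(x0^2)), no target transform.
numerator = (0.0, [(2.0, lambda z: z, (1, 1))])
denominator = (0.0, [(1.0, math.sqrt, (2, 0))])
y = tir_expression([2.0, 3.0], numerator, denominator, lambda z: z)
print(y)
```

    Each expression stays a flat list of terms, which is how the representation limits function chaining and keeps the overall expression short.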
    Interlock-Free Multi-Aspect Rationalization for Text Classification. (arXiv:2205.06756v1 [cs.CL])
    Explanation is important for text classification tasks. One prevalent type of explanation is rationales, which are text snippets of input text that suffice to yield the prediction and are meaningful to humans. A lot of research on rationalization has been based on the selective rationalization framework, which has recently been shown to be problematic due to the interlocking dynamics. In this paper, we address the interlocking problem in the multi-aspect setting, where we aim to generate multiple rationales for multiple outputs. More specifically, we propose a multi-stage training method incorporating an additional self-supervised contrastive loss that helps to generate more semantically diverse rationales. Empirical results on the beer review dataset show that our method significantly improves rationalization performance.
    Embodied-Symbolic Contrastive Graph Self-Supervised Learning for Molecular Graphs. (arXiv:2205.06783v1 [cs.LG])
    Dual embodied-symbolic concept representations are the foundation for deep learning and symbolic AI integration. We discuss the use of dual embodied-symbolic concept representations for molecular graph representation learning, specifically with exemplar-based contrastive self-supervised learning (SSL). The embodied representations are learned from molecular graphs, and the symbolic representations are learned from the corresponding Chemical knowledge graph (KG). We use the Chemical KG to enhance molecular graphs with symbolic (semantic) knowledge and generate their augmented molecular graphs. We treat a molecular graph and its semantically augmented molecular graph as exemplars of the same semantic class, and use the pairs as positive pairs in exemplar-based contrastive SSL.
    Self-Sampling for Neural Point Cloud Consolidation. (arXiv:2008.06471v3 [cs.GR] UPDATED)
    We introduce a novel technique for neural point cloud consolidation which learns from only the input point cloud. Unlike other point upsampling methods which analyze shapes via local patches, in this work, we learn from global subsets. We repeatedly self-sample the input point cloud with global subsets that are used to train a deep neural network. Specifically, we define source and target subsets according to the desired consolidation criteria (e.g., generating sharp points or points in sparse regions). The network learns a mapping from source to target subsets, and implicitly learns to consolidate the point cloud. During inference, the network is fed with random subsets of points from the input, which it displaces to synthesize a consolidated point set. We leverage the inductive bias of neural networks to eliminate noise and outliers, a notoriously difficult problem in point cloud consolidation. The shared weights of the network are optimized over the entire shape, learning non-local statistics and exploiting the recurrence of local-scale geometries. Specifically, the network encodes the distribution of the underlying shape surface within a fixed set of local kernels, which results in the best explanation of the underlying shape surface. We demonstrate the ability to consolidate point sets from a variety of shapes, while eliminating outliers and noise.
    Exploring the structure-property relations of thin-walled, 2D extruded lattices using neural networks. (arXiv:2205.06761v1 [cs.LG])
    This paper investigates the structure-property relations of thin-walled lattices under dynamic longitudinal compression, characterized by their cross-sections and heights. These relations elucidate the interactions of different geometric features of a design on mechanical response, including energy absorption. We proposed a combinatorial, key-based design system to generate different lattice designs and used the finite element method to simulate their response with the Johnson-Cook material model. Using an autoencoder, we encoded the cross-sectional images of the lattices into latent design feature vectors, which were supplied to the neural network model to generate predictions. The trained models can accurately predict lattice energy absorption curves in the key-based design system and can be extended to new designs outside of the system via transfer learning.
    Emergent Bartering Behaviour in Multi-Agent Reinforcement Learning. (arXiv:2205.06760v1 [cs.AI])
    Advances in artificial intelligence often stem from the development of new environments that abstract real-world situations into a form where research can be done conveniently. This paper contributes such an environment based on ideas inspired by elementary Microeconomics. Agents learn to produce resources in a spatially complex world, trade them with one another, and consume those that they prefer. We show that the emergent production, consumption, and pricing behaviors respond to environmental conditions in the directions predicted by supply and demand shifts in Microeconomics. We also demonstrate settings where the agents' emergent prices for goods vary over space, reflecting the local abundance of goods. After the price disparities emerge, some agents then discover a niche of transporting goods between regions with different prevailing prices -- a profitable strategy because they can buy goods where they are cheap and sell them where they are expensive. Finally, in a series of ablation experiments, we investigate how choices in the environmental rewards, bartering actions, agent architecture, and ability to consume tradable goods can either aid or inhibit the emergence of this economic behavior. This work is part of the environment development branch of a research program that aims to build human-like artificial general intelligence through multi-agent interactions in simulated societies. By exploring which environment features are needed for the basic phenomena of elementary microeconomics to emerge automatically from learning, we arrive at an environment that differs from those studied in prior multi-agent reinforcement learning work along several dimensions. For example, the model incorporates heterogeneous tastes and physical abilities, and agents negotiate with one another as a grounded form of communication.
    Imaging Conductivity from Current Density Magnitude using Neural Networks. (arXiv:2204.02441v3 [math.NA] UPDATED)
    Conductivity imaging represents one of the most important tasks in medical imaging. In this work we develop a neural network based reconstruction technique for imaging the conductivity from the magnitude of the internal current density. It is achieved by formulating the problem as a relaxed weighted least-gradient problem, and then approximating its minimizer by standard fully connected feedforward neural networks. We derive bounds on two components of the generalization error, i.e., approximation error and statistical error, explicitly in terms of properties of the neural networks (e.g., depth, total number of parameters, and the bound of the network parameters). We illustrate the performance and distinct features of the approach on several numerical experiments. Numerically, it is observed that the approach enjoys remarkable robustness with respect to the presence of data noise.
    Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes. (arXiv:2205.06621v1 [cs.CL])
    To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than those from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regard to gender, but not race.
    Learning Keypoints from Synthetic Data for Robotic Cloth Folding. (arXiv:2205.06714v1 [cs.RO])
    Robotic cloth manipulation is challenging due to its deformability, which makes determining its full state infeasible. However, for cloth folding, it suffices to know the position of a few semantic keypoints. Convolutional neural networks (CNN) can be used to detect these keypoints, but require large amounts of annotated data, which is expensive to collect. To overcome this, we propose to learn these keypoint detectors purely from synthetic data, enabling low-cost data collection. In this paper, we procedurally generate images of towels and use them to train a CNN. We evaluate the performance of this detector for folding towels on a unimanual robot setup and find that the grasp and fold success rates are 77% and 53%, respectively. We conclude that learning keypoint detectors from synthetic data for cloth folding and related tasks is a promising research direction, discuss some failures and relate them to future work. A video of the system, as well as the codebase, more details on the CNN architecture and the training setup can be found at https://github.com/tlpss/workshop-icra-2022-cloth-keypoints.git.
    Federated Learning Under Intermittent Client Availability and Time-Varying Communication Constraints. (arXiv:2205.06730v1 [cs.LG])
    Federated learning systems facilitate training of global models in settings where potentially heterogeneous data is distributed across a large number of clients. Such systems operate in settings with intermittent client availability and/or time-varying communication constraints. As a result, the global models trained by federated learning systems may be biased towards clients with higher availability. We propose F3AST, an unbiased algorithm that dynamically learns an availability-dependent client selection strategy which asymptotically minimizes the impact of client-sampling variance on the global model convergence, enhancing performance of federated learning. The proposed algorithm is tested in a variety of settings for intermittently available clients under communication constraints, and its efficacy is demonstrated on synthetic data and realistically federated benchmarking experiments using CIFAR100 and Shakespeare datasets. We show up to 186% and 8% accuracy improvements over FedAvg, and 8% and 7% over FedAdam on CIFAR100 and Shakespeare, respectively.
    EyeDAS: Securing Perception of Autonomous Cars Against the Stereoblindness Syndrome. (arXiv:2205.06765v1 [cs.LG])
    The ability to detect whether an object is a 2D or 3D object is extremely important in autonomous driving, since a detection error can have life-threatening consequences, endangering the safety of the driver, passengers, pedestrians, and others on the road. Methods proposed to distinguish between 2D and 3D objects (e.g., liveness detection methods) are not suitable for autonomous driving, because they are object dependent or do not consider the constraints associated with autonomous driving (e.g., the need for real-time decision-making while the vehicle is moving). In this paper, we present EyeDAS, a novel few-shot learning-based method aimed at securing an object detector (OD) against the threat posed by the stereoblindness syndrome (i.e., the inability to distinguish between 2D and 3D objects). We evaluate EyeDAS's real-time performance using 2,000 objects extracted from seven YouTube video recordings of street views taken by a dash cam from the driver's seat perspective. When applying EyeDAS to seven state-of-the-art ODs as a countermeasure, EyeDAS was able to reduce the 2D misclassification rate from 71.42-100% to 2.4% with a 3D misclassification rate of 0% (TPR of 1.0). We also show that EyeDAS outperforms the baseline method and achieves an AUC of over 0.999 and a TPR of 1.0 with an FPR of 0.024.
    Univariate and Multivariate LSTM Model for Short-Term Stock Market Prediction. (arXiv:2205.06673v1 [q-fin.ST])
    Designing robust and accurate prediction models has long been an active research area. While proponents of well-functioning markets believe that it is difficult to accurately predict market prices, many scholars disagree. Robust and accurate prediction systems will be helpful not only to businesses but also to individuals making financial investments. This paper presents an LSTM model with two different input approaches for predicting the short-term stock prices of two Indian companies, Reliance Industries and Infosys Ltd. Ten years of historical data (2012-2021) are taken from the Yahoo Finance website to analyze the proposed approaches. In the first approach, the closing prices of the two selected companies are fed directly to a univariate LSTM model. In the second approach, technical indicator values are calculated from the closing prices and then fed collectively to a multivariate LSTM model. Short-term market behaviour for the upcoming days is evaluated. Experimental outcomes reveal that the first approach is useful for determining the future trend, but the multivariate LSTM model with technical indicators is found to be useful for accurately predicting future price behaviour.
    Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets. (arXiv:2205.06595v1 [stat.ML])
    Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
    A Vision Inspired Neural Network for Unsupervised Anomaly Detection in Unordered Data. (arXiv:2205.06716v1 [cs.LG])
    A fundamental problem in the field of unsupervised machine learning is the detection of anomalies corresponding to rare and unusual observations of interest, whether for their rejection, accommodation or further investigation. Anomalies are intuitively understood to be something unusual or inconsistent, whose occurrence sparks immediate attention. More formally, anomalies are those observations, under appropriate random variable modelling, whose expectation of occurrence with respect to a grouping of prior interest is less than one; such a definition and understanding has been used to develop the parameter-free perception anomaly detection algorithm. The present work seeks to establish important and practical connections between the approach used by the perception algorithm and prior decades of research in neurophysiology and computational neuroscience, particularly that of information processing in the retina and visual cortex. The algorithm is conceptualised as a neuron model which forms the kernel of an unsupervised neural network that learns to signal unexpected observations as anomalies. Both the network and neuron display properties observed in biological processes, including: immediate intelligence; parallel processing; redundancy; global degradation; contrast invariance; parameter-free computation; dynamic thresholds; and non-linear processing. A robust and accurate model for anomaly detection in univariate and multivariate data is built using this network as a concrete application.
    Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction. (arXiv:2205.06672v1 [cs.CV])
    Classical multiple instance learning (MIL) methods are often based on the assumption that instances are independent and identically distributed, hence neglecting the potentially rich contextual information beyond individual entities. On the other hand, Transformers with global self-attention modules have been proposed to model the interdependencies among all instances. However, in this paper we ask: Is global relation modeling using self-attention necessary, or can we appropriately restrict self-attention calculations to local regimes in large-scale whole slide images (WSIs)? We propose a general-purpose local attention graph-based Transformer for MIL (LA-MIL), introducing an inductive bias by explicitly contextualizing instances in adaptive local regimes of arbitrary size. Additionally, an efficiently adapted loss function enables our approach to learn expressive WSI embeddings for the joint analysis of multiple biomarkers. We demonstrate that LA-MIL achieves state-of-the-art results in mutation prediction for gastrointestinal cancer, outperforming existing models on important biomarkers such as microsatellite instability for colorectal cancer. This suggests that local self-attention sufficiently models dependencies on par with global modules. Our implementation will be published.
    Multiple Domain Causal Networks. (arXiv:2205.06791v1 [stat.ML])
    Observational studies are regarded as economic alternatives to randomized trials, often used in their stead to investigate and determine treatment efficacy. Due to lack of sample size, observational studies commonly combine data from multiple sources or different sites/centers. Despite the benefits of an increased sample size, a naive combination of multicenter data may result in incongruities stemming from center-specific protocols for generating cohorts or reactions towards treatments distinct to a given center, among other things. These issues arise in a variety of other contexts, including capturing a treatment effect related to an individual's unique biological characteristics. Existing methods for estimating heterogeneous treatment effects have not adequately addressed the multicenter context, but rather treat it simply as a means to obtain sufficient sample size. Additionally, previous approaches to estimating treatment effects do not straightforwardly generalize to the multicenter design, especially when required to provide treatment insights for patients from a new, unobserved center. To address these shortcomings, we propose Multiple Domain Causal Networks (MDCN), an approach that simultaneously strengthens the information sharing between similar centers while addressing the selection bias in treatment assignment through learning of a new feature embedding. In empirical evaluations, MDCN is consistently more accurate when estimating the heterogeneous treatment effect in new centers compared to benchmarks that adjust solely based on treatment imbalance or general center differences. Finally, we justify our approach by providing theoretical analyses that demonstrate that MDCN improves on the generalization bound of the new, unobserved target center.
    The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation. (arXiv:2205.06618v1 [cs.CL])
    Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.
    Accelerometry-based classification of circulatory states during out-of-hospital cardiac arrest. (arXiv:2205.06540v1 [eess.SP])
    Objective: During cardiac arrest treatment, a reliable detection of spontaneous circulation, usually performed by manual pulse checks, is both vital for patient survival and practically challenging. Methods: We developed a machine learning algorithm to automatically predict the circulatory state during cardiac arrest treatment from 4-second-long snippets of accelerometry and electrocardiogram data from real-world defibrillator records. The algorithm was trained based on 917 cases from the German Resuscitation Registry, for which ground truth labels were created through manual annotation by physicians. It uses a kernelized Support Vector Machine classifier based on 14 features, which partially reflect the correlation between accelerometry and electrocardiogram data. Results: On a test data set, the proposed algorithm exhibits an accuracy of 94.4 (93.6, 95.2)%, a sensitivity of 95.0 (93.9, 96.1)%, and a specificity of 93.9 (92.7, 95.1)%. Conclusion and significance: In application, the algorithm may be used to simplify retrospective annotation for quality management and, moreover, to support clinicians in assessing circulatory state during cardiac arrest treatment.
    Heavy-Tail Phenomenon in Decentralized SGD. (arXiv:2205.06689v1 [stat.ML])
    Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to 'multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
    Improving Astronomical Time-series Classification via Data Augmentation with Generative Adversarial Networks. (arXiv:2205.06758v1 [astro-ph.IM])
    Due to the latest advances in technology, telescopes with significant sky coverage will produce millions of astronomical alerts per night that must be classified both rapidly and automatically. Currently, classification consists of supervised machine learning algorithms whose performance is limited by the number of existing annotations of astronomical objects and their highly imbalanced class distributions. In this work, we propose a data augmentation methodology based on Generative Adversarial Networks (GANs) to generate a variety of synthetic light curves from variable stars. Our novel contributions, consisting of a resampling technique and an evaluation metric, can assess the quality of generative models in unbalanced datasets and identify GAN-overfitting cases that the Fréchet Inception Distance does not reveal. We applied our proposed model to two datasets taken from the Catalina and Zwicky Transient Facility surveys. The classification accuracy of variable stars is improved significantly when training with synthetic data and testing with real data with respect to the case of using only real data.
    On the Existence of Simpler Machine Learning Models. (arXiv:1908.01755v4 [cs.LG] UPDATED)
    It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.
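    The ratio described in this abstract can be illustrated on a toy problem. The sketch below is not the paper's procedure: it assumes a one-dimensional hypothesis space of threshold classifiers, approximates volume with a uniform grid of thresholds, and uses a hypothetical function name (`rashomon_ratio`) and tolerance parameter (`theta`).

```python
def rashomon_ratio(xs, ys, thresholds, theta):
    """Fraction of threshold classifiers whose 0-1 loss is within theta of the best.

    Each hypothesis predicts 1 when x >= t. The uniform grid of thresholds
    stands in for the volume of the hypothesis space (an assumption of this sketch).
    """
    def loss(t):
        errors = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        return errors / len(xs)

    losses = [loss(t) for t in thresholds]
    best = min(losses)
    # Models counted here form the (approximate) Rashomon set.
    in_set = sum(l <= best + theta for l in losses)
    return in_set / len(losses)


# Toy data: the label is 1 exactly when x >= 0.5.
xs = [i / 20 for i in range(20)]
ys = [int(x >= 0.5) for x in xs]
thresholds = [i / 100 for i in range(101)]

small = rashomon_ratio(xs, ys, thresholds, theta=0.0)
large = rashomon_ratio(xs, ys, thresholds, theta=0.1)
print(small, large)
```

    A larger tolerance can only enlarge the set of near-optimal models, so the estimate is non-decreasing in `theta`.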
    Tensor Decompositions for Hyperspectral Data Processing in Remote Sensing: A Comprehensive Review. (arXiv:2205.06407v1 [cs.CV])
    Owing to the rapid development of sensor technology, hyperspectral (HS) remote sensing (RS) imaging has provided a significant amount of spatial and spectral information for the observation and analysis of the Earth's surface at a distance of data acquisition devices, such as aircraft, spacecraft, and satellite. The recent advancement and even revolution of the HS RS technique offer opportunities to realize the full potential of various applications, while confronting new challenges for efficiently processing and analyzing the enormous HS acquisition data. Due to the maintenance of the 3-D HS inherent structure, tensor decomposition has aroused widespread concern and research in HS data processing tasks over the past decades. In this article, we aim at presenting a comprehensive overview of tensor decomposition, specifically contextualizing the five broad topics in HS data processing, and they are HS restoration, compressed sensing, anomaly detection, super-resolution, and spectral unmixing. For each topic, we elaborate on the remarkable achievements of tensor decomposition models for HS RS with a pivotal description of the existing methodologies and a representative exhibition on the experimental results. As a result, the remaining challenges of the follow-up research directions are outlined and discussed from the perspective of the real HS RS practices and tensor decomposition merged with advanced priors and even with deep neural networks. This article summarizes different tensor decomposition-based HS data processing methods and categorizes them into different classes from simple adoptions to complex combinations with other priors for the algorithm beginners. We also expect this survey can provide new investigations and development trends for the experienced researchers who understand tensor decomposition and HS RS to some extent.
    FastSTMF: Efficient tropical matrix factorization algorithm for sparse data. (arXiv:2205.06619v1 [cs.LG])
    Matrix factorization, one of the most popular methods in machine learning, has recently benefited from introducing non-linearity in prediction tasks using the tropical semiring. The non-linearity enables a better fit to extreme values and distributions, thus discovering high-variance patterns that differ from those found by standard linear algebra. However, the optimization process of various tropical matrix factorization methods is slow. In our work, we propose a new method FastSTMF based on Sparse Tropical Matrix Factorization (STMF), which introduces a novel strategy for updating factor matrices that results in efficient computational performance. We evaluated the efficiency of FastSTMF on synthetic and real gene expression data from the TCGA database, and the results show that FastSTMF outperforms STMF in both accuracy and running time. Compared to NMF, we show that FastSTMF performs better on some datasets and is not as prone to overfitting as NMF. This work sets the basis for developing other matrix factorization techniques based on many other semirings using a new proposed optimization process.
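    To make the non-linearity concrete, here is a minimal sketch of a matrix product over a tropical semiring. It assumes the max-plus convention (addition replaced by max, multiplication replaced by +); the function name `maxplus_matmul` is hypothetical, and the sketch shows only the product, not the FastSTMF update strategy.

```python
def maxplus_matmul(A, B):
    """Max-plus (tropical) product: C[i][j] = max over k of (A[i][k] + B[k][j])."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("inner dimensions must match")
    cols = len(B[0])
    return [
        [max(row[k] + B[k][j] for k in range(inner)) for j in range(cols)]
        for row in A
    ]


# Two small factor matrices; their tropical product approximates a data matrix.
U = [[0.0, 2.0], [1.0, -1.0]]
V = [[3.0, 0.0], [1.0, 4.0]]
R = maxplus_matmul(U, V)
print(R)  # [[3.0, 6.0], [4.0, 3.0]]
```

    Each entry of the product is determined by a single dominant term rather than a sum of terms, which is one way to see why such models can fit extreme values.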
    Deep Reinforcement Learning for Computational Fluid Dynamics on HPC Systems. (arXiv:2205.06502v1 [cs.LG])
    Reinforcement learning (RL) is highly suitable for devising control strategies in the context of dynamical systems. A prominent instance of such a dynamical system is the system of equations governing fluid dynamics. Recent research results indicate that RL-augmented computational fluid dynamics (CFD) solvers can exceed the current state of the art, for example in the field of turbulence modeling. However, while in supervised learning, the training data can be generated a priori in an offline manner, RL requires constant run-time interaction and data exchange with the CFD solver during training. In order to leverage the potential of RL-enhanced CFD, the interaction between the CFD solver and the RL algorithm thus has to be implemented efficiently on high-performance computing (HPC) hardware. To this end, we present Relexi as a scalable RL framework that bridges the gap between machine learning workflows and modern CFD solvers on HPC systems, providing both components with their specialized hardware. Relexi is built with modularity in mind and allows easy integration of various HPC solvers by means of the in-memory data transfer provided by the SmartSim library. Here, we demonstrate that the Relexi framework can scale up to hundreds of parallel environments on thousands of cores. This makes it possible to leverage modern HPC resources to either enable larger problems or faster turnaround times. Finally, we demonstrate the potential of an RL-augmented CFD solver by finding a control strategy for optimal eddy viscosity selection in large eddy simulations.
    Two-layer neural networks with values in a Banach space. (arXiv:2105.02095v2 [cs.LG] UPDATED)
    We study two-layer neural networks whose domain and range are Banach spaces with separable preduals. In addition, we assume that the image space is equipped with a partial order, i.e. it is a Riesz space. As the nonlinearity we choose the lattice operation of taking the positive part; in case of $\mathbb R^d$-valued neural networks this corresponds to the ReLU activation function. We prove inverse and direct approximation theorems with Monte-Carlo rates for a certain class of functions, extending existing results for the finite-dimensional case. In the second part of the paper, we study, from the regularisation theory viewpoint, the problem of finding optimal representations of such functions via signed measures on a latent space from a finite number of noisy observations. We discuss regularity conditions known as source conditions and obtain convergence rates in a Bregman distance for the representing measure in the regime when both the noise level goes to zero and the number of samples goes to infinity at appropriate rates.
    Discovering the building blocks of dark matter halo density profiles with neural networks. (arXiv:2203.08827v2 [astro-ph.CO] UPDATED)
    The density profiles of dark matter halos are typically modeled using empirical formulae fitted to the density profiles of relaxed halo populations. We present a neural network model that is trained to learn the mapping from the raw density field containing each halo to the dark matter density profile. We show that the model recovers the widely-used Navarro-Frenk-White (NFW) profile out to the virial radius, and can additionally describe the variability in the outer profile of the halos. The neural network architecture consists of a supervised encoder-decoder framework, which first compresses the density inputs into a low-dimensional latent representation, and then outputs $\rho(r)$ for any desired value of radius $r$. The latent representation contains all the information used by the model to predict the density profiles. This allows us to interpret the latent representation by quantifying the mutual information between the representation and the halos' ground-truth density profiles. A two-dimensional representation is sufficient to accurately model the density profiles up to the virial radius; however, a three-dimensional representation is required to describe the outer profiles beyond the virial radius. The additional dimension in the representation contains information about the infalling material in the outer profiles of dark matter halos, thus discovering the splashback boundary of halos without prior knowledge of the halos' dynamical history.
    Space4HGNN: A Novel, Modularized and Reproducible Platform to Evaluate Heterogeneous Graph Neural Network. (arXiv:2202.09177v2 [cs.LG] UPDATED)
    Heterogeneous Graph Neural Network (HGNN) has been successfully employed in various tasks, but we cannot accurately know the importance of different design dimensions of HGNNs due to diverse architectures and applied scenarios. Besides, in the research community of HGNNs, implementing and evaluating various tasks still need much human effort. To mitigate these issues, we first propose a unified framework covering most HGNNs, consisting of three components: heterogeneous linear transformation, heterogeneous graph transformation, and heterogeneous message passing layer. Then we build a platform Space4HGNN by defining a design space for HGNNs based on the unified framework, which offers modularized components, reproducible implementations, and standardized evaluation for HGNNs. Finally, we conduct experiments to analyze the effect of different designs. With the insights found, we distill a condensed design space and verify its effectiveness.
    Stability to Deformations in Manifold Neural Networks. (arXiv:2106.03725v2 [cs.LG] UPDATED)
    Stability is an important property of graph neural networks (GNNs) which explains their success in many problems of practical interest. Existing GNN stability results depend on the size of the graph, restricting applicability to graphs of moderate size. To understand the stability properties of GNNs on large graphs, we define manifold convolutions and consider neural networks supported on manifolds. These are defined in terms of manifold diffusions mediated by the Laplace-Beltrami (LB) operator and are interpreted as limits of GNNs running on graphs of growing size. We define manifold deformations and show that they lead to perturbations of the manifold's LB operator that consist of an absolute and a relative perturbation term. We then define two frequency dependent manifold filters that split the infinite dimensional spectrum of the LB operator in finite partitions, and prove that these filters are stable to absolute and relative perturbations of the LB operator respectively. We also observe a trade-off between the stability and the discriminability from the stability bounds. Moreover, manifold neural networks (MNNs) composed of these filters inherit the stability properties while the nonlinear activation function helps to improve the discriminability. Therefore, the MNNs can be both stable and discriminative. We verify our results numerically in shape classification with point cloud datasets.
    Data-Driven Upper Bounds on Channel Capacity. (arXiv:2205.06471v1 [cs.IT])
    We consider the problem of estimating an upper bound on the capacity of a memoryless channel with unknown channel law and continuous output alphabet. A novel data-driven algorithm is proposed that exploits the dual representation of capacity where the maximization over the input distribution is replaced with a minimization over a reference distribution on the channel output. To efficiently compute the required divergence maximization between the conditional channel and the reference distribution, we use a modified mutual information neural estimator that takes the channel input as an additional parameter. We evaluate our approach on different memoryless channels and show that the estimated upper bounds closely converge either to the channel capacity or to best-known lower bounds.
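    The dual representation mentioned above can be checked on a small example. The sketch below is a simplification: it uses a discrete binary symmetric channel rather than a continuous output alphabet, it evaluates the divergence in closed form instead of with a neural estimator, and the function names are hypothetical. It relies on the standard fact that for any reference distribution q on the output, the maximum over inputs x of D(P(.|x) || q) upper-bounds the capacity.

```python
import math


def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must have full support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_upper_bound(channel, q):
    """Maximum over input symbols x of D(P(.|x) || q), with one channel row per input."""
    return max(kl_bits(row, q) for row in channel)


def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


flip = 0.1
bsc = [[1 - flip, flip], [flip, 1 - flip]]  # rows are P(y | x)
capacity = 1 - binary_entropy(flip)         # known capacity of this channel

tight = dual_upper_bound(bsc, [0.5, 0.5])   # capacity-achieving output distribution
loose = dual_upper_bound(bsc, [0.7, 0.3])   # any other reference is still valid
print(capacity, tight, loose)
```

    Minimizing the bound over the reference distribution is what tightens it; the uniform reference happens to be optimal for this symmetric channel.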
    Detecting Rumours with Latency Guarantees using Massive Streaming Data. (arXiv:2205.06580v1 [cs.SI])
    Today's social networks continuously generate massive streams of data, which provide a valuable starting point for the detection of rumours as soon as they start to propagate. However, rumour detection faces tight latency bounds, which cannot be met by contemporary algorithms, given the sheer volume of high-velocity streaming data emitted by social networks. Hence, in this paper, we argue for best-effort rumour detection that detects most rumours quickly rather than all rumours with a high delay. To this end, we combine techniques for efficient, graph-based matching of rumour patterns with effective load shedding that discards some of the input data while minimising the loss in accuracy. Experiments with large-scale real-world datasets illustrate the robustness of our approach in terms of runtime performance and detection accuracy under diverse streaming conditions.
    DRBM-ClustNet: A Deep Restricted Boltzmann-Kohonen Architecture for Data Clustering. (arXiv:2205.06697v1 [cs.LG])
    A Bayesian Deep Restricted Boltzmann-Kohonen architecture for data clustering termed DRBM-ClustNet is proposed. This core-clustering engine consists of a Deep Restricted Boltzmann Machine (DRBM) for processing unlabeled data by creating new features that are uncorrelated and have large variance with each other. Next, the number of clusters is predicted using the Bayesian Information Criterion (BIC), followed by a Kohonen Network-based clustering layer. The processing of unlabeled data is done in three stages for efficient clustering of the non-linearly separable datasets. In the first stage, DRBM performs non-linear feature extraction by capturing the highly complex data representation by projecting the feature vectors of $d$ dimensions into $n$ dimensions. Most clustering algorithms require the number of clusters to be decided a priori, hence here to automate the number of clusters in the second stage we use BIC. In the third stage, the number of clusters derived from BIC forms the input for the Kohonen network, which performs clustering of the feature-extracted data obtained from the DRBM. This method overcomes the general disadvantages of clustering algorithms like the prior specification of the number of clusters, convergence to local optima and poor clustering accuracy on non-linear datasets. In this research we use two synthetic datasets, fifteen benchmark datasets from the UCI Machine Learning repository, and four image datasets to analyze the DRBM-ClustNet. The proposed framework is evaluated based on clustering accuracy and ranked against other state-of-the-art clustering methods. The obtained results demonstrate that the DRBM-ClustNet outperforms state-of-the-art clustering algorithms.
    DualCF: Efficient Model Extraction Attack from Counterfactual Explanations. (arXiv:2205.06504v1 [cs.CR])
    Cloud service providers have launched Machine-Learning-as-a-Service (MLaaS) platforms to allow users to access large-scale cloud-based models via APIs. In addition to prediction outputs, these APIs can also provide other information in a more human-understandable way, such as counterfactual explanations (CF). However, such extra information inevitably causes the cloud models to be more vulnerable to extraction attacks which aim to steal the internal functionality of models in the cloud. Due to the black-box nature of cloud models, however, a vast number of queries are inevitably required by existing attack strategies before the substitute model achieves high fidelity. In this paper, we propose a novel, simple yet efficient querying strategy to greatly enhance the querying efficiency to steal a classification model. This is motivated by our observation that current querying strategies suffer from a decision boundary shift issue induced by taking far-distant queries and close-to-boundary CFs into substitute model training. We then propose the DualCF strategy to circumvent the above issues, which is achieved by taking not only CF but also counterfactual explanation of CF (CCF) as pairs of training samples for the substitute model. Extensive and comprehensive experimental evaluations are conducted on both synthetic and real-world datasets. The experimental results favorably illustrate that DualCF can produce a high-fidelity model with fewer queries efficiently and effectively.
    l-Leaks: Membership Inference Attacks with Logits. (arXiv:2205.06469v1 [cs.LG])
    Machine Learning (ML) has made unprecedented progress in the past several decades. However, due to the memorability of the training data, ML is susceptible to various attacks, especially Membership Inference Attacks (MIAs), the objective of which is to infer the model's training data. So far, most of the membership inference attacks against ML classifiers leverage the shadow model with the same structure as the target model. However, empirical results show that these attacks can be easily mitigated if the shadow model is not clear about the network structure of the target model. In this paper, we present attacks based on black-box access to the target model. We name our attack l-Leaks. The l-Leaks follows the intuition that if an established shadow model is similar enough to the target model, then the adversary can leverage the shadow model's information to predict a target sample's membership. The logits of the trained target model contain valuable sample knowledge. We build the shadow model by learning the logits of the target model and making the shadow model more similar to the target model. The shadow model will then have sufficient confidence in the member samples of the target model. We also discuss the effect of the shadow model's different network structures on attack results. Experiments over different networks and datasets demonstrate that both of our attacks achieve strong performance.
    Uninorm-like parametric activation functions for human-understandable neural models. (arXiv:2205.06547v1 [cs.AI])
    We present a deep learning model for finding human-understandable connections between input features. Our approach uses a parameterized, differentiable activation function, based on the theoretical background of nilpotent fuzzy logic and multi-criteria decision-making (MCDM). The learnable parameter has a semantic meaning indicating the level of compensation between input features. The neural network determines the parameters using gradient descent to find human-understandable relationships between input features. We demonstrate the utility and effectiveness of the model by successfully applying it to classification problems from the UCI Machine Learning Repository.
    FRC-TOuNN: Topology Optimization of Continuous Fiber Reinforced Composites using Neural Network. (arXiv:2205.03737v1 [cs.CE] CROSS LISTED)
    In this paper, we present a topology optimization (TO) framework to simultaneously optimize the matrix topology and fiber distribution of functionally graded continuous fiber-reinforced composites (FRC). Current approaches in density-based TO for FRC use the underlying finite element mesh both for analysis and design representation. This poses several limitations while enforcing sub-element fiber spacing and generating high-resolution continuous fibers. In contrast, we propose a mesh-independent representation based on a neural network (NN) both to capture the matrix topology and fiber distribution. The implicit NN-based representation enables geometric and material queries at a higher resolution than a mesh discretization. This leads to the accurate extraction of functionally-graded continuous fibers. Further, by integrating the finite element simulations into the NN computational framework, we can leverage automatic differentiation for end-to-end automated sensitivity analysis, i.e., we no longer need to manually derive cumbersome sensitivity expressions. We demonstrate the effectiveness and computational efficiency of the proposed method through several numerical examples involving various objective functions. We also show that the optimized continuous fiber reinforced composites can be directly fabricated at high resolution using additive manufacturing.
    Modular Adaptive Policy Selection for Multi-Task Imitation Learning through Task Division. (arXiv:2203.14855v2 [cs.LG] UPDATED)
    Deep imitation learning requires many expert demonstrations, which can be hard to obtain, especially when many tasks are involved. However, different tasks often share similarities, so learning them jointly can greatly benefit them and alleviate the need for many demonstrations. But, joint multi-task learning often suffers from negative transfer, sharing information that should be task-specific. In this work, we introduce a method to perform multi-task imitation while allowing for task-specific features. This is done by using proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared. The proto-policies operate in parallel and are adaptively chosen by a selector mechanism that is jointly trained with the modules. Experiments on different sets of tasks show that our method improves upon the accuracy of single agents, task-conditioned and multi-headed multi-task agents, as well as state-of-the-art meta learning agents. We also demonstrate its ability to autonomously divide the tasks into both shared and task-specific sub-behaviours.
    Exploiting Expert-guided Symmetry Detection in Offline Reinforcement Learning. (arXiv:2112.09943v2 [cs.LG] UPDATED)
    Offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task that greatly depends on the data available to the learning phase. Sometimes the dynamics of the model is invariant with respect to some transformations of the current state and action. Recent works showed that an expert-guided pipeline relying on Density Estimation methods such as Deep Neural Network based Normalizing Flows effectively detects this structure in deterministic environments, both categorical and continuous-valued. The acquired knowledge can be exploited to augment the original data set, leading eventually to a reduction in the distributional shift between the true and the learnt model. Such a data augmentation technique can be exploited as a preliminary process to be executed before the adoption of an Offline Reinforcement Learning architecture, increasing its performance. In this work we extend the paradigm to also tackle non-deterministic MDPs, in particular 1) we propose a detection threshold in categorical environments based on statistical distances, and 2) we show that the former results lead to a performance improvement when solving the learnt MDP and then applying the optimal policy in the real environment.
    Rainbow Differential Privacy. (arXiv:2202.03974v2 [cs.CR] UPDATED)
    We extend a previous framework for designing differentially private (DP) mechanisms via randomized graph colorings that was restricted to binary functions, corresponding to colorings in a graph, to multi-valued functions. As before, datasets are nodes in the graph and any two neighboring datasets are connected by an edge. In our setting, we assume that each dataset has a preferential ordering for the possible outputs of the mechanism, each of which we refer to as a rainbow. Different rainbows partition the graph of datasets into different regions. We show that if the DP mechanism is pre-specified at the boundary of such regions and behaves identically for all same-rainbow boundary datasets, at most one optimal such mechanism can exist and the problem can be solved by means of a morphism to a line graph. We then show closed form expressions for the line graph in the case of ternary functions. Treatment of ternary queries in this paper displays enough richness to be extended to higher-dimensional query spaces with preferential query ordering, but the optimality proof does not seem to follow directly from the ternary proof.
    Improving Contextual Representation with Gloss Regularized Pre-training. (arXiv:2205.06603v1 [cs.CL])
    Though achieving impressive results on many NLP tasks, the BERT-like masked language models (MLM) encounter the discrepancy between pre-training and inference. In light of this gap, we investigate the contextual representation of pre-training and inference from the perspective of word probability distribution. We discover that BERT risks neglecting the contextual word similarity in pre-training. To tackle this issue, we propose an auxiliary gloss regularizer module to BERT pre-training (GR-BERT), to enhance word semantic similarity. By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, the word similarity can be explicitly modeled. We design two architectures for GR-BERT and evaluate our model in downstream tasks. Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation. The GR-BERT achieves new state-of-the-art in lexical substitution task and greatly promotes BERT sentence representation in both unsupervised and supervised STS tasks.
    AutoMat: Accelerated Computational Electrochemical systems Discovery. (arXiv:2011.04426v4 [cond-mat.mtrl-sci] UPDATED)
    Large-scale electrification is vital to addressing the climate crisis, but several scientific and technological challenges remain to fully electrify both the chemical industry and transportation. In both of these areas, new electrochemical materials will be critical, but their development currently relies heavily on human-time-intensive experimental trial and error and computationally expensive first-principles, meso-scale and continuum simulations. We present an automated workflow, AutoMat, that accelerates these computational steps by introducing both automated input generation and management of simulations across scales from first principles to continuum device modeling. Furthermore, we show how to seamlessly integrate multi-fidelity predictions such as machine learning surrogates or automated robotic experiments "in-the-loop". The automated framework is implemented with design space search techniques to dramatically accelerate the overall materials discovery pipeline by implicitly learning design features that optimize device performance across several metrics. We discuss the benefits of AutoMat using examples in electrocatalysis and energy storage and highlight lessons learned.
    Fast Conditional Network Compression Using Bayesian HyperNetworks. (arXiv:2205.06404v1 [cs.LG])
    We introduce a conditional compression problem and propose a fast framework for tackling it. The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts, e.g. a context involving only a subset of classes or a context where only limited compute resource is available. To solve this, we propose an efficient Bayesian framework to compress a given large network into much smaller size tailored to meet each contextual requirement. We employ a hypernetwork to parameterize the posterior distribution of weights given conditional inputs and minimize a variational objective of this Bayesian neural network. To further reduce the network sizes, we propose a new input-output group sparsity factorization of weights to encourage more sparseness in the generated weights. Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.
    Verifiable and Compositional Reinforcement Learning Systems. (arXiv:2106.05864v3 [cs.LG] UPDATED)
    We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process (pMDP) which is used to plan and to analyze compositions of subsystems, and of the collection of low-level subsystems themselves. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e. achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems; if they each learn a policy satisfying the appropriate subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the pMDP, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications, and for training the subsystems to meet them. As an additional benefit, this procedure allows for particularly challenging or important components of an overall task to be determined automatically, and focused on, during training. Experimental results demonstrate the presented framework's novel capabilities.
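    The decomposition idea can be illustrated with a deliberately simplified case. The sketch below assumes the subsystems run strictly in sequence, with each exit condition matching the next entry condition, so that the overall success probability is at least the product of the per-subtask minimum probabilities. The paper's high-level pMDP covers more general compositions; the function names here are hypothetical.

```python
import math


def composition_lower_bound(subtask_probs):
    """Lower bound on overall success for subsystems executed one after another."""
    return math.prod(subtask_probs)


def meets_task_spec(subtask_probs, required):
    """Check whether the chained subtask guarantees imply the overall requirement."""
    return composition_lower_bound(subtask_probs) >= required


# Overall specification from the abstract: reach the target with probability >= 0.95.
ok_specs = [0.99, 0.98, 0.99]
weak_specs = [0.97, 0.97, 0.97]

print(meets_task_spec(ok_specs, 0.95))    # True
print(meets_task_spec(weak_specs, 0.95))  # False
```

    When the check fails, as with `weak_specs`, the per-subtask requirements would need to be revised, which is the role of the iterative update described in the abstract.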
    Collaborative Drug Discovery: Inference-level Data Protection Perspective. (arXiv:2205.06506v1 [cs.CR])
    The pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data; hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborative modeling in the preclinical phase of drug discovery to accelerate the selection of promising drug candidates. After a short taxonomy of state-of-the-art inference attacks, we adopt and customize several to the underlying scenario. Finally, we describe and experiment with a handful of relevant privacy protection techniques to mitigate such attacks.
    Computing Multiple Image Reconstructions with a Single Hypernetwork. (arXiv:2202.11009v3 [cs.CV] UPDATED)
    Deep learning based techniques achieve state-of-the-art results in a wide range of image reconstruction tasks like compressed sensing. These methods almost always have hyperparameters, such as the weight coefficients that balance the different terms in the optimized loss function. The typical approach is to train the model for a hyperparameter setting determined with some empirical or theoretical justification. Thus, at inference time, the model can only compute reconstructions corresponding to the pre-determined hyperparameter values. In this work, we present a hypernetwork-based approach, called HyperRecon, to train reconstruction models that are agnostic to hyperparameter settings. At inference time, HyperRecon can efficiently produce diverse reconstructions, which would each correspond to different hyperparameter values. In this framework, the user is empowered to select the most useful output(s) based on their own judgement. We demonstrate our method in compressed sensing, super-resolution and denoising tasks, using two large-scale and publicly-available MRI datasets. Our code is available at https://github.com/alanqrwang/hyperrecon.  ( 2 min )
    Autonomous Navigation and Configuration of Integrated Access Backhauling for UAV Base Station Using Reinforcement Learning. (arXiv:2112.07313v2 [cs.LG] UPDATED)
    Fast and reliable connectivity is essential to enhancing situational awareness and operational efficiency for public safety mission-critical (MC) users. In emergency or disaster circumstances, where existing cellular network coverage and capacity may not be available to meet MC communication demands, deployable-network-based solutions such as cells-on-wheels/wings can be utilized swiftly to ensure reliable connection for MC users. In this paper, we consider a scenario where a macro base station (BS) is destroyed due to a natural disaster and an unmanned aerial vehicle carrying BS (UAV-BS) is set up to provide temporary coverage for users in the disaster area. The UAV-BS is integrated into the mobile network using the 5G integrated access and backhaul (IAB) technology. We propose a framework and signalling procedure for applying machine learning to this use case. A deep reinforcement learning algorithm is designed to jointly optimize the access and backhaul antenna tilt as well as the three-dimensional location of the UAV-BS in order to best serve the on-ground MC users while maintaining a good backhaul connection. Our result shows that the proposed algorithm can autonomously navigate and configure the UAV-BS to improve the throughput and reduce the drop rate of MC users.  ( 2 min )
    CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. (arXiv:2201.01018v2 [q-bio.GN] UPDATED)
    Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages. Thus, there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer amount of sequenced phages by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein and DNA-based sequence features. Our implementation named CHERRY can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with 11 popular host prediction methods. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance on short contigs is more stable than other tools.
    A Machine Learning Analysis of Impact of the Covid-19 Pandemic on Alcohol Consumption Habit Changes Among Healthcare Workers in the U.S. (arXiv:2112.06261v2 [cs.LG] UPDATED)
    In this paper, we discuss the impact of the Covid-19 pandemic on alcohol consumption habit changes among healthcare workers in the United States. We utilize multiple supervised and unsupervised machine learning methods and models such as Decision Trees, Logistic Regression, Naive Bayes classifier, k-Nearest Neighbors, Support Vector Machines, Multilayer perceptron, Random Forests, XGBoost, CatBoost, LightGBM, Synthetic Minority Oversampling, Chi-Squared Test and mutual information method on mental health survey data obtained from the University of Michigan Inter-University Consortium for Political and Social Research to find relationships between COVID-19 related negative effects and alcohol consumption habit changes among healthcare workers. Our findings suggest that COVID-19-related school closures, COVID-19-related work schedule changes and COVID-related news exposure may lead to an increase in alcohol use among healthcare workers in the United States.
    ARCADE: Adversarially Regularized Convolutional Autoencoder for Network Anomaly Detection. (arXiv:2205.01432v2 [cs.LG] UPDATED)
    As the number of heterogeneous IP-connected devices and traffic volume increase, so does the potential for security breaches. The undetected exploitation of these breaches can bring severe cybersecurity and privacy risks. In this paper, we present a practical unsupervised anomaly-based deep learning detection system called ARCADE (Adversarially Regularized Convolutional Autoencoder for unsupervised network anomaly DEtection). ARCADE exploits the property of 1D Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GAN) to automatically build a profile of the normal traffic based on a subset of raw bytes of a few initial packets of network flows so that potential network anomalies and intrusions can be effectively detected before they could cause any more damage to the network. A convolutional Autoencoder (AE) is proposed that suits online detection in resource-constrained environments, and can be easily improved for environments with higher computational capabilities. An adversarial training strategy is proposed to regularize and decrease the AE's capabilities to reconstruct network flows that are out of the normal distribution, and thereby improve its anomaly detection capabilities. The proposed approach is more effective than existing state-of-the-art deep learning approaches for network anomaly detection and significantly reduces detection time. The evaluation results show that the proposed approach is suitable for anomaly detection on resource-constrained hardware platforms such as Raspberry Pi.
    Enhanced Bilevel Optimization via Bregman Distance. (arXiv:2107.12301v2 [math.OC] UPDATED)
    Bilevel optimization has been widely applied to many machine learning problems such as hyperparameter optimization, policy optimization and meta learning. Although many bilevel optimization methods have recently been proposed to solve bilevel optimization problems, they still suffer from high computational complexities and do not consider the more general bilevel problems with nonsmooth regularization. In this paper, we therefore propose a class of enhanced bilevel optimization methods by using Bregman distance to solve bilevel optimization problems, where the outer subproblem is nonconvex and possibly nonsmooth, and the inner subproblem is strongly convex. Specifically, we propose a bilevel optimization method based on Bregman distance (BiO-BreD) for solving deterministic bilevel problems, which reaches a lower computational complexity than the best known results. Meanwhile, we also propose a stochastic bilevel optimization method (SBiO-BreD) to solve stochastic bilevel problems based on stochastic approximated gradients and Bregman distance. Moreover, we further propose an accelerated version of the SBiO-BreD method (ASBiO-BreD) by using the variance-reduced technique, which achieves a lower computational complexity than the best known computational complexity with respect to condition number $\kappa$ and target accuracy $\epsilon$ for finding an $\epsilon$-stationary point. We employ a data hyper-cleaning task to demonstrate that our algorithms outperform the existing bilevel algorithms.
    Anomaly Detection using Principles of Human Perception. (arXiv:2103.12323v4 [cs.CR] UPDATED)
    In the fields of statistics and unsupervised machine learning a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random variable modelling anomalies are directly found in a set of data by a uniform and independent random assumption of the distribution of constituent elements of the observations, with anomalies corresponding to those observations where the expectation of the number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data with promising results on the detection of global anomalies in multivariate data.  ( 2 min )
    Möbius Convolutions for Spherical CNNs. (arXiv:2201.12212v2 [cs.CV] UPDATED)
    Möbius transformations play an important role in both geometry and spherical image processing - they are the group of conformal automorphisms of 2D surfaces and the spherical equivalent of homographies. Here we present a novel, Möbius-equivariant spherical convolution operator which we call Möbius convolution, and with it, develop the foundations for Möbius-equivariant spherical CNNs. Our approach is based on a simple observation: to achieve equivariance, we only need to consider the lower-dimensional subgroup which transforms the positions of points as seen in the frames of their neighbors. To efficiently compute Möbius convolutions at scale we derive an approximation of the action of the transformations on spherical filters, allowing us to compute our convolutions in the spectral domain with the fast Spherical Harmonic Transform. The resulting framework is both flexible and descriptive, and we demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v8 [cs.LG] UPDATED)
    In this paper, we propose a novel neural exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of work explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB are adapted for exploration. Instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn potential gains compared to the currently estimated reward for exploration. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.
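    The three classical exploration rules named in this abstract (epsilon-greedy, UCB, and Thompson Sampling) can be sketched for the simple non-contextual case. This is an illustrative sketch only and is not EE-Net: the function names, the UCB1 bonus sqrt(2 ln t / n), and the Beta-Bernoulli form of Thompson Sampling are assumptions chosen for brevity.

```python
import math
import random

def epsilon_greedy(means, epsilon, rng=random):
    """With probability epsilon pick a uniformly random arm, else the best empirical arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def ucb1(means, counts, t):
    """Pick the arm maximizing empirical mean plus a UCB1-style confidence bonus."""
    for a, n in enumerate(counts):
        if n == 0:  # play every arm once before the bonus is defined
            return a
    return max(
        range(len(means)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )

def thompson_bernoulli(successes, failures, rng=random):
    """Sample each arm's mean from a Beta(1 + s, 1 + f) posterior; pick the largest sample."""
    samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```

    For instance, `ucb1([0.5, 0.45], [100, 1], 101)` selects arm 1: its mean is slightly lower, but having been played only once it carries a much larger bonus. All three rules derive exploration from counts or posteriors, whereas the abstract describes learning the exploration signal with a second network.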
    Self-Supervised Learning for Domain Adaptation on Point-Clouds. (arXiv:2003.12641v5 [cs.CV] UPDATED)
    Self-supervised learning (SSL) is a technique for learning useful representations from unlabeled data. It has been applied effectively to domain adaptation (DA) on images and videos. It is still unknown if and how it can be leveraged for domain adaptation in 3D perception problems. Here we describe the first study of SSL for DA on point clouds. We introduce a new family of pretext tasks, Deformation Reconstruction, inspired by the deformations encountered in sim-to-real transformations. In addition, we propose a novel training procedure for labeled point cloud data motivated by the MixUp method called Point cloud Mixup (PCM). Evaluations on domain adaptation datasets for classification and segmentation demonstrate a large improvement over existing and baseline methods.
    Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. (arXiv:2205.06811v1 [cs.LG])
    We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
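    The weighted ridge regression described in this abstract can be sketched as follows. The abstract only says that each action's weight depends on its confidence up to a threshold; the specific rule used here, w = min(1, alpha / ||x||_{Sigma^{-1}}), together with the class name `WeightedRidge` and the parameters `lam` and `alpha`, are assumptions made for illustration and should not be read as the paper's exact algorithm.

```python
import math

def matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class WeightedRidge:
    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.alpha = alpha  # hypothetical uncertainty threshold
        # Inverse of the regularized Gram matrix, starting from (lam * I)^-1.
        self.Sinv = [[1.0 / lam if i == j else 0.0 for j in range(dim)]
                     for i in range(dim)]
        self.b = [0.0] * dim

    def weight(self, x):
        """Return 1 for well-explored directions, alpha / uncertainty otherwise."""
        u = math.sqrt(sum(xi * si for xi, si in zip(x, matvec(self.Sinv, x))))
        return 1.0 if u <= self.alpha else self.alpha / u

    def update(self, x, reward):
        """Add one weighted observation and return the weight that was used."""
        w = self.weight(x)
        Sx = matvec(self.Sinv, x)
        denom = 1.0 + w * sum(xi * si for xi, si in zip(x, Sx))
        n = len(x)
        # Sherman-Morrison rank-one update of (Sigma + w * x x^T)^-1.
        self.Sinv = [[self.Sinv[i][j] - w * Sx[i] * Sx[j] / denom
                      for j in range(n)] for i in range(n)]
        self.b = [bi + w * reward * xi for bi, xi in zip(self.b, x)]
        return w

    def theta(self):
        """Current weighted ridge estimate Sigma^-1 b."""
        return matvec(self.Sinv, self.b)
```

    Under this assumed rule, a reward observed on a highly uncertain action enters the estimate with a small weight, which limits how far a single corrupted reward can move `theta()`.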
    Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime. (arXiv:2205.06798v1 [cs.LG])
    The generalization performance of kernel ridge regression (KRR) exhibits a multi-phased pattern that crucially depends on the scaling relationship between the sample size $n$ and the underlying dimension $d$. This phenomenon is due to the fact that KRR sequentially learns functions of increasing complexity as the sample size increases; when $d^{k-1}\ll n\ll d^{k}$, only polynomials with degree less than $k$ are learned. In this paper, we present sharp asymptotic characterization of the performance of KRR at the critical transition regions with $n \asymp d^k$, for $k\in\mathbb{Z}^{+}$. Our asymptotic characterization provides a precise picture of the whole learning process and clarifies the impact of various parameters (including the choice of the kernel function) on the generalization performance. In particular, we show that the learning curves of KRR can have a delicate "double descent" behavior due to specific bias-variance trade-offs at different polynomial scaling regimes.
    A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities. (arXiv:2205.06743v1 [cs.LG])
    Few-shot learning (FSL) has emerged as an effective learning method and shows great potential. Despite the recent creative works in tackling FSL tasks, learning valid information rapidly from just a few or even zero samples still remains a serious challenge. In this context, we extensively investigated 200+ latest papers on FSL published in the past three years, aiming to present a timely and comprehensive overview of the most recent advances in FSL along with impartial comparisons of the strengths and weaknesses of the existing works. For the sake of avoiding conceptual confusion, we first elaborate and compare a set of similar concepts including few-shot learning, transfer learning, and meta-learning. Furthermore, we propose a novel taxonomy to classify the existing work according to the level of abstraction of knowledge in accordance with the challenges of FSL. To enrich this survey, in each subsection we provide in-depth analysis and insightful discussion about recent advances on these topics. Moreover, taking computer vision as an example, we highlight the important application of FSL, covering various research hotspots. Finally, we conclude the survey with unique insights into the technology evolution trends together with potential future research opportunities in the hope of providing guidance to follow-up research.
    Differentiable Graph Module (DGM) for Graph Convolutional Networks. (arXiv:2002.04999v4 [cs.LG] UPDATED)
    Graph deep learning has recently emerged as a powerful ML concept that generalizes successful deep neural architectures to non-Euclidean structured data. Such methods have shown promising results on a broad spectrum of applications ranging from social science, biomedicine, and particle physics to computer vision, graphics, and chemistry. One of the limitations of the majority of current graph neural network architectures is that they are often restricted to the transductive setting and rely on the assumption that the underlying graph is known and fixed. Often, this assumption is not true since the graph may be noisy, or partially and even completely unknown. In such cases, it would be helpful to infer the graph directly from the data, especially in inductive settings where some nodes were not present in the graph at training time. Furthermore, learning a graph may become an end in itself, as the inferred structure may provide complementary insights next to the downstream task. In this paper, we introduce Differentiable Graph Module (DGM), a learnable function that predicts edge probabilities in the graph which are optimal for the downstream task. DGM can be combined with convolutional graph neural network layers and trained in an end-to-end fashion. We provide an extensive evaluation of applications from the domains of healthcare (disease prediction), brain imaging (age prediction), computer graphics (3D point cloud segmentation), and computer vision (zero-shot learning). We show that our model provides a significant improvement over baselines both in transductive and inductive settings and achieves state-of-the-art results.
    Productivity Assessment of Neural Code Completion. (arXiv:2205.06537v1 [cs.SE])
    Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers' productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate with which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers' perception of productivity.
    Latent-Graph Learning for Disease Prediction. (arXiv:2003.13620v2 [cs.LG] UPDATED)
    Recently, Graph Convolutional Networks (GCNs) have proven to be a powerful machine learning tool for Computer-Aided Diagnosis (CADx) and disease prediction. A key component in these models is to build a population graph, where the graph adjacency matrix represents pair-wise patient similarities. Until now, the similarity metrics have been defined manually, usually based on meta-features like demographics or clinical scores. The definition of the metric, however, needs careful tuning, as GCNs are very sensitive to the graph structure. In this paper, we demonstrate for the first time in the CADx domain that it is possible to learn a single, optimal graph towards the GCN's downstream task of disease classification. To this end, we propose a novel, end-to-end trainable graph learning architecture for dynamic and localized graph pruning. Unlike commonly employed spectral GCN approaches, our GCN is spatial and inductive, and can thus infer previously unseen patients as well. We demonstrate significant classification improvements with our learned graph on two CADx problems in medicine. We further explain and visualize this result using an artificial dataset, underlining the importance of graph learning for more accurate and robust inference with GCNs in medical applications.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v1 [stat.ML])
    The rapid progress of machine learning interatomic potentials over the past couple of years produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state of the art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v1 [stat.ML])
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
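    For reference, the Wasserstein distance between Gaussian measures that this abstract relies on has a well-known closed form. The notation below ($m_i$ for means, $C_i$ for covariance operators) is chosen here for illustration and is not taken from the paper; how GWI estimates these quantities is not specified in the abstract.

```latex
W_2^2\bigl(\mathcal{N}(m_1, C_1),\, \mathcal{N}(m_2, C_2)\bigr)
  = \lVert m_1 - m_2 \rVert^2
  + \operatorname{tr}\!\Bigl(C_1 + C_2 - 2\bigl(C_1^{1/2} C_2\, C_1^{1/2}\bigr)^{1/2}\Bigr)
```

    In one dimension this reduces to $(m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$, which makes clear that the distance stays finite even when the two measures do not share support.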
    The ACM Multimedia 2022 Computational Paralinguistics Challenge: Vocalisations, Stuttering, Activity, & Mosquitoes. (arXiv:2205.06799v1 [cs.SD])
    The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the Vocalisations and Stuttering Sub-Challenges, a classification on human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audio human activity recognition from smartwatch sensor data; and in the Mosquitoes Sub-Challenge, mosquitoes need to be detected. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the usual ComPaRE and BoAW features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectRum toolkit; in addition, we add end-to-end sequential modelling, and a log-mel-128-BNN.
    Toward A Formalized Approach for Spike Sorting Algorithms and Hardware Evaluation. (arXiv:2205.06514v1 [cs.LG])
    Spike sorting algorithms are used to separate extracellular recordings of neuronal populations into single-unit spike activities. The development of customized hardware implementing spike sorting algorithms is burgeoning. However, there is a lack of a systematic approach and a set of standardized evaluation criteria to facilitate direct comparison of both software and hardware implementations. In this paper, we formalize a set of standardized criteria and a publicly available synthetic dataset entitled Synthetic Simulations Of Extracellular Recordings (SSOER), which was constructed by aggregating existing synthetic datasets with varying Signal-To-Noise Ratios (SNRs). Furthermore, we present a benchmark for future comparison, and use our criteria to evaluate a simulated Resistive Random-Access Memory (RRAM) In-Memory Computing (IMC) system using the Discrete Wavelet Transform (DWT) for feature extraction. Our system consumes approximately (per channel) 10.72mW and occupies an area of 0.66mm$^2$ in a 22nm FDSOI Complementary Metal-Oxide-Semiconductor (CMOS) process.
    NN-EUCLID: deep-learning hyperelasticity without stress data. (arXiv:2205.06664v1 [cs.LG])
    We propose a new approach for unsupervised learning of hyperelastic constitutive laws with physics-consistent deep neural networks. In contrast to supervised learning, which assumes the availability of stress-strain pairs, the approach only uses realistically measurable full-field displacement and global reaction force data, thus it lies within the scope of our recent framework for Efficient Unsupervised Constitutive Law Identification and Discovery (EUCLID) and we denote it as NN-EUCLID. The absence of stress labels is compensated for by leveraging a physics-motivated loss function based on the conservation of linear momentum to guide the learning process. The constitutive model is based on input-convex neural networks, which are capable of learning a function that is convex with respect to its inputs. By employing a specially designed neural network architecture, multiple physical and thermodynamic constraints for hyperelastic constitutive laws, such as material frame indifference, (poly-)convexity, and stress-free reference configuration are automatically satisfied. We demonstrate the ability of the approach to accurately learn several hidden isotropic and anisotropic hyperelastic constitutive laws - including e.g., Mooney-Rivlin, Arruda-Boyce, Ogden, and Holzapfel models - without using stress data. For anisotropic hyperelasticity, the unknown anisotropic fiber directions are automatically discovered jointly with the constitutive model. The neural network-based constitutive models show good generalization capability beyond the strain states observed during training and are readily deployable in a general finite element framework for simulating complex mechanical boundary value problems with good accuracy.  ( 2 min )
    PoisonedEncoder: Poisoning the Unlabeled Pre-training Data in Contrastive Learning. (arXiv:2205.06401v1 [cs.CR])
    Contrastive learning pre-trains an image encoder using a large amount of unlabeled data such that the image encoder can be used as a general-purpose feature extractor for various downstream tasks. In this work, we propose PoisonedEncoder, a data poisoning attack to contrastive learning. In particular, an attacker injects carefully crafted poisoning inputs into the unlabeled pre-training data, such that the downstream classifiers built based on the poisoned encoder for multiple target downstream tasks simultaneously classify attacker-chosen, arbitrary clean inputs as attacker-chosen, arbitrary classes. We formulate our data poisoning attack as a bilevel optimization problem, whose solution is the set of poisoning inputs; and we propose a contrastive-learning-tailored method to approximately solve it. Our evaluation on multiple datasets shows that PoisonedEncoder achieves high attack success rates while maintaining the testing accuracy of the downstream classifiers built upon the poisoned encoder for non-attacker-chosen inputs. We also evaluate five defenses against PoisonedEncoder, including one pre-processing, three in-processing, and one post-processing defenses. Our results show that these defenses can decrease the attack success rate of PoisonedEncoder, but they also sacrifice the utility of the encoder or require a large clean pre-training dataset.  ( 2 min )
    OFedQIT: Communication-Efficient Online Federated Learning via Quantization and Intermittent Transmission. (arXiv:2205.06491v1 [cs.LG])
    Online federated learning (OFL) is a promising framework to collaboratively learn a sequence of non-linear functions (or models) from distributed streaming data incoming to multiple clients while keeping the privacy of their local data. In this framework, we first construct a vanilla method (named OFedAvg) by incorporating online gradient descent (OGD) into the de facto aggregation method (named FedAvg). Despite its optimal asymptotic performance, OFedAvg suffers from heavy communication overhead and long learning delay. To tackle these shortcomings, we propose a communication-efficient OFL algorithm (named OFedQIT) by means of stochastic quantization and intermittent transmission. Our major contribution is to theoretically prove that OFedQIT over $T$ time slots can achieve an optimal sublinear regret bound $\mathcal{O}(\sqrt{T})$ for any real data (including non-IID data) while significantly reducing the communication overhead. Furthermore, this optimality is still guaranteed even when only a small fraction of clients (having faster processing time and high-quality communication channel) in a network participate at once. Our analysis reveals that OFedQIT successfully addresses the drawbacks of OFedAvg while maintaining superior learning accuracy. Experiments with real datasets demonstrate the effectiveness of our algorithm on various online classification and regression tasks.
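    The two communication-saving ingredients named in this abstract, stochastic quantization and intermittent transmission, can be sketched on top of a plain online gradient descent step. This is a minimal sketch and not the OFedQIT algorithm itself: the uniform grid with spacing `step`, the fixed transmission `period`, and all function names are assumptions made for illustration.

```python
import math
import random

def stochastic_quantize(vec, step, rng=random):
    """Round each entry to a multiple of `step`, up or down at random,
    with probabilities chosen so the result is unbiased in expectation."""
    out = []
    for v in vec:
        low = math.floor(v / step)
        p_up = v / step - low  # fractional position in [0, 1)
        level = low + (1 if rng.random() < p_up else 0)
        out.append(level * step)
    return out

def ogd_step(weights, grad, lr):
    """One online gradient descent update."""
    return [w - lr * g for w, g in zip(weights, grad)]

def should_transmit(t, period):
    """Intermittent transmission: upload only every `period`-th time slot."""
    return t % period == 0
```

    A client would call `ogd_step` in every time slot and, only when `should_transmit(t, period)` holds, send `stochastic_quantize(weights, step)` to the server. Because the rounding is unbiased, the quantization error averages out across clients and rounds rather than accumulating as a systematic shift.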
    Convergence Analysis of Deep Residual Networks. (arXiv:2205.06571v1 [cs.LG])
    Various powerful deep neural network architectures have contributed greatly to the exciting successes of deep learning in the past two decades. Among them, deep Residual Networks (ResNets) are of particular importance because they demonstrated great usefulness in computer vision by winning first place in many deep learning competitions. Also, ResNets were the first class of neural networks in the development history of deep learning that are really deep. It is of mathematical interest and practical significance to understand the convergence of deep ResNets. We aim at characterizing the convergence of deep ResNets as the depth tends to infinity in terms of the parameters of the networks. To this end, we first give a matrix-vector description of general deep neural networks with shortcut connections and formulate an explicit expression for the networks by using the notions of activation domains and activation matrices. The convergence is then reduced to the convergence of two series involving infinite products of non-square matrices. By studying the two series, we establish a sufficient condition for pointwise convergence of ResNets. Our result provides a justification for the design of ResNets. We also conduct experiments on benchmark machine learning data to verify our results.  ( 2 min )
    Convergence of Deep Neural Networks with General Activation Functions and Pooling. (arXiv:2205.06570v1 [cs.LG])
    Deep neural networks, as a powerful system to represent high dimensional complex functions, play a key role in deep learning. Convergence of deep neural networks is a fundamental issue in building the mathematical foundation for deep learning. We investigated the convergence of deep ReLU networks and deep convolutional neural networks in two recent studies (arXiv:2107.12530, 2109.13542). Only the Rectified Linear Unit (ReLU) activation was studied therein, and the important pooling strategy was not considered. In the current work, we study the convergence of deep neural networks as the depth tends to infinity for two other important activation functions: the leaky ReLU and the sigmoid function. Pooling is also studied. As a result, we prove that the sufficient condition established in arXiv:2107.12530, 2109.13542 is still sufficient for the leaky ReLU networks. For contractive activation functions such as the sigmoid function, we establish a weaker sufficient condition for uniform convergence of deep neural networks.  ( 2 min )
    Test-time Fourier Style Calibration for Domain Generalization. (arXiv:2205.06427v1 [cs.CV])
    The topic of generalizing machine learning models learned on a collection of source domains to unknown target domains is challenging. While many domain generalization (DG) methods have achieved promising results, they primarily rely on the source domains at train-time without manipulating the target domains at test-time. Thus, it is still possible that those methods can overfit to source domains and perform poorly on target domains. Driven by the observation that domains are strongly related to styles, we argue that reducing the gap between source and target styles can boost models' generalizability. To solve the dilemma of having no access to the target domain during training, we introduce Test-time Fourier Style Calibration (TF-Cal) for calibrating the target domain style on the fly during testing. To access styles, we utilize Fourier transformation to decompose features into amplitude (style) features and phase (semantic) features. Furthermore, we present an effective technique to Augment Amplitude Features (AAF) to complement TF-Cal. Extensive experiments on several popular DG benchmarks and a segmentation dataset for medical images demonstrate that our method outperforms state-of-the-art methods.  ( 2 min )
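    The amplitude/phase split mentioned in the abstract above can be sketched with a 2-D FFT. This is only an illustration of the decomposition itself, not of TF-Cal's calibration step or of AAF; the helper names (`amplitude_phase`, `recombine`, `swap_style`) and the idea of swapping amplitudes between two feature maps are assumptions chosen to show how amplitude can carry "style" while phase is kept.

```python
import numpy as np


def amplitude_phase(feature_map):
    """Split a (..., H, W) real array into Fourier amplitude and phase."""
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    return np.abs(spectrum), np.angle(spectrum)


def recombine(amplitude, phase):
    """Rebuild a real array from amplitude and phase (inverse of the split)."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real


def swap_style(content, style):
    """Hypothetical helper: keep the phase of `content`, take the amplitude of `style`."""
    _, content_phase = amplitude_phase(content)
    style_amplitude, _ = amplitude_phase(style)
    return recombine(style_amplitude, content_phase)
```

    Splitting and recombining the same array returns it unchanged up to floating-point error, so any change in the output of `swap_style` comes purely from the replaced amplitude.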
    Deep Learning for Prawn Farming: Forecasting and Anomaly Detection. (arXiv:2205.06359v1 [cs.LG])
    We present a decision support system for managing water quality in prawn ponds. The system uses various sources of data and deep learning models in a novel way to provide 24-hour forecasting and anomaly detection of water quality parameters. It provides prawn farmers with tools to proactively avoid a poor growing environment, thereby optimising growth and reducing the risk of losing stock. This is a major shift for farmers who are forced to manage ponds by reactively correcting poor water quality conditions. To our knowledge, we are the first to apply Transformer as an anomaly detection model, and the first to apply anomaly detection in general to this aquaculture problem. Our technical contributions include adapting ForecastNet for multivariate data and adapting Transformer and the Attention model to incorporate weather forecast data into their decoders. We attain an average mean absolute percentage error of 12% for dissolved oxygen forecasts and we demonstrate two anomaly detection case studies. The system is successfully running in its second year of deployment on a commercial prawn farm.  ( 2 min )
    StyLandGAN: A StyleGAN based Landscape Image Synthesis using Depth-map. (arXiv:2205.06611v1 [cs.CV])
    Despite recent success in conditional image synthesis, prevalent input conditions such as semantics and edges are not clear enough to express 'Linear (Ridges)' and 'Planar (Scale)' representations. To address this problem, we propose a novel framework, StyLandGAN, which synthesizes desired landscape images using a depth map, which has higher expressive power. Our StyLandGAN is extended from the unconditional generation model to accept input conditions. We also propose a '2-phase inference' pipeline which generates diverse depth maps and shifts local parts so that it can easily reflect the user's intent. For comparison, we modified the existing semantic image synthesis models to accept a depth map as well. Experimental results show that our method is superior to existing methods in quality, diversity, and depth-accuracy.  ( 2 min )
    KASAM: Spline Additive Models for Function Approximation. (arXiv:2205.06376v1 [cs.LG])
    Neural networks have been criticised for their inability to perform continual learning due to catastrophic forgetting and rapid unlearning of a past concept when a new concept is introduced. Catastrophic forgetting can be alleviated by specifically designed models and training techniques. This paper outlines a novel Spline Additive Model (SAM). SAM exhibits intrinsic memory retention with sufficient expressive power for many practical tasks, but is not a universal function approximator. SAM is extended with the Kolmogorov-Arnold representation theorem to a novel universal function approximator, called the Kolmogorov-Arnold Spline Additive Model - KASAM. The memory retention, expressive power and limitations of SAM and KASAM are illustrated analytically and empirically. SAM exhibited robust but imperfect memory retention, with small regions of overlapping interference in sequential learning tasks. KASAM exhibited greater susceptibility to catastrophic forgetting. KASAM in combination with pseudo-rehearsal training techniques exhibited superior performance in regression tasks and memory retention.  ( 2 min )
    Interpretable Climate Change Modeling With Progressive Cascade Networks. (arXiv:2205.06351v1 [cs.LG])
    Typical deep learning approaches to modeling high-dimensional data often result in complex models that do not easily reveal a new understanding of the data. Research in the deep learning field is very actively pursuing new methods to interpret deep neural networks and to reduce their complexity. An approach is described here that starts with linear models and incrementally adds complexity only as supported by the data. An application is shown in which models that map global temperature and precipitation to years are trained to investigate patterns associated with changes in climate.  ( 2 min )
    A hybrid data driven-physics constrained Gaussian process regression framework with deep kernel for uncertainty quantification. (arXiv:2205.06494v1 [cs.LG])
    Gaussian process regression (GPR) is a well-known machine learning method for various applications such as uncertainty quantification (UQ). However, GPR is inherently a data-driven method, which requires a sufficiently large dataset. If appropriate physics constraints (e.g. expressed in partial differential equations) can be incorporated, the amount of data can be greatly reduced and the accuracy further improved. In this work, we propose a hybrid data driven-physics constrained Gaussian process regression framework. We encode the physics knowledge with a Boltzmann-Gibbs distribution and derive our model through a maximum likelihood (ML) approach. We apply the deep kernel learning method. The proposed model learns from both data and physics constraints through the training of a deep neural network, which serves as part of the covariance function in GPR. The proposed model achieves good results in high-dimensional problems, and correctly propagates the uncertainty, with very limited labelled data provided.  ( 2 min )
    Warm-starting DARTS using meta-learning. (arXiv:2205.06355v1 [cs.LG])
    Neural architecture search (NAS) has shown great promise in the field of automated machine learning (AutoML). NAS has outperformed hand-designed networks and made a significant step forward in the field of automating the design of deep neural networks, thus further reducing the need for human expertise. However, most research is done targeting a single specific task, leaving research of NAS methods over multiple tasks mostly overlooked. Generally, there exist two popular ways to find an architecture for some novel task. Either searching from scratch, which is ineffective by design, or transferring discovered architectures from other tasks, which provides no performance guarantees and is probably not optimal. In this work, we present a meta-learning framework to warm-start Differentiable architecture search (DARTS). DARTS is a NAS method that can be initialized with a transferred architecture and is able to quickly adapt to new tasks. A task similarity measure is used to determine which transfer architecture is selected, as transfer architectures found on similar tasks will likely perform better. Additionally, we employ a simple meta-transfer architecture that was learned over multiple tasks. Experiments show that warm-started DARTS is able to find competitive performing architectures while reducing searching costs on average by 60%.  ( 2 min )
    How to Combine Membership-Inference Attacks on Multiple Updated Models. (arXiv:2205.06369v1 [cs.LG])
    A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time, on data that often shifts in distribution, giving the attacker more information. This paper proposes new attacks that take advantage of one or more model updates to improve MI. A key part of our approach is to leverage rich information from standalone MI attacks mounted separately against the original and updated models, and to combine this information in specific ways to improve attack effectiveness. We propose a set of combination functions and tuning methods for each, and present both analytical and quantitative justification for various options. Our results on four public datasets show that our attacks are effective at using update information to give the adversary a significant advantage over attacks on standalone models, but also compared to a prior MI attack that takes advantage of model updates in a related machine-unlearning setting. We perform the first measurements of the impact of distribution shift on MI attacks with model updates, and show that a more drastic distribution shift results in significantly higher MI risk than a gradual shift. Our code is available at https://www.github.com/stanleykywu/model-updates.  ( 2 min )
    Using Natural Sentences for Understanding Biases in Language Models. (arXiv:2205.06303v1 [cs.CL])
    Evaluation of biases in language models is often limited to synthetically generated datasets. This dependence traces back to the need for a prompt-style dataset to trigger specific behaviors of language models. In this paper, we address this gap by creating a prompt dataset with respect to occupations collected from real-world natural sentences present in Wikipedia. We aim to understand the differences between using template-based prompts and natural sentence prompts when studying gender-occupation biases in language models. We find bias evaluations are very sensitive to the design choices of template prompts, and we propose using natural sentence prompts for systematic evaluations to step away from design choices that could introduce bias in the observations.  ( 2 min )
    Adaptive Block Floating-Point for Analog Deep Learning Hardware. (arXiv:2205.06287v1 [cs.LG])
    Analog mixed-signal (AMS) devices promise faster, more energy-efficient deep neural network (DNN) inference than their digital counterparts. However, recent studies show that DNNs on AMS devices with fixed-point numbers can incur an accuracy penalty because of precision loss. To mitigate this penalty, we present a novel AMS-compatible adaptive block floating-point (ABFP) number representation. We also introduce amplification (or gain) as a method for increasing the accuracy of the number representation without increasing the bit precision of the output. We evaluate the effectiveness of ABFP on the DNNs in the MLPerf datacenter inference benchmark -- realizing less than $1\%$ loss in accuracy compared to FLOAT32. We also propose a novel method of finetuning for AMS devices, Differential Noise Finetuning (DNF), which samples device noise to speed up finetuning compared to conventional Quantization-Aware Training.  ( 2 min )
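    For readers unfamiliar with block floating-point, the sketch below shows the plain (non-adaptive) idea: every block of values shares one power-of-two scale, and only low-precision signed mantissas are stored per value. It does not reproduce the adaptive scheme, the gain mechanism, or DNF from the abstract above; the function name `block_float_quantize` and the rounding convention are assumptions for illustration.

```python
import numpy as np


def block_float_quantize(x, block_size, mantissa_bits):
    """Quantize x with one shared power-of-two scale per block (illustrative).

    Assumes len(x) is a multiple of block_size. Each value keeps a signed
    mantissa with `mantissa_bits` bits (one of them is the sign).
    """
    x = np.asarray(x, dtype=float)
    if x.size % block_size != 0:
        raise ValueError("x.size must be a multiple of block_size")
    blocks = x.reshape(-1, block_size)
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Shared exponent: smallest power of two that covers the block maximum.
    safe_max = np.where(max_abs > 0, max_abs, 1.0)
    scale = 2.0 ** np.ceil(np.log2(safe_max))
    levels = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa value
    mantissas = np.round(blocks / scale * levels)
    return (mantissas / levels * scale).reshape(x.shape)
```

    With this convention the worst-case rounding error inside a block is `scale / (2 * levels)`, so a block dominated by one large value loses precision on its small values, which is the kind of accuracy penalty that motivates choosing the scale per block.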
    $\alpha$-GAN: Convergence and Estimation Guarantees. (arXiv:2205.06393v1 [cs.LG])
    We prove a two-way correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences. We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs (Hellinger, vanilla, Total Variation) and corresponds to the minimization of the Arimoto divergence. We show that the Arimoto divergences induced by $\alpha$-GAN equivalently converge, for all $\alpha\in \mathbb{R}_{>0}\cup\{\infty\}$. However, under restricted learning models and finite samples, we provide estimation bounds which indicate diverse GAN behavior as a function of $\alpha$. Finally, we present empirical results on a toy dataset that highlight the practical utility of tuning the $\alpha$ hyperparameter.  ( 2 min )
    Integrating User and Item Reviews in Deep Cooperative Neural Networks for Movie Recommendation. (arXiv:2205.06296v1 [cs.IR])
    User evaluations include a significant quantity of information across online platforms. This information source has been neglected by the majority of existing recommendation systems, despite its potential to ease the sparsity issue and enhance the quality of suggestions. This work presents a deep model for concurrently learning item attributes and user behaviour from review text. Deep Cooperative Neural Networks (DeepCoNN) is the suggested model consisting of two parallel neural networks connected in their final layers. One of the networks focuses on learning user behaviour from reviews submitted by the user, while the other network learns item attributes from user reviews. On top, a shared layer is added to connect these two networks. Similar to factorization machine approaches, the shared layer allows latent factors acquired for people and things to interact with each other. On a number of datasets, DeepCoNN surpasses all baseline recommendation systems, according to experimental findings.  ( 2 min )
    Collaborative Multi-agent Stochastic Linear Bandits. (arXiv:2205.06331v1 [cs.LG])
    We study a collaborative multi-agent stochastic linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret. In this setting, each agent has its own linear bandit problem (its own reward parameter) and the goal is to select the best global action w.r.t. the average of their reward parameters. At each round, each agent proposes an action, and one action is randomly selected and played as the network action. All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents. We propose a distributed upper confidence bound (UCB) algorithm and prove a high probability bound on its $T$-round regret in which we include a linear growth of regret associated with each communication round. Our regret bound is of order $\mathcal{O}\Big(\sqrt{\frac{T}{N \log(1/|\lambda_2|)}}\cdot (\log T)^2\Big)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix.  ( 2 min )
    Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations. (arXiv:2205.06333v1 [cs.RO])
    Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task that are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object agnostic, and we demonstrate that the resulting representations are insufficient for general purpose robotics tasks as they fail to capture the complexity of scenes with many components. In this paper, we explore the effectiveness of using object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object agnostic techniques as well as methods trained on raw RGB images. Our results show a 20 percent increase in performance in low data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines for the task of object localization in multi-object scenes.  ( 2 min )
    Multi-Environment Meta-Learning in Stochastic Linear Bandits. (arXiv:2205.06326v1 [cs.LG])
    In this work we investigate meta-learning (or learning-to-learn) approaches in multi-task linear stochastic bandit problems that can originate from multiple environments. Inspired by the work of [1] on meta-learning in a sequence of linear bandit problems whose parameters are sampled from a single distribution (i.e., a single environment), here we consider the feasibility of meta-learning when task parameters are drawn from a mixture distribution instead. For this problem, we propose a regularized version of the OFUL algorithm that, when trained on tasks with labeled environments, achieves low regret on a new task without requiring knowledge of the environment from which the new task originates. Specifically, our regret bound for the new algorithm captures the effect of environment misclassification and highlights the benefits over learning each task separately or meta-learning without recognition of the distinct mixture components.  ( 2 min )
    Detailed Balanced Chemical Reaction Networks as Generalized Boltzmann Machines. (arXiv:2205.06313v1 [q-bio.MN])
    Can a micron sized sack of interacting molecules understand, and adapt to a constantly-fluctuating environment? Cellular life provides an existence proof in the affirmative, but the principles that allow for life's existence are far from being proven. One challenge in engineering and understanding biochemical computation is the intrinsic noise due to chemical fluctuations. In this paper, we draw insights from machine learning theory, chemical reaction network theory, and statistical physics to show that the broad and biologically relevant class of detailed balanced chemical reaction networks is capable of representing and conditioning complex distributions. These results illustrate how a biochemical computer can use intrinsic chemical noise to perform complex computations. Furthermore, we use our explicit physical model to derive thermodynamic costs of inference.  ( 2 min )
    Design and Implementation of a Quantum Kernel for Natural Language Processing. (arXiv:2205.06409v1 [cs.CL])
    Natural language processing (NLP) is the field that attempts to make human language accessible to computers, and it relies on applying a mathematical model to express the meaning of symbolic language. One such model, DisCoCat, defines how to express both the meaning of individual words as well as their compositional nature. This model can be naturally implemented on quantum computers, leading to the field of quantum NLP (QNLP). Recent experimental work used quantum machine learning techniques to map from text to class label using the expectation value of the quantum encoded sentence. Theoretical work has been done on computing the similarity of sentences but relies on an unrealized quantum memory store. The main goal of this thesis is to leverage the DisCoCat model to design a quantum-based kernel function that can be used by a support vector machine (SVM) for NLP tasks. Two similarity measures were studied: (i) the transition amplitude approach and (ii) the SWAP test. A simple NLP meaning classification task from previous work was used to train the word embeddings and evaluate the performance of both models. The Python module lambeq and its related software stack were used for implementation. The explicit model from previous work was used to train word embeddings and achieved a testing accuracy of $93.09 \pm 0.01$%. It was shown that both the SVM variants achieved a higher testing accuracy of $95.72 \pm 0.01$% for approach (i) and $97.14 \pm 0.01$% for (ii). The SWAP test was then simulated under a noise model defined by the real quantum device, ibmq_guadalupe. The explicit model achieved an accuracy of $91.94 \pm 0.01$% while the SWAP test SVM achieved 96.7% on the testing dataset, suggesting that the kernelized classifiers are resilient to noise. These are encouraging results and motivate further investigations of our proposed kernelized QNLP paradigm.  ( 2 min )
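    The SWAP test named in the abstract above estimates the overlap of two states: the ancilla is measured as 0 with probability $\frac{1}{2} + \frac{1}{2}|\langle a|b\rangle|^2$. The sketch below computes that probability classically from state vectors and mimics finite-shot sampling. It is a toy classical illustration, not the thesis's lambeq-based implementation; the function names and the binomial shot model are assumptions.

```python
import numpy as np


def swap_test_p0(a, b):
    """Probability that the SWAP-test ancilla reads 0 for states a and b."""
    a = np.asarray(a, dtype=complex)
    b = np.asarray(b, dtype=complex)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    overlap = np.vdot(a, b)  # <a|b>, conjugating the first argument
    return float(np.clip(0.5 + 0.5 * abs(overlap) ** 2, 0.0, 1.0))


def swap_test_kernel(a, b, shots, rng):
    """Estimate |<a|b>|^2 from `shots` simulated ancilla measurements."""
    zeros = rng.binomial(shots, swap_test_p0(a, b))
    # Invert p0 = 1/2 + k/2; clip at 0 because sampling noise can undershoot.
    return max(0.0, 2.0 * zeros / shots - 1.0)
```

    A Gram matrix filled with such estimates for every pair of encoded sentences could then be handed to an SVM that accepts a precomputed kernel, which is one plausible way to read the kernelized classifier described above.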
    Improving Sequential Query Recommendation with Immediate User Feedback. (arXiv:2205.06297v1 [cs.IR])
    We propose an algorithm for next query recommendation in interactive data exploration settings, like knowledge discovery for information gathering. The state-of-the-art query recommendation algorithms are based on sequence-to-sequence learning approaches that exploit historical interaction data. We propose to augment the transformer-based causal language models for query recommendations to adapt to the immediate user feedback using a multi-armed bandit (MAB) framework. We conduct a large-scale experimental study using log files from a popular online literature discovery service and demonstrate that our algorithm improves the cumulative regret substantially, with respect to the state-of-the-art transformer-based query recommendation models, which do not make use of the immediate user feedback. Our data model and source code are available at https://anonymous.4open.science/r/exp3_ss-9985/.  ( 2 min )
    Modularity in NEAT Reinforcement Learning Networks. (arXiv:2205.06451v1 [cs.NE])
    Modularity is essential to many well-performing structured systems, as it is a useful means of managing complexity [8]. An analysis of modularity in neural networks produced by machine learning algorithms can offer valuable insight into the workings of such algorithms and how modularity can be leveraged to improve performance. However, this property is often overlooked in the neuroevolutionary literature, so the modular nature of many learning algorithms is unknown. This property was assessed on the popular algorithm "NeuroEvolution of Augmenting Topologies" (NEAT) for standard simulation benchmark control problems due to NEAT's ability to optimise network topology. This paper shows that NEAT networks seem to rapidly increase in modularity over time with the rate and convergence dependent on the problem. Interestingly, NEAT tends towards increasingly modular networks even when network fitness converges. It was shown that the ideal level of network modularity in the explored parameter space is highly dependent on other network variables, dispelling theories that modularity has a straightforward relationship to network performance. This is further proven in this paper by demonstrating that rewarding modularity directly did not improve fitness.  ( 2 min )
  • Open

    On the Existence of Simpler Machine Learning Models. (arXiv:1908.01755v4 [cs.LG] UPDATED)
    It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.
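    The Rashomon ratio described in the abstract above is a ratio of volumes, so it can be approximated by sampling. The sketch below is a rough Monte Carlo illustration under several assumptions that are not from the paper: the hypothesis space is linear classifiers with weights in the box $[-1, 1]^d$, "accurate" means 0-1 loss within `theta` of the best sampled model (a stand-in for the true minimum), and the name `rashomon_ratio_mc` is hypothetical.

```python
import numpy as np


def rashomon_ratio_mc(X, y, theta, num_samples, rng):
    """Monte Carlo estimate of the fraction of models within theta of the best.

    X: (n, d) features, y: (n,) labels in {-1, +1}. Models are linear
    classifiers sign(X @ w) with w sampled uniformly from [-1, 1]^d.
    """
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(num_samples, d))
    preds = np.sign(X @ W.T)  # shape (n, num_samples)
    losses = np.mean(preds != y[:, None], axis=0)  # 0-1 loss per sampled model
    best = losses.min()  # best *sampled* loss, an approximation of the optimum
    return float(np.mean(losses <= best + theta))
```

    A ratio close to 1 would suggest that almost any model in the class does about as well as the best one, which is the situation in which the abstract argues a simple model is likely to exist.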
    An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices. (arXiv:2205.06308v1 [math.PR])
    We consider random matrices whose entries are obtained by applying a (nonlinear) kernel function to the pairwise inner products between $n$ independent data vectors drawn uniformly from the unit sphere in $\mathbb{R}^d$. Our study of this model is motivated by problems in machine learning, statistics, and signal processing, where such inner-product kernel random matrices and their spectral properties play important roles. Under mild conditions on the kernel function, we establish the weak-limit of the empirical spectral distribution of these matrices when $d, n \to \infty$ such that $n / d^\ell \to \kappa \in (0, \infty)$, for some fixed $\ell \in \mathbb{N}$ and $\kappa \in \mathbb{R}$. This generalizes an earlier result of Cheng and Singer (2013), who studied the same model in the linear scaling regime (with $\ell = 1$ and $n/d \to \kappa$). The main insight of our work is a general equivalence principle: the spectrum of the random kernel matrix is asymptotically equivalent to that of a simpler matrix model, constructed as the linear combination of a (shifted) Wishart matrix and an independent matrix drawn from the Gaussian orthogonal ensemble. The aspect ratio of the Wishart matrix and the coefficients of the linear combination are determined by $\ell$ and by the expansion of the kernel function in the orthogonal Hermite polynomial basis. Consequently, the limiting spectrum of the random kernel matrix can be characterized as the free additive convolution between a Marchenko-Pastur law and a semicircle law.  ( 2 min )
    Heavy-Tail Phenomenon in Decentralized SGD. (arXiv:2205.06689v1 [stat.ML])
    Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to 'multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
    Probabilistic Estimation of Chirp Instantaneous Frequency Using Gaussian Processes. (arXiv:2205.06306v1 [stat.ML])
    We present a probabilistic approach for estimating a chirp signal and its instantaneous frequency function when the true forms of the chirp and instantaneous frequency are unknown. To do so, we represent them by joint cascading Gaussian processes governed by a non-linear stochastic differential equation, and estimate their posterior distribution by using stochastic filters and smoothers. The model parameters are determined via maximum likelihood estimation. Theoretical results show that the estimation method has a bounded mean squared error. Experiments show that the method outperforms a number of baseline methods on a synthetic model, and we also apply the method to analyse gravitational wave data.  ( 2 min )
    Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. (arXiv:2205.06811v1 [cs.LG])
    We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
    Fast Conditional Network Compression Using Bayesian HyperNetworks. (arXiv:2205.06404v1 [cs.LG])
    We introduce a conditional compression problem and propose a fast framework for tackling it. The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts, e.g. a context involving only a subset of classes or a context where only limited compute resource is available. To solve this, we propose an efficient Bayesian framework to compress a given large network into much smaller size tailored to meet each contextual requirement. We employ a hypernetwork to parameterize the posterior distribution of weights given conditional inputs and minimize a variational objective of this Bayesian neural network. To further reduce the network sizes, we propose a new input-output group sparsity factorization of weights to encourage more sparseness in the generated weights. Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.  ( 2 min )
    Weak consistency of the 1-nearest neighbor measure with applications to missing data. (arXiv:1902.02408v3 [math.ST] UPDATED)
    When data is partially missing at random, imputation and importance weighting are often used to estimate moments of the unobserved population. In this paper, we study 1-nearest neighbor (1NN) importance weighting, which estimates moments by replacing missing data with the complete data that is the nearest neighbor in the non-missing covariate space. We define an empirical measure, the 1NN measure, and show that it is weakly consistent for the measure of the missing data. The main idea behind this result is that the 1NN measure is performing inverse probability weighting in the limit. We study applications to missing data and mitigating the impact of covariate shift in prediction tasks.
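    As a rough illustration of the 1NN idea described in this abstract (not the authors' code), the sketch below estimates a mean over units with missing outcomes by substituting, for each such unit, the outcome of its nearest complete case in covariate space. The function name `one_nn_mean`, the data layout, and the use of Euclidean distance are assumptions made for the example.

```python
def one_nn_mean(complete, missing_covariates):
    """Estimate the mean outcome of the missing population by 1NN substitution.

    complete: list of (x, y) pairs, x a tuple of covariates, y the observed outcome.
    missing_covariates: list of covariate tuples whose outcomes are unobserved.
    Both arguments and the Euclidean metric are illustrative assumptions.
    """
    def sqdist(a, b):
        # Squared Euclidean distance between two covariate tuples.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    imputed = []
    for x in missing_covariates:
        # Outcome of the complete case closest to x in covariate space.
        _, y_nn = min(complete, key=lambda pair: sqdist(pair[0], x))
        imputed.append(y_nn)
    return sum(imputed) / len(imputed)


complete = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
missing = [(0.1,), (0.9,), (1.2,), (1.9,)]
estimate = one_nn_mean(complete, missing)
```

    Counting how often each complete case is chosen as a nearest neighbour gives the implied importance weights (here 1, 2, and 1 out of 4), which is the weighting view the abstract refers to.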
    Robust and Heterogenous Odds Ratio: Estimating Price Sensitivity for Unbought Items. (arXiv:2106.11389v2 [stat.ME] UPDATED)
    Problem definition: Mining for heterogeneous responses to an intervention is a crucial step for data-driven operations, for instance to personalize treatment or pricing. We investigate how to estimate price sensitivity from transaction-level data. In causal inference terms, we estimate heterogeneous treatment effects when (a) the response to treatment (here, whether a customer buys a product) is binary, and (b) treatment assignments are partially observed (here, full information is only available for purchased items). Methodology/Results: We propose a recursive partitioning procedure to estimate heterogeneous odds ratio, a widely used measure of treatment effect in medicine and social sciences. We integrate an adversarial imputation step to allow for robust estimation even in presence of partially observed treatment assignments. We validate our methodology on synthetic data and apply it to three case studies from political science, medicine, and revenue management. Managerial Implications: Our robust heterogeneous odds ratio estimation method is a simple and intuitive tool to quantify heterogeneity in patients or customers and personalize interventions, while lifting a central limitation in many revenue management data.
    secml: A Python Library for Secure and Explainable Machine Learning. (arXiv:1912.10013v2 [cs.LG] UPDATED)
    We present secml, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, secml provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. secml also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at https://github.com/pralab/secml.
    The interplay between ranking and communities in networks. (arXiv:2112.12670v2 [cs.SI] UPDATED)
    Community detection and hierarchy extraction are usually thought of as separate inference tasks on networks. Considering only one of the two when studying real-world data can be an oversimplification. In this work, we present a generative model based on an interplay between community and hierarchical structures. It assumes that each node has a preference in the interaction mechanism and nodes with the same preference are more likely to interact, while heterogeneous interactions are still allowed. The sparsity of the network is exploited for implementing a more efficient algorithm. We demonstrate our method on synthetic and real-world data and compare performance with two standard approaches for community detection and ranking extraction. We find that the algorithm accurately retrieves the overall node's preference in different scenarios, and we show that it can distinguish small subsets of nodes that behave differently than the majority. As a consequence, the model can recognize whether a network has an overall preferred interaction mechanism. This is relevant in situations where there is no clear "a priori" information about what structure explains the observed network datasets well. Our model allows practitioners to learn this automatically from the data.
    Anomaly Detection using Principles of Human Perception. (arXiv:2103.12323v4 [cs.CR] UPDATED)
    In the fields of statistics and unsupervised machine learning a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random variable modelling anomalies are directly found in a set of data by a uniform and independent random assumption of the distribution of constituent elements of the observations, with anomalies corresponding to those observations where the expectation of the number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data with promising results on the detection of global anomalies in multivariate data.
    $\alpha$-GAN: Convergence and Estimation Guarantees. (arXiv:2205.06393v1 [cs.LG])
    We prove a two-way correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences. We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs (Hellinger, vanilla, Total Variation) and corresponds to the minimization of the Arimoto divergence. We show that the Arimoto divergences induced by $\alpha$-GAN equivalently converge, for all $\alpha\in \mathbb{R}_{>0}\cup\{\infty\}$. However, under restricted learning models and finite samples, we provide estimation bounds which indicate diverse GAN behavior as a function of $\alpha$. Finally, we present empirical results on a toy dataset that highlight the practical utility of tuning the $\alpha$ hyperparameter.
    Differentiable Graph Module (DGM) for Graph Convolutional Networks. (arXiv:2002.04999v4 [cs.LG] UPDATED)
    Graph deep learning has recently emerged as a powerful ML concept allowing to generalize successful deep neural architectures to non-Euclidean structured data. Such methods have shown promising results on a broad spectrum of applications ranging from social science, biomedicine, and particle physics to computer vision, graphics, and chemistry. One of the limitations of the majority of current graph neural network architectures is that they are often restricted to the transductive setting and rely on the assumption that the underlying graph is known and fixed. Often, this assumption is not true since the graph may be noisy, or partially and even completely unknown. In such cases, it would be helpful to infer the graph directly from the data, especially in inductive settings where some nodes were not present in the graph at training time. Furthermore, learning a graph may become an end in itself, as the inferred structure may provide complementary insights next to the downstream task. In this paper, we introduce Differentiable Graph Module (DGM), a learnable function that predicts edge probabilities in the graph which are optimal for the downstream task. DGM can be combined with convolutional graph neural network layers and trained in an end-to-end fashion. We provide an extensive evaluation of applications from the domains of healthcare (disease prediction), brain imaging (age prediction), computer graphics (3D point cloud segmentation), and computer vision (zero-shot learning). We show that our model provides a significant improvement over baselines both in transductive and inductive settings and achieves state-of-the-art results.
    Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets. (arXiv:2205.06595v1 [stat.ML])
    Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
    Variational Hyper-Encoding Networks. (arXiv:2005.08482v2 [stat.ML] UPDATED)
    We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters \theta are drawn from a distribution p(\theta) which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters \theta into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution q(\theta). HyperVAE can encode the parameters \theta in full, in contrast to common hyper-network practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
    Multiple Domain Causal Networks. (arXiv:2205.06791v1 [stat.ML])
    Observational studies are regarded as economic alternatives to randomized trials, often used in their stead to investigate and determine treatment efficacy. Due to lack of sample size, observational studies commonly combine data from multiple sources or different sites/centers. Despite the benefits of an increased sample size, a naive combination of multicenter data may result in incongruities stemming from center-specific protocols for generating cohorts or reactions towards treatments distinct to a given center, among other things. These issues arise in a variety of other contexts, including capturing a treatment effect related to an individual's unique biological characteristics. Existing methods for estimating heterogeneous treatment effects have not adequately addressed the multicenter context, but rather treat it simply as a means to obtain sufficient sample size. Additionally, previous approaches to estimating treatment effects do not straightforwardly generalize to the multicenter design, especially when required to provide treatment insights for patients from a new, unobserved center. To address these shortcomings, we propose Multiple Domain Causal Networks (MDCN), an approach that simultaneously strengthens the information sharing between similar centers while addressing the selection bias in treatment assignment through learning of a new feature embedding. In empirical evaluations, MDCN is consistently more accurate when estimating the heterogeneous treatment effect in new centers compared to benchmarks that adjust solely based on treatment imbalance or general center differences. Finally, we justify our approach by providing theoretical analyses that demonstrate that MDCN improves on the generalization bound of the new, unobserved target center.
    Clustering with missing data: which equivalent for Rubin's rules?. (arXiv:2011.13694v2 [stat.ME] UPDATED)
    Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way for applying clustering after MI remains unclear: how to pool partitions? How to assess the clustering instability when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is here addressed using consensus clustering while, based on the bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and instability assessment are theoretically argued and extensively studied by simulation. Pooling partitions improves accuracy, while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way for choosing the number of clusters when data are incomplete, as illustrated on a real data set.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v1 [stat.ML])
    The rapid progress of machine learning interatomic potentials over the past couple of years produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state of the art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v1 [stat.ML])
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
    Explaining by Removing: A Unified Framework for Model Explanation. (arXiv:2011.14878v2 [cs.LG] UPDATED)
    Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
    Improving Sequential Query Recommendation with Immediate User Feedback. (arXiv:2205.06297v1 [cs.IR])
    We propose an algorithm for next query recommendation in interactive data exploration settings, like knowledge discovery for information gathering. The state-of-the-art query recommendation algorithms are based on sequence-to-sequence learning approaches that exploit historical interaction data. We propose to augment the transformer-based causal language models for query recommendations to adapt to the immediate user feedback using a multi-armed bandit (MAB) framework. We conduct a large-scale experimental study using log files from a popular online literature discovery service and demonstrate that our algorithm improves the cumulative regret substantially, with respect to the state-of-the-art transformer-based query recommendation models, which do not make use of the immediate user feedback. Our data model and source code are available at https://anonymous.4open.science/r/exp3_ss-9985/.
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v8 [cs.LG] UPDATED)
    In this paper, we propose a novel neural exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of works explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural bandit algorithms have been proposed, where a neural network is used to learn the underlying reward function and TS or UCB are adapted for exploration. Instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose "EE-Net", a novel neural-based exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn potential gains compared to the currently estimated reward for exploration. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret and show that EE-Net outperforms existing linear and neural contextual bandit baselines on real-world datasets.
    Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees. (arXiv:2203.04341v2 [stat.ME] UPDATED)
    A new Bayesian modelling framework is introduced for piece-wise homogeneous variable-memory Markov chains, along with a collection of effective algorithmic tools for change-point detection and segmentation of discrete time series. Building on the recently introduced Bayesian Context Trees (BCT) framework, the distributions of different segments in a discrete time series are described as variable-memory Markov chains. Inference for the presence and location of change-points is then performed via Markov chain Monte Carlo sampling. The key observation that facilitates effective sampling is that, using one of the BCT algorithms, the prior predictive likelihood of the data can be computed exactly, integrating out all the models and parameters in each segment. This makes it possible to sample directly from the posterior distribution of the number and location of the change-points, leading to accurate estimates and providing a natural quantitative measure of uncertainty in the results. Estimates of the actual model in each segment can also be obtained, at essentially no additional computational cost. Results on both simulated and real-world data indicate that the proposed methodology performs better than or as well as state-of-the-art techniques.

  • Open

    [D] Multi-modal information sharing strategies when one "modality" is tabular data.
    I'm hunting for literature on architectures for multimodal autoencoders when one "modality" is tabular data. Keyword searches aren't getting me what I want though. I have a mixture of sequential, audio, image, and tabular data. For the non-tabular modalities it seems like cross attention in the encoder/decoder is the obvious choice, but I'm not sure how to best incorporate the tabular component. Options seem like 1) use a standard ffn on the tabular data and connect it to the shared latent space. But this seems not ideal if there are feature interactions between tabular and other modalities. Or, 2) put tabular data through a ffn and repeat and concatenate it to the input of each other modality. Then add some pooling layers in the decoder then a fc layer for reconstruction of the tabular data. This seems inefficient because you replicate the tabular data across modalities. Is there anything as elegant as cross attention but for cross modal fusion with tabular data? submitted by /u/WigglyHypersurface [link] [comments]  ( 1 min )
    [D] Any primer on all the major image generators and how they compare to each other?
    It seems every other week another image generator comes out. Mostly countless Dalle versions at this point. ru-Dalle, mini-Dalle, dalle flow, dalle mega. Other than that the original Dalle and dalle 2 are from Open AI and presumably are better than all the others, I know very little about them. Is there any consolidated primer on what are the major/most promising image generators, what makes them unique, what source/interface are available for them and how they stack up against each other, parameters etc? Doesn't need to be superdetailed. Just a short orientation. submitted by /u/cyborgsnowflake [link] [comments]  ( 1 min )
    [D] Is the standard path of a researcher not effective anymore?
    Hello All! I am currently a master's student. I want to work as an AI researcher, so the typical path I know of is pursuing a Ph.D. and then trying to join a respected research lab. However, I have stumbled upon a couple of discussions in blog posts, Twitter, and in this subreddit suggesting that the current progress of those labs is more engineering than science; most current research papers, including ones at top conferences, aim at the incremental improvement of current methodologies instead of breakthroughs and without any understanding of why these models work. A reason for this appears to be that research in deep learning is mainly influenced by giant tech companies that favor short-term progress over long-term progress. Another one appears to be the broken incentive structures in research in general; the bad influence of these broken incentives has been amplified in this field, leading researchers to become paper producers rather than progress producers. So, am I missing the big picture? Are there other paths? Are the incentive structures so broken that someone could make better progress by doing research on their own or by joining open labs like ML Collective? submitted by /u/TryingToGeek [link] [comments]  ( 2 min )
    [R] Locating and Editing Factual Knowledge in GPT
    https://i.redd.it/rktep95sdpz81.gif Here's an interesting paper from MIT CSAIL on a method to edit knowledge within language models — sets up a nice framework for knowledge editing and representation, and a reference dataset to test facts within a model. Also has full code & colab notebooks to edit your own facts within GPT: https://rome.baulab.info/ + https://github.com/kmeng01/rome + https://colab.research.google.com/github/kmeng01/rome/blob/main/notebooks/rome.ipynb. Abstract: We investigate the mechanisms underlying factual knowledge recall in autoregressive transformer language models. First, we develop a causal intervention for identifying neuron activations capable of altering a model’s factual predictions. Within large GPT-style models, this reveals two distinct sets of neurons that we hypothesize correspond to knowing an abstract fact and saying a concrete word, respectively. This insight inspires the development of ROME, a novel method for editing facts stored in model weights. For evaluation, we assemble COUNTERFACT, a dataset of over twenty thousand counterfactuals and tools to facilitate sensitive measurements of knowledge editing. Using COUNTERFACT, we confirm the distinction between saying and knowing neurons, and we find that ROME achieves state-of-the-art performance in knowledge editing compared to other methods. An interactive demo notebook, full code implementation, and the dataset are available at https://rome.baulab.info/. submitted by /u/Quantum_Network [link] [comments]  ( 1 min )
    [D] A Brief History & The Current State of Anime Generating AI
    submitted by /u/farfromhome2020 [link] [comments]
    [N] Is Deepmind's "Gato" a precursor for general artificial intelligence? According to Gary Marcus, most certainly not.
    submitted by /u/much_successes [link] [comments]  ( 2 min )
    [D] If random search works so well, why is there no published paper that compares it to RL?
    This is not a sarcastic question! I'm reading the excellent series An Outsider's Tour of Reinforcement Learning and was very interested to find out that simple random search is actually a very good algorithm on a good variety of benchmarks. If simple random search works so well and, being fast, allows us to evaluate performance more thoroughly, why isn't there a more developed line of research exploring its application to optimal control? Why is there no paper saying "hey RL community, this is the new state of the art"? submitted by /u/GorillaWithGroove [link] [comments]  ( 2 min )
    [P] DALL·E Mega Website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    [D] Can I re-publish a Medium post later as an arXiv paper?
    Does anyone know about legal implications of either platform? Has anybody done a similar thing, e.g. for better citation? Thanks submitted by /u/ihatebeinganonymous [link] [comments]  ( 1 min )
    [R] Symphony Generation with Permutation Invariant Language Model
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] What is the best practice for adding new samples to the training set of a model once a new sample is discovered/acquired?
    Let’s say I have a trained model for which I used a training set T. If new samples become available, what is the best practice for using this new information to improve the model? My thoughts are: 1) create a new training set T’ that contains the new samples and retrain the model 2) fine-tune the trained model by training a few epochs on only the new samples Furthermore, if the new acquired samples are more important to the task than the ones in the original T, is there a way I can bias the training towards the new samples? submitted by /u/StOchastiC_ [link] [comments]  ( 2 min )
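    One common way to bias training towards the newer samples, whichever of the two options is chosen, is to draw minibatches with non-uniform sampling weights. The sketch below is only an illustration under assumptions: the helper name `make_weighted_sampler`, the `new_weight` value, and the index layout (old samples first, new samples after) are all hypothetical, and the returned indices would be fed to whatever training loop is already in use.

```python
import random


def make_weighted_sampler(n_old, n_new, new_weight=3.0, seed=0):
    """Return a function that samples minibatch indices with replacement.

    Indices 0..n_old-1 refer to the original set T, and indices
    n_old..n_old+n_new-1 refer to the newly acquired samples.
    Each new sample is `new_weight` times as likely to be drawn as an old one.
    """
    weights = [1.0] * n_old + [new_weight] * n_new
    rng = random.Random(seed)
    indices = range(n_old + n_new)

    def sample_batch(batch_size):
        # random.choices draws with replacement according to `weights`.
        return rng.choices(indices, weights=weights, k=batch_size)

    return sample_batch


# 900 old samples, 100 new ones, each new sample weighted 9x:
# about half of every batch then comes from the new samples.
sampler = make_weighted_sampler(n_old=900, n_new=100, new_weight=9.0)
batch = sampler(10000)
fraction_new = sum(i >= 900 for i in batch) / len(batch)
```

    Keeping the old samples in the mix, rather than training on the new ones alone, is a hedge against forgetting what the model learned from T; an equivalent alternative is to keep uniform sampling and multiply each sample's loss by its weight.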
    [R] Are there any publicly available models for non-autoregressive text generation?
    I am building a non-autoregressive text generation model. For evaluation I use the recently proposed MAUVE score. The problem is that all previous works use different metrics, such as BLEU/self-BLEU etc. To compare my model with the previous ones I want to evaluate the MAUVE score on their output. But I couldn't find any downloadable trained models for recent papers such as SUNDAE or ScratchGAN. On Hugging Face there are plenty of GPT-like models (autoregressive), but I don't know how to find non-autoregressive ones there. submitted by /u/Tomarchelone [link] [comments]  ( 1 min )
    [D] Open Source library to do automatic EDA + experiment tracking in a spreadsheet
    Hi All, We have released a new version of our Python library, VevestaX. The library does automatic EDA and experiment tracking in a spreadsheet. The library can be installed using: pip install vevestaX Following is the link to its demo: https://youtu.be/7jmnIOqBpJM Following is the GitHub link: https://github.com/Vevesta/VevestaX/blob/main/README.md Following is the sample output spreadsheet: https://docs.google.com/spreadsheets/d/15lOXzpcUQtkYQAEnx-YTegvg8zCW6pEK/edit?usp=sharing&ouid=103382336064969333270&rtpof=true&sd=true Please give us a GitHub star, it would mean the world to us. Please mail your feature requests to OP at vevestax@vevesta.com submitted by /u/vevesta [link] [comments]  ( 1 min )
  • Open

    Methods to guarantee stability of actor-critic based learning?
    Hi, coming from a control systems background, is there any mathematical trickery to prove/guarantee/99% guarantee the stability of a converged policy? I am thinking of something like DDPG to start with. submitted by /u/_Arrietty [link] [comments]  ( 1 min )
    Sampling a probabilistic action space for DQN
    I think it's relatively well understood why we wouldn't sample probabilistic values for actions taken during learning: policy learning is a regression problem, so a linear output of the feature layers (rather than an output activated with ReLU/softmax) is necessary. However, I want to know whether we can (or even should) be sampling the action space using softmax for non-learning steps. For example, during training the DQN cycle is: record current observation -> take action -> record next observation -> save transition, (s, a, r, s') + some termination indicator -> learn step (sample batch/mini-batch of transitions to learn from). In this case, what would happen if we took a probabilistic distribution of actions (e.g. softmax on the output of the Q-network) during exploration (the non-learning step)? Would this even do anything? (Correct me if I have misspoken anywhere, thank you) submitted by /u/Background-Cable-491 [link] [comments]  ( 1 min )
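    What the question describes is usually called softmax or Boltzmann exploration. A minimal sketch, assuming the Q-network still outputs plain linear Q-values and the softmax is applied only when picking an action (the function name and the `temperature` parameter are illustrative):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(q_values, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    action = rng.choice(len(probs), p=probs)
    return action, probs

# Example: Q-values for three actions as they might come out of the network.
action, probs = boltzmann_action([1.0, 2.0, 3.0], temperature=1.0,
                                 rng=np.random.default_rng(0))
```

    A high temperature approaches uniform random exploration, while a low temperature approaches the greedy action, so the temperature plays a role similar to epsilon in epsilon-greedy. The learning step itself is unchanged because the transitions are still stored and regressed on as before.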
    Loss doesn't decrease in Deep Q Learning
    I am training a DQL NN to learn to play tic-tac-toe against the optimal player. I am using a Huber Loss. It seems to work since the average reward increases with training, but the loss increases as well, which I was not expecting. Is this a clue that something is wrong or is it normal? submitted by /u/alesaso2000 [link] [comments]  ( 1 min )
    If we have a reward function that works perfectly for this, would adding another action such as steering have a significant effect on the possible optimal policy?
    If we have a working reward function, providing the desired behavior and optimal policy in a continuous action/state-space problem, would adding another action significantly affect the possible optimal policy? For example, assume you have an RL problem with an action space of 1 (de/acceleration), a state space of 2 (distance from position and velocity), and the agent is tasked to accelerate in a straight line from position a to b. Do you think the agent would behave majorly differently? I'm under the assumption that there would be minimal change aside from a longer training time, assuming enough exploration, as the task is still to move in a straight line and the agent would now only have to account for the steering action as well. submitted by /u/philori [link] [comments]  ( 1 min )
    Does anyone know good hardcoded Python sources for RL?
    As the title says, I would like to build my own RL agent, not with ready-to-build tools such as Stable-Baselines3, TensorFlow or PyTorch, but with hardcoded mathematical formulas. submitted by /u/GarantBM [link] [comments]  ( 1 min )
    Learning policies based on mix of algorithms
    Dear RL community, I am currently playing around with using mixed RL algorithms on the same DRL policy network. Thus I can make use of expert MDP trajectories (e.g. BC) and offline simulation training (e.g. off-policy SAC) as well as online improvement (e.g. on-policy PPO). While the algorithms differ greatly from each other, the resulting policy network tends to improve across stages, with a bump in performance during stage transitions. Have you tried something like this before? How did it turn out for you in comparison to single-algorithm approaches? submitted by /u/frugaleringenieur [link] [comments]  ( 1 min )
  • Open

    Artificial Intelligence in Social Media Enriches Marketing ROI
    Leveraging Machine Learning Models in Social Media Platforms: What Motivates Social Media Marketers? The intersections of marketing with artificial intelligence (AI) technology have opened up fascinating opportunities for marketers of all stripes. Stridently, AI has been making incessant inroads into social media marketing, and rightly so. As the number of social media users rises… Read More »Artificial Intelligence in Social Media Enriches Marketing ROI The post Artificial Intelligence in Social Media Enriches Marketing ROI appeared first on Data Science Central.  ( 4 min )
    Why Gato from Deepmind is a game changer
    This week, there was (yet another) game-changing announcement from the folks at Deepmind, named Gato. Gato is a cool cat 🙂 It leads us closer on the road to AGI because Gato is a transformer model that can do multiple things like caption images, chat with people, play games, etc. That means you train… Read More »Why Gato from Deepmind is a game changer The post Why Gato from Deepmind is a game changer appeared first on Data Science Central.  ( 3 min )
  • Open

    Managing Cloud Infrastructure Services With TPI Plugin For Terraform
    Terraform Provider Iterative, a new tool by Iterative for extending the capabilities of HashiCorp's Terraform, for optimizing machine learning engineers' cloud experience with automated cloud infrastructure tools: Iterative Introduces New ML Tool TPI Plugin For HashiCorp's Terraform Cloud Infrastructure Service TPI makes the Terraform for machine learning experience more consistent and less expensive with features including spot recovery improvements, automatic cleanup of unused resources, easy switching between cloud service vendors, and more. submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 1 min )
    Price recommendation for e-comm | methodology suggestions needed
    Hi everyone, Hope you are having a great weekend. I work for a small e-comm in my country as a data scientist. My first project (tight timeline) is to recommend price increases for products to give a small boost to revenue. I have fixed and curated all the transaction data that I might need but struggle with actually modelling the problem. Current approach I am working towards: 1. Build a log-log model for each product to identify inelastic products 2. Build the log-log model for each product at a daily aggregate level, with cross elasticity of other products 3. Feature space for each product model: volume of the product sold in the day as the dependent variable, and other product prices as a price index relative to the product, plus other seasonal features 4. Definition of inelasticity: a product is inelastic if an increase in price results in an overall increase in revenue (this is because I want to take into account the cross elasticity with other products and the impact of a price change of the parent product on the volume of unchanged products in the same category hierarchy) Challenges I am facing: 1. R2 is relatively okay, but I am finding it difficult to show back testing to the business to gain their confidence, since R2 is a metric they do not understand/trust 2. When building the model at daily level, not all products get sold on the same day, so there are a lot of missing cross-product price index features Any suggestions on improving the methodology will be very appreciated, and for saving me from this pinch I would love to offer my time in future for any collaboration if needed or if it helps in any way submitted by /u/jimmyiceman [link] [comments]  ( 1 min )
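    The single-product core of the log-log approach described above can be sketched as follows. This is a simplified illustration on synthetic data: it fits only the own-price elasticity, leaves out the cross-product price indices and seasonal features, and the function name and all numbers are assumptions.

```python
import numpy as np

def estimate_elasticity(prices, volumes):
    """Fit log(volume) = a + b * log(price) by least squares and return b."""
    X = np.column_stack([np.ones(len(prices)), np.log(prices)])
    coef, *_ = np.linalg.lstsq(X, np.log(volumes), rcond=None)
    return coef[1]

# Synthetic daily data generated with a true elasticity of -0.6 (made-up numbers).
rng = np.random.default_rng(0)
prices = rng.uniform(5.0, 15.0, size=200)
volumes = 1000.0 * prices ** -0.6 * np.exp(rng.normal(0.0, 0.05, size=200))

elasticity = estimate_elasticity(prices, volumes)

# Revenue is price * volume, proportional to price ** (1 + b),
# so a price increase raises revenue exactly when b > -1.
is_inelastic = elasticity > -1.0
```

    The revenue check in the last line matches the definition of inelasticity used in the post: a product qualifies when raising its price is expected to raise its own revenue.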
    I created a DALL·E Flow website
    submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    Microsoft Research Introduces i-Code: An Integrative and Composable Multimodal Machine Learning Framework
    Machine learning has long aimed to provide models with intelligence comparable to humans. Humans can automatically blend multiple sensory inputs like visual, linguistic, and acoustic signals to generate a complete knowledge of their surroundings by virtue of their intelligence. Even the most robust pre-trained AI models, in contrast to humans, are incapable of doing so, confining themselves to one or two input modalities. Researchers have always been interested in developing effective multimodal learning strategies to close this gap. In their new paper, the Microsoft Azure Cognitive Services Research team proposes a self-supervised pretraining framework named i-Code: An Integrative and Composable Multimodal Learning Framework. Paper: https://arxiv.org/pdf/2205.01818.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AGI Survey for my Ethics Class
    I created a four question survey on the topics of medical AI ethics and artificial empathy. Answers are anonymous and responses are appreciated, thank you! Artificial Therapists submitted by /u/VoltGe [link] [comments]
    Need advice, looking for a chatbot with predefined knowledge
    Hi, I am a game developer and I am thinking about making a game with a procedurally generated world. I'd like this world to have NPCs that the player can talk to. Trouble is I can't write the dialog for all characters if the world is procedural, so the content of the player-NPC conversation is different with each play. There are two ways to solve this. a) The normal way. I can write some dialog and have it be full of variables and variants that change to match what is procedurally generated. There are big drawbacks to this though. There will still be a lot of manual work, which means this solution doesn't scale well. If I realize I need two times more content, I will have to do two times more work, probably more, since even the previous content will get more complex with more new variables. b) A general solution that doesn't need the manual wiring for each conversation. I've seen some great stuff with GPT-3 and GPT-NeoX, but that seems like a different league entirely. My problem looks like this: The player engages in conversation with an NPC. The NPC, say a random trader, has some static behaviour, like prompting trade, or quests. But apart from that, the player has a bunch of predefined questions that he can ask the NPC. Such as: Who are you? What's new in town? Where is the local blacksmith? .. What I need to do is collect all the data that this NPC should know about, and have him interpret and understand the player's question. The NPC/AI recognizes that the player is asking for directions, asks the game to calculate response data (like: it's north from here, next to the tavern), and the NPC/AI then synthesizes human-like text containing the data as a response to the player. Is there a machine learning based solution that can help me implement the general solution to this problem? Thanks submitted by /u/Roggi44 [link] [comments]  ( 2 min )
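    The pipeline described in the post (recognize the intent, ask the game for the facts, then render a reply) can be sketched as below. This is only an outline of the architecture: the keyword matcher is a stand-in for a learned intent classifier, the templates are a stand-in for a text generator, and every name and data structure here is hypothetical.

```python
INTENT_KEYWORDS = {
    "identity": {"who", "you"},
    "news": {"new", "news", "town"},
    "directions": {"where", "blacksmith", "tavern"},
}

def classify_intent(question):
    """Pick the intent whose keywords overlap most with the question (stand-in for an ML model)."""
    words = set(question.lower().replace("?", " ").replace("'", " ").split())
    scores = {intent: len(words & kws) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def respond(question, npc, world):
    """Look up the facts the game knows and fill them into a reply template."""
    intent = classify_intent(question)
    if intent == "identity":
        return f"I am {npc['name']}, the {npc['role']}."
    if intent == "news":
        return f"Have you heard? {world['news']}"
    if intent == "directions":
        # The game, not the language component, supplies the location data.
        place = next((p for p in world["places"] if p in question.lower()), None)
        if place:
            return f"The {place} is {world['places'][place]}."
    return "I don't know anything about that."

# Hypothetical procedurally generated data.
npc = {"name": "Mira", "role": "trader"}
world = {
    "news": "The bridge to the east has collapsed.",
    "places": {"blacksmith": "north from here, next to the tavern"},
}
reply = respond("Where is the local blacksmith?", npc, world)
```

    Keeping the facts in game data and using the language component only for understanding and phrasing means the NPC cannot invent locations that do not exist in the generated world.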
    AI Researchers From Universidad Rey Juan Carlos, Spain Propose A New Method Of Contact Deformations Machine Learning For Real-Time Dynamic Simulation
    The modeling of touch and deformations has piqued computer graphics’ interest since it allows computer-generated models of persons and their surroundings to come to life. Despite significant advances in the domain, scientists still struggle to replicate high-resolution contact at interactive speeds. Many researchers are looking into ways to incorporate machine-learning approaches to model contact-driven deformations, inspired by their success in modeling self-driven deformations or deformations that occur due to an object’s motion. The methods developed so far learn rich nonlinear deformations as a function of the subspace state by using a subspace representation of the deformable object. However, the ML algorithms modeling contact deformation either simulate only smooth global contact responses or exhibit extremely restricted 3D interactions. According to the researchers, the previously developed models have some limitations. Deformations are modeled in an object-centric way, which is a good choice for self-driven deformations since it is smooth with regard to the object’s subspace state, and machine learning achieves high generalization even from sparse data. Contact-driven deformations, on the other hand, are not smooth with respect to the object’s state. Therefore, machine learning of these deformations would necessitate intensive sampling of the object’s subspace state. This is problematic because the configuration space is huge and difficult to cover. Paper: http://mslab.es/projects/ContactCentricLearning/contents/Romero_SIG2022_final.pdf Github: http://mslab.es/projects/ContactCentricLearning/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Ai background remover tool
    submitted by /u/Alive_Ad_2882 [link] [comments]
  • Open

    NEUROMORPHIC COMPUTING WILL NEED PARTNERS TO BREAK INTO THE DATACENTER
    submitted by /u/nnnaikl [link] [comments]

  • Open

    [P] How to do multivariate time series classification using C# and either the Accord.NET or Encog libraries?
    I have a time series based on financial security prices with additional features. I wish to feed this series into some ML construct in order to perform multi-class classification. Most of the solutions that I found in my search offer predictions. I am not interested in predicting future prices. I merely wish to train the ML construct to offer the most likely class for the given time-series input frame. I am looking for C# solutions or links to tutorials that use either the Accord.NET, Encog or ML.NET libraries. I would be most appreciative for answers that lead my eyes to view C# code that demonstrates a solution to my question. In lieu of the above, I would also appreciate a description of the types of ML constructs that would satisfy my requirements. I have no interest in Python solutions. Please, do not chastise me or praise Python. I need the code to be in C# so that it easily integrates into existing code. Thank you. Edit: I forgot to include ML.NET. submitted by /u/LeftShoeHighway [link] [comments]  ( 1 min )
    [P] I need help finding an AI that tells you what sports career is best for you.
    You ask a set of questions and it will use your answers to tell you what sports career is best for you. I have been having a hard time finding it online. Can anyone lend a hand? It would be greatly appreciated. submitted by /u/Texidork [link] [comments]  ( 1 min )
    [D] AI stocks
    The advances in AI over the last two years have been mind-blowing. I have taken some part-time ML classes just to try to get a grasp. And wow, I'm impressed by what someone like me can do with low coding skills but a high willingness to learn. I have already built a recommendation model to improve my policy paragraphs based on input text from relevant research articles. I have to admit that Grammarly and QuillBot beat my hobby project, but it was a fun run and I have gotten a ton of experience. One takeaway is that AI is clearly the future, and I want to place some of my investments in AI as a sector. Buy and hold for the future. The stock market is plummeting and will probably continue to do so for a while, but I want to start researching my options. Can you guys share your knowledge of tradable businesses, either pure AI companies or parent companies with control? I'm all for responsible trading, but feel free to share uncertain yolo companies as well. submitted by /u/sikkerhetellersafety [link] [comments]  ( 1 min )
    [P] I made an open-source demo of OpenAI's CLIP model running completely in the browser - no server involved. Compute embeddings for (and search within) a local directory of images, or search 200k popular images from Reddit (as shown in this video). Link to demo and Github repo in comments.
    submitted by /u/joerocca [link] [comments]  ( 2 min )
    Which Alg to use? [R]
    Hey, so I am taking a few images and want to use them to train a model to predict how closely a test image matches the training images. Would using a CNN be my best bet, or using Haar cascade classifiers? Any other thoughts? I am doing this in Python on Google Colab. Thanks! submitted by /u/Cloverdover1 [link] [comments]  ( 1 min )
    [Project] Volunteers Needed for Ukraine Project
    We are recruiting volunteers for a project that will help Ukraine. This is a data-oriented project, and we can use all the help we can get. We want to work very intensely on this project so we can release it quickly. To join us and help Ukraine, please reach out to [breaker25789@gmail.com](mailto:breaker25789@gmail.com) with your name, email, and the team you are interested in. Data Team · No prior skills necessary. New volunteers will receive training in identifying soldiers and military equipment upon joining our team. · This role takes a minimum of five (5) hours a week. · Minimum Age: 18+ · CONTENT WARNING: The primary role of a member of the Data Team is to directly interact with photos and videos from the war in Ukraine, which often contain graphic images of violence and death. Machine Learning Team · Each volunteer needs to be able to dedicate a minimum of ten (10) hours a week. · Preferred prior experience includes familiarity with Docker, AWS SageMaker and S3, machine learning attacks, machine learning security, dedicated red team work, and/or data science. · Minimum Age: 18+ · CONTENT WARNING: Individuals directly involved in training certain algorithms will be exposed to photos and videos from the war in Ukraine, which often contain graphic violence and death. Please notify us if you would prefer to not see that content. submitted by /u/OttersAreDevilSpawn [link] [comments]  ( 2 min )
    [D] Research Director at Deepmind says all we need now is scaling
    submitted by /u/SnoozeDoggyDog [link] [comments]  ( 5 min )
    [P] Image Fusion Techniques for Image classification Task
    Can anyone recommend sources on image fusion techniques (preferably for RGB and near-infrared images) for image classification tasks. submitted by /u/Antman-007 [link] [comments]
    [D] Taking derivative of Expectation with respect to Phi (Variational Inference)
    The snippet below from page 20 of the paper here mentions that the derivative cannot be taken inside the expectation because the expectation is a function of phi. https://preview.redd.it/jo4e2nt0lgz81.png?width=1148&format=png&auto=webp&s=1f3acb968a07aa3858657770c12689031105a03d However, the paper here (page 3) from the same author shows the score function estimator being used to estimate a gradient that takes the derivative inside the expectation even when the expectation is a function of phi. Highlighted in the snippet below: https://preview.redd.it/fsv8s272lgz81.png?width=940&format=png&auto=webp&s=c5ad403563d1f9c8048f95a413eb188c35e785c3 I am unable to understand the discrepancy. Could anyone please help me get insight on this? I feel that I am missing something. submitted by /u/That-Mud3051 [link] [comments]  ( 1 min )
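    One possible reading, offered without access to the two snippets: the score function estimator does not simply move the derivative onto the integrand. It uses the identity grad_phi E_{q_phi(z)}[f(z)] = E_{q_phi(z)}[f(z) * grad_phi log q_phi(z)], so the dependence of the sampling distribution on phi is carried by the extra log-density term. The sketch below checks this numerically on an assumed toy case, z ~ N(mu, 1) with f(z) = z^2, where the exact gradient is 2 * mu.

```python
import numpy as np

def score_function_gradient(mu, n_samples=200_000, seed=0):
    """Estimate d/dmu E_{z ~ N(mu, 1)}[z^2] with the score function estimator."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, 1.0, size=n_samples)
    f = z ** 2
    # For a unit-variance Gaussian, d/dmu log N(z; mu, 1) = z - mu.
    score = z - mu
    return np.mean(f * score)

mu = 1.5
grad_estimate = score_function_gradient(mu)
# Analytic check: E[z^2] = mu^2 + 1, so the exact gradient is 2 * mu.
grad_exact = 2.0 * mu
```

    Naively differentiating f inside the expectation would give zero here, because f does not depend on mu at all, which illustrates why the plain interchange fails while the score function form works.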
    [D] Best resources to keep up with latest machine learning research
    I am an 'applied' machine learning researcher, i.e. about 80% of my time is on machine learning, 20% is applying it to physics problems. The increasing breadth and depth of new machine learning research is awe-inspiring. I would like to be able to keep up with the newest developments in the field, somehow, without obviously having the time to read all the latest developments. Is there a website, a resource (like weekly or monthly magazines), or community aimed at collating the newest insights and directions and publishing summaries/overviews in digestible formats? submitted by /u/intheprocesswerust [link] [comments]  ( 2 min )
  • Open

    Gato & AGI doubts
    Just read https://thenextweb.com/news/deepminds-astounding-new-gato-ai-makes-fear-humans-will-never-achieve-agi? I didn't read the whole article but based off the title I'm guessing the author of the article doubts AGI will happen because of what he sees with Gato? Why would he think that? Maybe I'm missing something submitted by /u/Ashamed-Asparagus-93 [link] [comments]  ( 1 min )
    Artificial Intelligence Books: These 10 Sci-Fi Novels You Must Read
    submitted by /u/much_successes [link] [comments]  ( 1 min )
    New ML tool to help data scientists manage cloud workloads with Terraform
    submitted by /u/thumbsdrivesmecrazy [link] [comments]
    GALLERY
    submitted by /u/cookingandcraft [link] [comments]
    Artificial Intelligence Implications: The Future of Formula One | This is my second post in a university blog series surrounding AI, The Future, and a personal area of interest! Would love some feedback and advice for future episodes!
    submitted by /u/RvZz11 [link] [comments]  ( 1 min )
    Seeking the perfect song recommendation method
    I hope you are doing as fine as you can. I have been researching solutions to a problem for a few months now and I need your valuable help today. Problem: Generate a playlist based on multiple users' music tastes. The goal is that the playlist suits everyone, at the highest level of satisfaction for every user. Research path: Spotify and recommendation engines. I arrived at the following conclusion: the best way to achieve it is to build a music recommendation engine, but of course, you don't want to create it by yourself because it will need a huge amount of data that I can't provide nor find, so I need to work with a third-party service that already has this data and recommendation engine, and it seems like there are not many companies that offer that service and that'…  ( 3 min )
  • Open

    Sampling with replacement until you’ve seen everything
    Suppose you have a standard deck of 52 cards. You pull out a card, put it back in the deck, shuffle, and pull out another card. How long would you expect to do this until you’ve seen every card? Here’s a variation on the same problem. Suppose you’re a park ranger keeping data on tagged […] Sampling with replacement until you’ve seen everything first appeared on John D. Cook.  ( 3 min )
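    The card question is the classic coupon collector problem: with n equally likely items, the expected number of draws with replacement needed to see all of them is n * H_n, where H_n = 1 + 1/2 + ... + 1/n. A short sketch computing this exactly (the function name is illustrative):

```python
from fractions import Fraction

def expected_draws_to_see_all(n):
    """Expected draws with replacement until all n items have appeared: n * H_n."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

# Standard deck of 52 cards; converted to float for readability.
expected_cards = float(expected_draws_to_see_all(52))
```

    For a 52-card deck this comes to roughly 236 draws, far more than 52, because the last few unseen cards take a long time to turn up.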
2022-06-13T01:00:55.882Z osmosfeed 1.14.4